Vertical Scaling (Scale Up): Add more power (CPU, RAM, SSD) to a single machine, e.g., upgrading from an 8-core to a 32-core server.
Trade-off: hardware limit — one machine can only get so big
Horizontal Scaling (Scale Out): Add more machines (nodes) to distribute data and load.
Trade-off: software complexity — requires data distribution and coordination
Scalability → how to spread data/work across multiple machines
Reliability → how to avoid data loss or downtime
Performance → how to parallelize reading/writing/querying
To achieve these goals, distributed data systems all rely on the same building blocks:
Partitioning (or Sharding) → divide the data or work (see the hash-partitioning sketch after this list)
Replication → duplicate data for reliability
Cluster + Node architecture → machines working together as one logical system
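To make the partitioning building block concrete, here is a minimal hash-partitioning sketch in Python (the key names and the partition count are invented for illustration, not taken from any particular system): each key is deterministically mapped to exactly one partition, so different nodes can own different subsets of the keys.

```python
# Minimal hash-partitioning sketch (illustrative only): every key maps to
# exactly one of NUM_PARTITIONS partitions, so each node can own a subset.
import hashlib

NUM_PARTITIONS = 3  # assumption: three partitions, P0..P2

def partition_for(key: str) -> int:
    """Deterministically assign a key to a partition (same key -> same partition)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for key in ["user-42", "user-7", "order-1001"]:
    print(f"{key} -> P{partition_for(key)}")
```

Kafka, Cassandra, and Elasticsearch each apply this same idea, though every system uses its own hash function and routing rules.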
Cluster: A group of servers (nodes) working together as one logical system. [Kafka, Spark, Cassandra, Elasticsearch]
Node: A single machine or instance in the cluster. [Kafka, Spark, Cassandra, Elasticsearch]
Partitioning: Splitting data/work into smaller pieces so each node handles a subset. [Kafka (topics), Cassandra (tables), Elasticsearch (indices), Spark (RDDs)]
Replication: Making copies of data to ensure fault tolerance. [Kafka, Cassandra, Elasticsearch]
Sharding: Usually synonymous with partitioning (especially for databases). Sometimes refers to a physical division of data across machines. [Cassandra, Elasticsearch]
Leader/Follower: One node handles writes for a partition (leader), and the others replicate it (followers); see the sketch below. [Kafka, Elasticsearch; Cassandra instead uses leaderless, peer-to-peer replication]
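A rough sketch of the leader/follower write path (the Node and Partition classes and method names are invented for illustration; this is not any specific system's API): a write goes to the partition's leader, and the followers copy it.

```python
# Leader/follower sketch (invented classes, not a real system's API):
# the leader accepts a write and the followers copy it into their logs.
class Node:
    def __init__(self, name: str):
        self.name = name
        self.log = []  # replicated write log: list of (key, value) pairs

class Partition:
    def __init__(self, leader: Node, followers: list):
        self.leader = leader
        self.followers = followers

    def write(self, key: str, value: str) -> None:
        self.leader.log.append((key, value))    # 1. leader records the write
        for follower in self.followers:         # 2. followers replicate it
            follower.log.append((key, value))

a, b, c = Node("A"), Node("B"), Node("C")
p0 = Partition(leader=a, followers=[b, c])
p0.write("user-42", "clicked")
print([node.log for node in (a, b, c)])  # the write exists on all three nodes
```

Real systems differ in when a write is acknowledged (after the leader alone, or only after some or all followers have replicated it), which trades latency against durability.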
Partitioning (or sharding): Splitting the dataset into smaller, non-overlapping parts (partitions) that can be distributed across multiple nodes for scalability and parallelism.
Replication: Copying each partition to multiple nodes so that if one node fails, data is still available from another node — ensures reliability and fault tolerance.
👉 Partitioning → Split across nodes (scale) → Horizontal Scaling
👉 Replication → Copy across nodes (protect) → Fault Tolerance
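A minimal sketch combining the two ideas, assuming a hypothetical 3-node cluster with 3 partitions and a replication factor of 2 (all values are made up for illustration): partitioning decides which partition a key belongs to, and replication decides how many nodes store a copy of each partition.

```python
# Sketch: partitioning splits keys across partitions, replication copies each
# partition onto several nodes (node list, partition count, RF are assumptions).
NODES = ["A", "B", "C"]
NUM_PARTITIONS = 3
REPLICATION_FACTOR = 2  # every partition is stored on 2 of the 3 nodes

def replicas_for(partition: int) -> list:
    """Place a partition on REPLICATION_FACTOR consecutive nodes (round-robin)."""
    return [NODES[(partition + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

for p in range(NUM_PARTITIONS):
    print(f"P{p} -> nodes {replicas_for(p)}")
# P0 -> ['A', 'B'], P1 -> ['B', 'C'], P2 -> ['C', 'A']
```

This simple round-robin placement produces the layout used in the example below: P0 on nodes A and B, P1 on B and C, P2 on C and A.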
Each partition (P0, P1, P2) contains a distinct subset of data.
→ Enables parallelism and scalability.
Each partition is replicated to another node.
→ Ensures fault tolerance.
If Node B fails: Nodes A and C still hold replicas of every partition, so all data remains available (see the failover sketch below).
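A short failover sketch under the same assumed layout (replicas_for mirrors the placement sketch above; the node names and read function are illustrative): a read tries each replica of a partition in order and skips nodes that are down, so losing Node B does not lose any data.

```python
# Failover sketch: a read tries each replica of a partition in turn and skips
# nodes that are down. Placement mirrors the P0/P1/P2 layout described above.
NODES = ["A", "B", "C"]
REPLICATION_FACTOR = 2

def replicas_for(partition: int) -> list:
    return [NODES[(partition + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

def read(partition: int, failed_nodes: set) -> str:
    for node in replicas_for(partition):
        if node not in failed_nodes:
            return f"P{partition} served by node {node}"
    raise RuntimeError(f"P{partition} unavailable: all replicas are down")

failed = {"B"}  # Node B has failed
for p in range(3):
    print(read(p, failed))
# P0 served by node A, P1 served by node C, P2 served by node C
```

With a replication factor of 2 on three nodes, the cluster tolerates one node failure; losing a second node could make some partitions unavailable.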