How to Read This Post
Each scenario shows a diagram first, then a short note on why the pattern matters. Complexity increases as you scroll.
| Strategy | Category | Strength | Trade-off |
|---|---|---|---|
| Leader-Follower | Replication | Simple, strong consistency for reads | Write bottleneck at leader |
| Multi-Leader | Replication | Multi-DC writes | Conflict resolution needed |
| Leaderless (Quorum) | Replication | No single point of failure, tunable consistency | Stale reads possible; needs read repair |
| Range Partitioning | Partitioning | Range scans, sorted data | Hotspot risk |
| Hash Partitioning | Partitioning | Even distribution | No range queries |
| Consistent Hashing | Partitioning | Minimal redistribution | Virtual nodes needed |
Level 1 — Foundations
1. Single Node Storage
*Diagram: three clients pointing at a single database. When it crashes, the data is lost and all clients are blocked.*
Single point of failure. One database handles everything — simple but fatal. If it crashes, all data is lost and all clients are blocked. Every distributed storage pattern exists to solve this problem.
2. CAP Theorem
*Diagram: "Pick two during a partition." CP (consistency + partition tolerance, rejects writes during a partition): HBase, MongoDB, etcd, ZooKeeper, Redis Cluster. AP (availability + partition tolerance, accepts writes and resolves conflicts later): Cassandra, DynamoDB, CouchDB, Riak. CA (consistency + availability, only without partitions): single-node PostgreSQL, single-node MySQL.*
CAP theorem. During a network partition, you must choose: consistency (all nodes see the same data) or availability (every request gets a response). CA systems don’t exist in distributed form — partitions always happen eventually.
3. Read/Write Trade-offs
*Diagram: workload decision tree. Read-heavy (~90% reads): add read replicas — one leader plus N followers scales reads horizontally. Write-heavy (~90% writes): partition data across nodes, sharding by key so each shard handles its own writes. Mixed: partition for writes, replicate each partition for reads.*
Design for your workload. Read-heavy? Add replicas. Write-heavy? Partition. Most real systems need both — partition data for write throughput, replicate each partition for read scaling and durability.
Level 2 — Replication
4. Leader-Follower Replication
*Diagram: one leader (read-write) streams replication to followers (read-only) that serve client reads; a standby follower is promoted to leader if the leader fails.*
Leader-follower (master-slave). All writes go through a single leader. Followers replicate the write-ahead log and serve reads. PostgreSQL, MySQL, and MongoDB all support this. Simple, but the leader is the write bottleneck and a failover target.
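The routing rule is simple enough to sketch in a few lines. This is a minimal illustration, not a real driver — the node names are hypothetical, and a production client (or a pooler like PgBouncer in front of PostgreSQL) would also health-check nodes and track replication lag:

```python
import random

class LeaderFollowerRouter:
    """Route writes to the single leader, load-balance reads across followers."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers

    def route(self, operation):
        # All writes must go through the one leader.
        if operation in ("INSERT", "UPDATE", "DELETE"):
            return self.leader
        # Reads can go to any read-only follower.
        return random.choice(self.followers)

router = LeaderFollowerRouter("pg-leader", ["pg-follower-1", "pg-follower-2"])
print(router.route("INSERT"))   # always "pg-leader"
print(router.route("SELECT"))   # one of the followers
```

Note what this sketch ignores: replication lag means a follower may serve a read that predates your own write — "read your own writes" requires routing those reads to the leader.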
5. Synchronous vs Asynchronous Replication
*Sequence diagram: with synchronous replication, the leader waits for follower acks before confirming the write — higher latency (at least one network RTT). With asynchronous replication, the leader writes to its local WAL and responds immediately — lower latency — then replicates in the background; followers may lag, and writes are lost if the leader crashes first.*
Sync vs async — the durability-latency trade-off. Synchronous replication guarantees no data loss but adds network latency to every write. Asynchronous returns immediately but followers can lag — if the leader crashes before replication, writes are lost. Semi-synchronous (ack from at least one follower) is the common middle ground.
6. Multi-Leader Replication
*Diagram: three datacenter leaders, each with local followers, replicating to one another asynchronously. Concurrent writes to the same row collide and need conflict resolution: last-write-wins, CRDTs, or custom merge logic.*
Multi-leader replication. Each datacenter has its own leader that accepts writes locally with low latency. Leaders replicate to each other asynchronously. The catch: two leaders can modify the same row simultaneously — you need a conflict resolution strategy (last-write-wins, CRDTs, or application-level merging).
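The simplest resolver, last-write-wins, picks the version with the highest timestamp. A sketch to make the trade-off concrete — note that LWW silently discards the losing write, which is exactly why CRDTs and application-level merges exist:

```python
def last_write_wins(versions):
    """Resolve conflicting multi-leader writes by keeping the newest timestamp.

    versions: list of (timestamp, value) pairs for the same row,
    one from each leader that accepted a concurrent write.
    """
    # The losing values are dropped -- acceptable for some data
    # (e.g. "last seen" fields), silent data loss for others.
    return max(versions, key=lambda v: v[0])[1]

# Two leaders updated the same row concurrently:
print(last_write_wins([(1710000000, "alice@dc1"), (1710000005, "alice@dc2")]))
# -> alice@dc2
```

LWW also assumes comparable clocks across datacenters; with clock skew, a "newer" write can lose, which is another reason systems like Cassandra document LWW's caveats so heavily.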
7. Leaderless Replication (Quorum)
*Sequence diagram (N=3, W=2, R=2 — W + R > N ensures overlap): a write of x=5 (v2) succeeds with 2 of 3 ACKs even though one node is temporarily unavailable. A read queries all three nodes; two return x=5 (v2), one returns the stale x=3 (v1). The client returns the latest version, then read repair writes v2 to the stale node so all replicas converge.*
Leaderless quorum reads/writes (Dynamo-style). Write to W nodes, read from R nodes. If W + R > N, at least one node in the read set has the latest write. The client picks the newest version and repairs stale nodes. Cassandra, DynamoDB, and Riak use this approach.
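The read path — collect responses, require R of them, return the newest version, flag stale replicas for repair — fits in a short sketch. Function and version scheme are illustrative, not any real client library's API:

```python
def quorum_read(replica_values, r):
    """Dynamo-style quorum read.

    replica_values: {node: (version, value)} for replicas that responded.
    Returns the newest value plus the list of stale nodes to read-repair.
    """
    if len(replica_values) < r:
        raise RuntimeError("quorum not reached")
    # With W + R > N, at least one responder holds the latest write,
    # so the max version among responses is the newest committed value.
    latest_version, latest_value = max(replica_values.values())
    # Read repair: rewrite the latest value to any lagging replica.
    stale = [n for n, (v, _) in replica_values.items() if v < latest_version]
    return latest_value, stale

value, to_repair = quorum_read(
    {"N1": (2, "x=5"), "N2": (2, "x=5"), "N3": (1, "x=3")}, r=2)
print(value, to_repair)   # x=5 ['N3']
```

Real systems use vector clocks or last-write-wins timestamps rather than a single integer version, but the overlap argument is the same.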
Level 3 — Partitioning
8. Range-Based Partitioning
*Diagram: a router sends keys A-F (apple, banana, cherry) to node 1, G-M (grape, kiwi, lemon) to node 2, and N-Z (orange, peach, tomato) to node 3. Hotspot problem: time-series keys like 2024-03-01, 2024-03-02, ... are sequential, so all writes hit the same partition.*
Range partitioning. Keys are split into contiguous ranges — simple and supports efficient range scans (SELECT * WHERE key BETWEEN 'A' AND 'F'). The danger: sequential keys (timestamps, auto-increment IDs) create hotspots where all writes hammer a single partition.
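Routing a key to its range is a binary search over sorted split points. A minimal sketch with hypothetical node names, using the boundaries from the diagram above:

```python
import bisect

# Upper bounds of each range (exclusive): keys before "G" -> node-1,
# "G".."M" -> node-2, "N" and after -> node-3.
BOUNDS = ["G", "N"]
NODES = ["node-1", "node-2", "node-3"]

def range_partition(key):
    """Binary-search the sorted boundary list for the node owning this key."""
    return NODES[bisect.bisect_right(BOUNDS, key[0].upper())]

print(range_partition("apple"))    # node-1
print(range_partition("kiwi"))     # node-2
print(range_partition("orange"))   # node-3
```

Because neighboring keys land on the same node, a range scan touches one (or few) partitions — and, for the same reason, sequential inserts hammer one partition. A common mitigation is prefixing the key with something non-sequential (a hash bucket or tenant ID).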
9. Hash-Based Partitioning
*Diagram: with 3 nodes, hash(key) mod 3 places alice (hash 7 → 7 mod 3 = 1) on node 1, bob (hash 12 → 0) on node 0, and charlie (hash 5 → 2) on node 2. After adding a fourth node, hash mod 4 moves alice to node 3 and charlie to node 1; only bob stays put. Adding one node reshuffles roughly 75% of all keys.*
Hash mod N partitioning. Hash the key, mod by node count — even distribution, no hotspots. But adding or removing a node changes the mod, reshuffling most keys. This triggers massive data migration. Consistent hashing solves this.
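You can measure the reshuffle directly. A sketch that hashes 1,000 synthetic keys with 3 nodes and again with 4, then counts how many keys changed owners (MD5 is used only because Python's built-in `hash()` is salted per process):

```python
import hashlib

def node_for(key, num_nodes):
    """Stable hash-mod-N placement."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"user-{i}" for i in range(1000)]
before = {k: node_for(k, 3) for k in keys}
after = {k: node_for(k, 4) for k in keys}

moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")   # typically around 75%
```

The intuition: a key stays put only when `h mod 3 == h mod 4`, which happens for roughly 1 in 4 keys — the other ~75% must migrate, which is the mass data movement consistent hashing avoids.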
10. Consistent Hashing
*Diagram: nodes A (90°), B (210°), and C (330°) sit on a hash ring; key1 (45°) maps to A, key2 (150°) to B, key3 (270°) to C — each key goes to the next node clockwise. Adding node D at 180° moves only key2 (from B to D); A keeps key1 and C keeps key3. Only ~1/N of the keys move.*
Consistent hashing. Nodes and keys are placed on the same hash ring. Each key maps to the next node clockwise. Adding or removing a node only moves ~1/N of the keys — not all of them. DynamoDB, Cassandra, and Memcached all use consistent hashing.
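A minimal ring is a sorted list of (hash, node) pairs plus a binary search for "next position clockwise". This sketch omits virtual nodes (covered next) and replication:

```python
import bisect
import hashlib

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring: a key maps to the next node clockwise."""

    def __init__(self, nodes):
        # Ring = node positions sorted by hash value.
        self.ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key):
        h = _hash(key)
        # First ring position at or after the key's hash; wrap past the top.
        i = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-A", "node-B", "node-C"])
print(ring.node_for("key1"))
```

The key property: when you add a node, any key that moves can only move *to the new node* — everything else keeps its owner, because no existing arc boundaries change.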
Level 4 — Advanced Patterns
11. Consistent Hashing with Virtual Nodes
*Diagram: with only physical nodes on the ring, ownership is skewed — node A owns 60%, node B 15%, node C 25%. With 4 virtual nodes each (A-1 through A-4, B-1 through B-3, C-1 through C-3 interleaved around the ring), each physical node ends up owning ~33%.*
Virtual nodes. With only physical nodes on the ring, distribution is uneven. Virtual nodes place each physical node at many positions on the ring (e.g., 256 vnodes per server), which smooths out the distribution. Cassandra configures this via its num_tokens setting.
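Extending the ring sketch with vnodes is a one-line change: hash each node name several times with a suffix. Node names and vnode count here are illustrative:

```python
import bisect
import hashlib
from collections import Counter

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=128):
    """Place each physical node at `vnodes` positions on the ring."""
    return sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

def node_for(ring, key):
    i = bisect.bisect(ring, (_hash(key), "")) % len(ring)
    return ring[i][1]

ring = build_ring(["A", "B", "C"], vnodes=128)
load = Counter(node_for(ring, f"key-{i}") for i in range(3000))
print(load)   # each node owns roughly a third of the keys
```

More vnodes means smoother balance (relative skew shrinks roughly with the square root of the vnode count) but a bigger ring to search and more metadata to gossip, which is why defaults sit in the tens-to-hundreds range rather than thousands.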
12. Replication + Partitioning Combined
*Diagram: six partitions (P1-P6) spread over four nodes, each partition replicated three times with one designated leader — e.g., node 1 holds P1 (leader) plus replica copies of P2 and P6; node 4 leads P4, P5, and P6 and holds replicas of P1 and P3. If node 2 dies, P1, P2, and P3 still have copies on surviving nodes — no data lost.*
Partition + replicate. Data is split into partitions for write throughput, then each partition is replicated across N nodes for durability. Each partition has a leader for writes and followers for reads. Kafka, CockroachDB, and Cassandra all combine partitioning with replication.
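A simple way to generate such a layout is round-robin placement: partition p's leader is node p mod N, and its followers are the next RF−1 nodes. This is a sketch of the general idea, not any particular system's assignor:

```python
def place_partitions(num_partitions, nodes, replication_factor):
    """Round-robin leader/follower placement.

    Spreads leaders evenly (every node leads some partitions) and puts
    each partition's replicas on replication_factor distinct nodes.
    """
    n = len(nodes)
    assert replication_factor <= n, "need at least RF nodes"
    assignment = {}
    for p in range(num_partitions):
        replicas = [nodes[(p + i) % n] for i in range(replication_factor)]
        assignment[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return assignment

layout = place_partitions(6, ["n1", "n2", "n3", "n4"], replication_factor=3)
print(layout[0])   # {'leader': 'n1', 'followers': ['n2', 'n3']}
```

Spreading leaders matters as much as spreading replicas: leaders take all the write traffic for their partitions, so a node leading too many partitions becomes the new bottleneck.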
13. Rack-Aware Data Placement
*Diagram: three racks with two nodes each. Partition A's leader sits on rack 1 with replicas on racks 2 and 3; partition B leads on rack 2 with replicas on racks 1 and 3; partition C leads on rack 3 with replicas on racks 1 and 2. If rack 1 loses power, every partition still has copies on racks 2 and 3 — all partitions survive.*
Rack-aware placement. Never put all replicas of a partition on the same rack, zone, or region. If a rack loses power, every partition should still have a majority of replicas alive in other racks. HDFS, Cassandra, and Kafka all support rack-aware replica placement.
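The core rule — cycle through racks before reusing one — can be sketched in a few lines. This is a deterministic toy placer (rack and node names hypothetical), not HDFS's or Kafka's actual algorithm:

```python
def rack_aware_replicas(racks, replication_factor):
    """Pick replicas by cycling through racks, so the first len(racks)
    replicas all land on different racks.

    racks: {rack_name: [node, ...]}
    """
    rack_names = sorted(racks)
    used_per_rack = {r: 0 for r in rack_names}
    replicas = []
    i = 0
    while len(replicas) < replication_factor:
        rack = rack_names[i % len(rack_names)]
        nodes = racks[rack]
        # Within a rack, cycle through its nodes in order.
        replicas.append(nodes[used_per_rack[rack] % len(nodes)])
        used_per_rack[rack] += 1
        i += 1
    return replicas

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(rack_aware_replicas(racks, 3))   # ['n1', 'n3', 'n5'] -- one per rack
```

With RF=3 and three racks, losing any single rack leaves two of three replicas alive — enough for a majority quorum, which is the property rack-aware placement is really buying you.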
TL;DR — When to Use What
| Scenario | Strategy | Why |
|---|---|---|
| Read-heavy, single region | Leader-Follower | Simple, add read replicas as needed |
| Multi-region writes | Multi-Leader | Low-latency local writes |
| Maximum availability | Leaderless Quorum | No single point of failure |
| Sorted range queries | Range Partitioning | Efficient scans, co-located data |
| Even distribution | Hash Partitioning | No hotspots, but no range queries |
| Elastic scaling | Consistent Hashing | Minimal data movement on resize |
| Production systems | All of the above | Partition + replicate + rack-aware |
- Starting simple? → Leader-Follower replication. It covers 80% of use cases.
- Outgrowing a single writer? → Partition your data with consistent hashing.
- Going multi-region? → Multi-Leader or leaderless with tunable consistency.
- Already at scale? → You’re combining all of these. The real skill is knowing which dials to turn.