## How to Read This Post
Each scenario shows a diagram first, then a short note on why the pattern matters. Complexity increases as you scroll.
| Algorithm | Approach | Strength | Trade-off |
|---|---|---|---|
| Raft | Leader-based | Understandable, widely adopted | Leader bottleneck |
| Paxos | Proposer-based | Formally proven, flexible | Notoriously hard to implement |
| Multi-Paxos | Optimized Paxos | Fewer round-trips with stable leader | Complex recovery |
| ZAB | Leader-based | Ordered broadcast (ZooKeeper) | Tightly coupled to ZK |
## Level 1 — Why Consensus?

### 1. The Agreement Problem

[Diagram: a coordinator sends SET x=5 to Nodes A, B, and C. Nodes A and B store x=5; Node C has crashed. With consensus, all nodes still agree on x=5 despite the failure; without it, state diverges and split brain becomes possible.]
The core problem. Distributed nodes must agree on a single value even when some nodes crash or messages are delayed. Without consensus, you get split-brain — two parts of the system believing different things.
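Raft and Paxos both prevent split brain the same way: decisions require a majority, and any two majorities of the same cluster must overlap. A minimal sketch (function names are illustrative):

```python
def majority(cluster_size: int) -> int:
    """Smallest quorum that forms a strict majority."""
    return cluster_size // 2 + 1

def quorums_can_be_disjoint(cluster_size: int) -> bool:
    """Could two majority quorums exist with no node in common?"""
    return 2 * majority(cluster_size) <= cluster_size

# In a 5-node cluster a quorum is 3 nodes; two disjoint quorums would
# need 6 nodes, so two "leaders" can never both commit: no split brain.
```

The overlap guarantee is the whole trick: whatever one majority decided, at least one member of any later majority remembers it.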
### 2. The FLP Impossibility

[Diagram: "pick any two" of Asynchronous Network, Fault Tolerance, and Deterministic Termination. Raft and Paxos pick async + fault-tolerant (may not terminate; timeouts in practice). 2PC picks async + deterministic (blocks on failure; a coordinator crash leaves it stuck). Synchronous protocols pick fault-tolerant + deterministic (require a synchronous network, unrealistic in practice).]
Fischer, Lynch, and Paterson (1985). No deterministic algorithm can guarantee consensus in an asynchronous system with even one faulty process. Practical systems like Raft use randomized timeouts to work around this theoretical limit.
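Raft's workaround can be stated in one function: never retry elections on a fixed schedule. A sketch (the 150-300 ms range comes from the Raft paper; the function name is mine):

```python
import random

def election_timeout_ms(rng: random.Random) -> float:
    """Pick a fresh randomized timeout before every election attempt.

    Randomness is the practical escape hatch from FLP: a deterministic
    retry schedule could keep two candidates perfectly synchronized
    forever, while random timeouts make endless ties vanishingly
    unlikely.
    """
    return rng.uniform(150, 300)
```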
### 3. Leader vs Leaderless Consensus
| Aspect | Leader-Based (Raft, Paxos) | Leaderless (Gossip, CRDT) |
|---|---|---|
| Ordering | Strong, total order | Eventual, partial order |
| Write path | Single leader | Any node |
| Failure handling | Leader election | Automatic, no single point |
| Use case | Metadata, coordination | AP systems, counters |
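To make the leaderless column concrete, here is a minimal G-Counter CRDT sketch: every node increments its own slot, and merge is an element-wise max, so replicas converge without any leader (at the cost of only eventual, partial ordering). Function names are illustrative:

```python
def g_increment(counter: dict, node: str, amount: int = 1) -> dict:
    """Each node only ever increments its own slot."""
    updated = dict(counter)
    updated[node] = updated.get(node, 0) + amount
    return updated

def g_merge(a: dict, b: dict) -> dict:
    """Element-wise max: commutative, associative, idempotent."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def g_value(counter: dict) -> int:
    """The observed count is the sum over all nodes' slots."""
    return sum(counter.values())
```

Because merge is order-insensitive, any node can accept writes and replicas can sync in any order; that is exactly the trade against the strong total order a leader provides.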
## Level 2 — Raft Deep Dive

### 4. Raft Leader Election

[Sequence diagram: five servers in term 1 after a leader crash. Server 3's randomized election timeout (150-300 ms) fires first; it increments its term to 2, votes for itself, and sends RequestVote(term=2) to the other four servers. Servers 1, 2, and 4 grant their votes, giving Server 3 four of five (a majority), so it becomes leader and establishes authority with AppendEntries heartbeats. Term 2: Server 3 is leader.]
Raft leader election. When a follower’s election timer expires (randomized 150-300ms), it becomes a candidate and requests votes. First candidate to get a majority wins. Randomized timeouts prevent perpetual split votes.
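The follower side of that exchange fits in a few lines. A simplified sketch (real Raft also requires the candidate's log to be at least as up-to-date as the voter's, omitted here; the dict-based state is illustrative):

```python
def handle_request_vote(state: dict, candidate: str, term: int) -> bool:
    """Grant at most one vote per term, never to a stale candidate."""
    if term < state["current_term"]:
        return False                       # candidate is behind; reject
    if term > state["current_term"]:
        state["current_term"] = term       # newer term resets our vote
        state["voted_for"] = None
    if state["voted_for"] in (None, candidate):
        state["voted_for"] = candidate     # first request in this term wins
        return True
    return False                           # already voted for someone else
```

The one-vote-per-term rule is what makes two leaders in the same term impossible: two majorities would have to share at least one voter, and that voter only votes once.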
### 5. Raft Log Replication

[Sequence diagram: a client sends SET user=alice; the leader appends Log[8] = {term:3, SET user=alice} and sends AppendEntries(index=8, term=3) to four followers. Three acknowledge (the fourth is slow or partitioned); with the leader that is 4/5, so the entry is committed, applied to the leader's state machine, and acknowledged to the client. Subsequent AppendEntries carry commitIndex=8, and the caught-up followers apply Log[8].]
Raft log replication. The leader appends to its local log, replicates to followers, and commits once a majority acknowledges. Committed entries are guaranteed durable — even if the leader crashes, the new leader will have this entry.
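The leader's commit rule is a one-liner over its replication state. A sketch (matchIndex tracks the highest log index known replicated on each follower; real Raft additionally only commits entries from the leader's current term):

```python
def commit_index(match_index: dict, leader_last_index: int,
                 cluster_size: int) -> int:
    """Highest log index stored on a majority (the leader counts itself)."""
    acked = sorted(list(match_index.values()) + [leader_last_index],
                   reverse=True)
    # The index at position cluster_size // 2 is acked by a majority:
    # it and everything above it appear on at least (majority) nodes.
    return acked[cluster_size // 2]
```

For the scenario above, three followers at index 8 plus the leader yields commit index 8 even though the fourth follower is stuck at 3.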
### 6. Raft Term Progression

[State diagram: Term 1 ends when the leader fails. Term 2 produces a split vote with no majority and times out. Term 3 runs a new election, which S3 wins, followed by normal heartbeat operation. Term 4 begins with a network partition and another election (S5 elected).]
Raft terms. Each term has at most one leader. Terms act as a logical clock — any server that sees a higher term immediately steps down. Split votes cause empty terms, resolved by re-election with new randomized timeouts.
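The "higher term wins" rule is essentially the whole implementation of stepping down. A sketch (dict-based state is illustrative):

```python
def observe_term(state: dict, rpc_term: int) -> None:
    """Apply to every RPC sent or received: a higher term forces follower."""
    if rpc_term > state["current_term"]:
        state["current_term"] = rpc_term
        state["role"] = "follower"     # even a current leader steps down
        state["voted_for"] = None      # votes are per-term; reset
```

Because every message carries the sender's term, a deposed leader learns it is stale from the very first reply it gets after rejoining.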
### 7. Raft Network Partition

[Diagram, three phases. Before: S1 is leader (term 3) with followers S2-S5. During the partition: S1 retains a majority (S2, S3) and keeps committing; in the minority, S4 campaigns at term 4 but cannot reach a majority with only S5. After the partition heals: S1's committed entries are preserved, S4 steps down and reverts its uncommitted entries, and S5 follows S1.]
Raft partition safety. Only the partition with a majority can elect a leader and commit entries. The minority partition’s candidate can never get enough votes. When the partition heals, stale leaders discover the higher term and step down, reverting any uncommitted entries.
## Level 3 — Paxos

### 8. Basic Paxos (Single Decree)

[Sequence diagram: a proposer sends Prepare(n=1) to three acceptors and receives promises from a majority; none report a previously accepted value, so the proposer is free to propose any value. In Phase 2 it sends Accept(n=1, v="x=5"); all three acceptors accept, so "x=5" is chosen, and the acceptors notify the learner.]
Basic Paxos. Two phases — Prepare gets promises from a majority of acceptors, then Accept proposes a value. If a majority accepts, the value is chosen. The key invariant: a promise means “I won’t accept any proposal numbered less than n.”
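An acceptor's entire obligation is two comparisons against its promised number. A minimal sketch (the dict-based state and return shapes are illustrative):

```python
def prepare(acc: dict, n: int):
    """Phase 1: promise to ignore anything numbered below n."""
    if n > acc["promised"]:
        acc["promised"] = n
        # Report any previously accepted proposal; the proposer must
        # adopt that value instead of choosing a new one.
        return ("promise", acc["accepted_n"], acc["accepted_v"])
    return ("reject", None, None)

def accept(acc: dict, n: int, value):
    """Phase 2: accept unless a higher-numbered promise exists."""
    if n >= acc["promised"]:
        acc["promised"] = n
        acc["accepted_n"], acc["accepted_v"] = n, value
        return "accepted"
    return "rejected"
```

Note the asymmetry: prepare uses a strict greater-than while accept uses greater-or-equal, so an acceptor still honors the exact proposal it just promised.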
### 9. Paxos with Competing Proposers

[Sequence diagram: Proposer 1 completes Prepare(n=1), but before its Accept round, Proposer 2 runs Prepare(n=2) and wins promises from Acceptors 2 and 3. Proposer 1's Accept(n=1, v="x=5") is accepted by Acceptor 1 but rejected by Acceptor 2 (which promised n=2), so it fails with no majority. Proposer 2's Accept(n=2, v="x=7") reaches a majority, and "x=7" is chosen.]
Dueling proposers. When two proposers compete, the higher proposal number wins. Acceptor 2 breaks its promise to Proposer 1 when it promises Proposer 2’s higher-numbered proposal. This can cause livelock — each proposer keeps preempting the other. Multi-Paxos solves this with a stable leader.
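Short of a stable leader, the standard livelock mitigation is randomized backoff before re-proposing, the same symmetry-breaking trick as Raft's election timeouts. A sketch (the base delay and function name are illustrative):

```python
import random

def backoff_before_retry(attempt: int, base_ms: float = 10.0,
                         rng=random) -> float:
    """Randomized exponential backoff (ms) before re-proposing.

    Jitter desynchronizes dueling proposers, so one of them gets a
    window long enough to finish both phases before being preempted.
    """
    return rng.uniform(0, base_ms * (2 ** attempt))
```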
### 10. Multi-Paxos Optimization

[Sequence diagram: the distinguished proposer (leader) runs a one-time Prepare(n=1)/Promise exchange with all three acceptors to establish leadership. Every subsequent client command needs only Phase 2: Accept(n=1, slot=1, v="x=5"), then slot=2 for y=10, then slot=3 for z=15, each committed after a majority of Accepted replies.]
Multi-Paxos. A stable leader runs Phase 1 once, then handles all subsequent client requests with Phase 2 only — one round-trip instead of two. This is how Paxos works in practice. Google’s Chubby and Spanner both use Multi-Paxos variants.
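The saving is easy to quantify. A sketch counting proposer-to-acceptor round trips (ignoring learner notification and retries):

```python
def round_trips(num_commands: int, stable_leader: bool) -> int:
    """Proposer-to-acceptor round trips to commit num_commands values."""
    if stable_leader:
        # Multi-Paxos: one Prepare for leadership, then one Accept
        # round per command.
        return 1 + num_commands
    # Basic Paxos runs Prepare + Accept for every single command.
    return 2 * num_commands
```

For the three commands in the diagram that is 4 round trips instead of 6; over thousands of commands, the one-time Prepare amortizes to nearly zero.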
## Level 4 — Advanced Consensus

### 11. Raft vs Paxos Architecture

[Diagram. Raft (designed for understandability): a strong leader handles all operations, supported by log replication (leader to followers), leader election (randomized timeouts), and safety (election restriction plus log matching). Paxos (designed for correctness): Basic Paxos agrees on a single value and extends to Multi-Paxos (a log of values), Flexible Paxos (non-intersecting quorums), and EPaxos (leaderless, exploits commutativity).]
| Aspect | Raft | Paxos |
|---|---|---|
| Design goal | Understandability | Formal correctness |
| Leader | Required, strong | Optional (Multi-Paxos uses one) |
| Log structure | Contiguous, no gaps | Slots can have gaps |
| Membership change | Joint consensus | Reconfiguration protocol |
| Implementations | etcd, Consul, TiKV | Chubby, Spanner, Megastore |
| Learning curve | Moderate | Very steep |
### 12. Consensus in Practice

[Diagram. Raft-based systems: etcd (Kubernetes config store), Consul (service discovery), TiKV (distributed KV store), CockroachDB (distributed SQL). Paxos-based systems: Chubby (Google's lock service), Spanner (Google's global DB), Megastore (Google's entity store). ZAB-based systems: ZooKeeper (distributed coordination), historically including Kafka's metadata (now replaced by the Raft-based KRaft). All feed the same use cases: leader election, configuration changes, distributed locks, and transaction commit.]
Consensus everywhere. Every distributed database, every coordination service, every leader election mechanism is built on consensus. Raft dominates open-source (simpler to implement and debug). Paxos dominates Google-scale systems (more flexible, more optimizable).
## TL;DR — When to Use What
| Scenario | Algorithm | Why |
|---|---|---|
| New distributed system | Raft | Well-documented, many libraries, easier to debug |
| Google-scale global DB | Multi-Paxos | Skip Phase 1 optimization, flexible quorums |
| ZooKeeper-based infra | ZAB | Purpose-built for ordered broadcast |
| Leaderless / low-latency | EPaxos | No leader bottleneck, but complex |
| Learning consensus | Raft | Designed specifically to be understandable |
- Building a new system? → Start with Raft. Libraries exist for every language.
- Need sub-millisecond consensus? → Look at Multi-Paxos or EPaxos.
- Already using ZooKeeper? → ZAB handles it. Don’t reinvent the wheel.
- Learning distributed systems? → Read the Raft paper. It was literally designed to be taught.