How to Read This Post

Each scenario shows a diagram first, then a short note on why the pattern matters. Complexity increases as you scroll.

| Algorithm | Approach | Strength | Trade-off |
|---|---|---|---|
| Raft | Leader-based | Understandable, widely adopted | Leader bottleneck |
| Paxos | Proposer-based | Formally proven, flexible | Notoriously hard to implement |
| Multi-Paxos | Optimized Paxos | Fewer round-trips with stable leader | Complex recovery |
| ZAB | Leader-based | Ordered broadcast (ZooKeeper) | Tightly coupled to ZK |

Level 1 — Why Consensus?

1. The Agreement Problem

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TB C[Client] -->|"SET x=5"| N1["Node A
x = 5 ✓"] C -->|"SET x=5"| N2["Node B
x = 5 ✓"] C -->|"SET x=5"| N3["Node C
⚡ crashed"] N1 -->|"what is x?"| Q{{"Do all nodes
agree x=5?"}} N2 -->|"what is x?"| Q N3 -->|"???"| Q Q -->|"With consensus"| YES["All agree on x=5
even with failures"] Q -->|"Without consensus"| NO["Inconsistent state
split brain possible"] style C fill:#4a9eff,stroke:#2d7ed8,color:#fff style N1 fill:#51cf66,stroke:#37b24d,color:#fff style N2 fill:#51cf66,stroke:#37b24d,color:#fff style N3 fill:#ff6b6b,stroke:#d44,color:#fff style Q fill:#ffd43b,stroke:#f59f00,color:#333 style YES fill:#51cf66,stroke:#37b24d,color:#fff style NO fill:#ff6b6b,stroke:#d44,color:#fff

The core problem. Distributed nodes must agree on a single value even when some nodes crash or messages are delayed. Without consensus, you get split-brain — two parts of the system believing different things.
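The arithmetic behind "even when some nodes crash" is worth pinning down. Here is a minimal sketch (illustrative names, not from any particular implementation) of majority-quorum sizing:

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of an n-node cluster."""
    return n // 2 + 1

def max_failures(n: int) -> int:
    """Crash failures a majority-quorum system of n nodes can tolerate."""
    return (n - 1) // 2

# Any two majorities of the same cluster must overlap in at least one node,
# which is what prevents two halves from deciding different values.
assert majority(5) + majority(5) > 5

print(majority(5))      # 3 nodes needed to decide
print(max_failures(5))  # 2 crashes tolerated
```

The overlap property is the whole game: because two disjoint groups can never both hold a majority, split-brain decisions are structurally impossible. This is also why fault-tolerant clusters are sized at 2f+1 nodes to survive f failures.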


2. The FLP Impossibility

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TB subgraph FLP["FLP Impossibility (1985)"] direction TB A["Asynchronous
Network"] --- CENTER{{"Pick any two"}} F["Fault
Tolerance"] --- CENTER D["Deterministic
Termination"] --- CENTER end CENTER -->|"Raft, Paxos"| SOL1["Async + Fault Tolerant
✗ May not terminate
(use timeouts in practice)"] CENTER -->|"2PC"| SOL2["Async + Deterministic
✗ Blocks on failure
(coordinator crash = stuck)"] CENTER -->|"Synchronous protocols"| SOL3["Fault Tolerant + Deterministic
✗ Requires sync network
(unrealistic in practice)"] style A fill:#4a9eff,stroke:#2d7ed8,color:#fff style F fill:#ff6b6b,stroke:#d44,color:#fff style D fill:#51cf66,stroke:#37b24d,color:#fff style CENTER fill:#ffd43b,stroke:#f59f00,color:#333 style SOL1 fill:#a29bfe,stroke:#6c5ce7,color:#fff style SOL2 fill:#a29bfe,stroke:#6c5ce7,color:#fff style SOL3 fill:#a29bfe,stroke:#6c5ce7,color:#fff

Fischer, Lynch, and Paterson (1985). No deterministic algorithm can guarantee that consensus terminates in an asynchronous system with even one crash-faulty process. Safety is still achievable; it is termination that cannot be guaranteed. Practical systems like Raft use randomized timeouts to make non-termination vanishingly unlikely, sidestepping the theoretical limit rather than violating it.
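The workaround is small enough to sketch. Assuming the 150-300 ms range from the Raft paper (the constant and function names here are illustrative, not from any real codebase):

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # range suggested by the Raft paper

def election_timeout(rng: random.Random) -> float:
    """Pick a fresh randomized timeout before each election round."""
    lo, hi = ELECTION_TIMEOUT_MS
    return rng.uniform(lo, hi)

rng = random.Random(42)
timeouts = [election_timeout(rng) for _ in range(5)]

# Distinct timeouts make it unlikely that two followers become candidates
# simultaneously, so some election almost surely completes -- randomness,
# not determinism, is what gets around FLP in practice.
assert all(150 <= t <= 300 for t in timeouts)
assert len(set(timeouts)) == len(timeouts)
```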


3. Leader vs Leaderless Consensus

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart LR subgraph LB["Leader-Based"] direction TB L["Leader"] -->|"replicate"| F1["Follower"] L -->|"replicate"| F2["Follower"] L -->|"replicate"| F3["Follower"] CL["Client"] -->|"all writes"| L end subgraph LL["Leaderless"] direction TB N1["Node A"] <-->|"gossip"| N2["Node B"] N2 <-->|"gossip"| N3["Node C"] N3 <-->|"gossip"| N1 CLL["Client"] -->|"any node"| N1 CLL -->|"any node"| N2 end style L fill:#ffd43b,stroke:#f59f00,color:#333 style F1 fill:#546e7a,stroke:#90a4ae,color:#fff style F2 fill:#546e7a,stroke:#90a4ae,color:#fff style F3 fill:#546e7a,stroke:#90a4ae,color:#fff style CL fill:#4a9eff,stroke:#2d7ed8,color:#fff style N1 fill:#51cf66,stroke:#37b24d,color:#fff style N2 fill:#51cf66,stroke:#37b24d,color:#fff style N3 fill:#51cf66,stroke:#37b24d,color:#fff style CLL fill:#4a9eff,stroke:#2d7ed8,color:#fff

| Aspect | Leader-Based (Raft, Paxos) | Leaderless (Gossip, CRDT) |
|---|---|---|
| Ordering | Strong, total order | Eventual, partial order |
| Write path | Single leader | Any node |
| Failure handling | Leader election | Automatic, no single point |
| Use case | Metadata, coordination | AP systems, counters |

Level 2 — Raft Deep Dive

4. Raft Leader Election

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% sequenceDiagram participant S1 as Server 1
(Follower) participant S2 as Server 2
(Follower) participant S3 as Server 3
(Follower) participant S4 as Server 4
(Follower) participant S5 as Server 5
(Follower) Note over S1,S5: Term 1 — Leader crashed Note over S3: Election timeout!
150-300ms randomized S3->>S3: Increment term → 2
Vote for self par RequestVote RPCs S3->>S1: RequestVote(term=2) S3->>S2: RequestVote(term=2) S3->>S4: RequestVote(term=2) S3->>S5: RequestVote(term=2) end S1-->>S3: Vote granted S2-->>S3: Vote granted S4-->>S3: Vote granted Note over S3: 4 votes (self + 3)
Majority of 5 ✓ S3->>S3: Become Leader par Heartbeats establish authority S3->>S1: AppendEntries(heartbeat) S3->>S2: AppendEntries(heartbeat) S3->>S4: AppendEntries(heartbeat) S3->>S5: AppendEntries(heartbeat) end Note over S1,S5: Term 2 — Server 3 is Leader

Raft leader election. When a follower’s election timer expires (randomized, typically 150-300 ms), it increments its term, becomes a candidate, and requests votes from its peers. Each server grants at most one vote per term, and only to a candidate whose log is at least as up to date as its own. The first candidate to collect votes from a majority wins; randomized timeouts make perpetual split votes unlikely.
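A toy RequestVote handler captures the one-vote-per-term rule from the diagram. This is a deliberate simplification: real Raft also compares the candidate's last log term and index before granting a vote, which is omitted here; all names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Follower:
    current_term: int = 1
    voted_for: Optional[str] = None

    def request_vote(self, term: int, candidate_id: str) -> bool:
        """Grant at most one vote per term; reject candidates on stale terms."""
        if term < self.current_term:
            return False              # candidate is behind us
        if term > self.current_term:
            self.current_term = term  # newer term resets our vote
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False                  # already voted for someone else

# The five-server scenario above: S3 campaigns in term 2.
cluster = {s: Follower() for s in ("S1", "S2", "S4", "S5")}
votes = 1 + sum(f.request_vote(2, "S3") for f in cluster.values())
assert votes >= 3   # majority of 5 -- S3 becomes leader
```

The "at most one vote per term" line is what guarantees at most one leader per term: two candidates cannot both assemble majorities out of servers that each vote once.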


5. Raft Log Replication

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% sequenceDiagram participant C as Client participant L as Leader participant F1 as Follower 1 participant F2 as Follower 2 participant F3 as Follower 3 participant F4 as Follower 4 C->>L: Write: SET user=alice Note over L: Append to log (uncommitted)
Log[8] = {term:3, SET user=alice} par Replicate to all followers L->>F1: AppendEntries(index=8, term=3) L->>F2: AppendEntries(index=8, term=3) L->>F3: AppendEntries(index=8, term=3) L->>F4: AppendEntries(index=8, term=3) end F1-->>L: Success (log matched) F2-->>L: Success (log matched) F3-->>L: Success (log matched) Note over F4: Slow / partitioned Note over L: 3/4 followers + self = 4/5
Majority reached → COMMIT L->>L: Apply Log[8] to state machine L-->>C: OK: user=alice committed par Notify commit index L->>F1: AppendEntries(commitIndex=8) L->>F2: AppendEntries(commitIndex=8) L->>F3: AppendEntries(commitIndex=8) L->>F4: AppendEntries(commitIndex=8) end F1->>F1: Apply Log[8] F2->>F2: Apply Log[8] F3->>F3: Apply Log[8]

Raft log replication. The leader appends to its local log, replicates to followers, and commits once a majority acknowledges. Committed entries are guaranteed durable — even if the leader crashes, the new leader will have this entry.
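The commit rule can be sketched as a small function over follower acknowledgements. Another simplification, loudly labeled: real Raft additionally only commits entries from the leader's current term; the names here are illustrative.

```python
def leader_commit_index(match_index: dict, leader_last: int, n: int) -> int:
    """Highest log index replicated on a majority (the leader counts itself).

    match_index maps each follower to the highest log index known to be
    replicated on it; leader_last is the leader's own last log index.
    """
    acks = sorted(list(match_index.values()) + [leader_last], reverse=True)
    return acks[n // 2]  # the index held by at least a majority of n servers

# The scenario above: entry 8 acked by F1-F3, F4 slow and still at 7.
ci = leader_commit_index({"F1": 8, "F2": 8, "F3": 8, "F4": 7}, leader_last=8, n=5)
assert ci == 8  # 4 of 5 servers hold entry 8 -- committed
```

Sorting descending and taking the element at position n // 2 picks the largest index that at least a majority of servers have stored, which is exactly the "majority acknowledged" condition in the diagram.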


6. Raft Term Progression

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% stateDiagram-v2 state "Term 1" as T1 state "Term 2" as T2 state "Term 3" as T3 state "Term 4" as T4 state T1 { [*] --> Leader_A: S1 elected Leader_A --> Heartbeats1: normal operation Heartbeats1 --> Leader_A } state T2 { [*] --> Election2: S1 crashes Election2 --> SplitVote: S2 and S3 both
start election SplitVote --> Timeout2: no majority } state T3 { [*] --> Election3: new election Election3 --> Leader_B: S3 wins Leader_B --> Heartbeats3: normal operation Heartbeats3 --> Leader_B } state T4 { [*] --> Partition: network split Partition --> Leader_C: S5 elected
in minority partition } T1 --> T2: leader failure T2 --> T3: split vote → re-election T3 --> T4: network partition

Raft terms. Each term has at most one leader. Terms act as a logical clock — any server that sees a higher term immediately steps down. Split votes cause empty terms, resolved by re-election with new randomized timeouts.
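The step-down rule is a one-liner in spirit. A hypothetical sketch (function and parameter names are mine, not Raft's):

```python
def observe_term(current_term: int, is_leader: bool, seen_term: int):
    """Any message carrying a higher term forces an immediate step-down."""
    if seen_term > current_term:
        return seen_term, False  # adopt the newer term, revert to follower
    return current_term, is_leader

# A stale leader from term 3 hears a term-4 message and steps down.
term, leading = observe_term(current_term=3, is_leader=True, seen_term=4)
assert (term, leading) == (4, False)
```

Because every RPC in Raft carries the sender's term, this check runs on every message, so a deposed leader learns of its demotion on first contact with the rest of the cluster.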


7. Raft Network Partition

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TB subgraph Before["Before Partition — Term 3"] direction LR BL["S1 (Leader)"] --> BF1["S2"] BL --> BF2["S3"] BL --> BF3["S4"] BL --> BF4["S5"] end Before -->|"Network splits"| During subgraph During["During Partition"] direction LR subgraph Majority["Majority Partition (3 nodes)"] ML["S1 (Leader, term 3)"] MF1["S2"] MF2["S3"] end subgraph Minority["Minority Partition (2 nodes)"] MN1["S4 → Candidate → Leader?
term 4, but can't get majority"] MN2["S5"] end end During -->|"Partition heals"| After subgraph After["After Partition Heals"] direction LR AL["S1 (Leader, term 3)
committed entries preserved"] AF1["S2"] AF2["S3"] AF3["S4 steps down
reverts uncommitted entries"] AF4["S5 follows S1"] end style BL fill:#ffd43b,stroke:#f59f00,color:#333 style BF1 fill:#546e7a,stroke:#90a4ae,color:#fff style BF2 fill:#546e7a,stroke:#90a4ae,color:#fff style BF3 fill:#546e7a,stroke:#90a4ae,color:#fff style BF4 fill:#546e7a,stroke:#90a4ae,color:#fff style ML fill:#ffd43b,stroke:#f59f00,color:#333 style MF1 fill:#51cf66,stroke:#37b24d,color:#fff style MF2 fill:#51cf66,stroke:#37b24d,color:#fff style MN1 fill:#ff6b6b,stroke:#d44,color:#fff style MN2 fill:#ff6b6b,stroke:#d44,color:#fff style AL fill:#ffd43b,stroke:#f59f00,color:#333 style AF1 fill:#51cf66,stroke:#37b24d,color:#fff style AF2 fill:#51cf66,stroke:#37b24d,color:#fff style AF3 fill:#546e7a,stroke:#90a4ae,color:#fff style AF4 fill:#546e7a,stroke:#90a4ae,color:#fff

Raft partition safety. Only the partition with a majority can elect a leader and commit entries. The minority partition’s candidate can never get enough votes. When the partition heals, stale leaders discover the higher term and step down, reverting any uncommitted entries.


Level 3 — Paxos

8. Basic Paxos (Single Decree)

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% sequenceDiagram participant P as Proposer participant A1 as Acceptor 1 participant A2 as Acceptor 2 participant A3 as Acceptor 3 participant L as Learner Note over P,A3: Phase 1 — Prepare P->>A1: Prepare(n=1) P->>A2: Prepare(n=1) P->>A3: Prepare(n=1) A1-->>P: Promise(n=1, no prior accepted) A2-->>P: Promise(n=1, no prior accepted) A3-->>P: Promise(n=1, no prior accepted) Note over P: Majority promised
Free to propose any value Note over P,A3: Phase 2 — Accept P->>A1: Accept(n=1, v="x=5") P->>A2: Accept(n=1, v="x=5") P->>A3: Accept(n=1, v="x=5") A1-->>P: Accepted(n=1) A2-->>P: Accepted(n=1) A3-->>P: Accepted(n=1) Note over P: Majority accepted
Value "x=5" is CHOSEN par Notify learners A1->>L: Accepted(n=1, v="x=5") A2->>L: Accepted(n=1, v="x=5") end Note over L: Learned: x=5

Basic Paxos. Two phases: Prepare gets promises from a majority of acceptors, then Accept proposes a value. If a majority accepts, the value is chosen. The key invariant: a promise means “I won’t accept any proposal numbered less than n.” One subtlety the happy path hides: if any acceptor reports a previously accepted value in its promise, the proposer must propose the highest-numbered such value instead of its own. That rule is what makes a chosen value stick across competing proposals.
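A toy acceptor makes the promise invariant concrete. Illustrative names throughout; the proposer-side rule of adopting a previously accepted value is not shown.

```python
from typing import Optional, Tuple

class Acceptor:
    def __init__(self):
        self.promised_n = 0  # highest proposal number promised so far
        self.accepted: Optional[Tuple[int, str]] = None  # (n, value)

    def prepare(self, n: int):
        """Phase 1: promise to ignore anything numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)  # report any prior accept
        return ("reject", None)

    def accept(self, n: int, value: str) -> str:
        """Phase 2: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return "accepted"
        return "rejected"

# The happy path from the diagram: one proposer, three acceptors.
acceptors = [Acceptor() for _ in range(3)]
assert all(a.prepare(1)[0] == "promise" for a in acceptors)
assert all(a.accept(1, "x=5") == "accepted" for a in acceptors)
```

Note that an acceptor's state is just two fields. The difficulty of Paxos lives almost entirely on the proposer side and in recovery, not here.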


9. Paxos with Competing Proposers

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% sequenceDiagram participant P1 as Proposer 1 participant A1 as Acceptor 1 participant A2 as Acceptor 2 participant A3 as Acceptor 3 participant P2 as Proposer 2 Note over P1,P2: Two proposers compete P1->>A1: Prepare(n=1) P1->>A2: Prepare(n=1) A1-->>P1: Promise(n=1) A2-->>P1: Promise(n=1) Note over P1: Has majority promises for n=1 P2->>A2: Prepare(n=2) P2->>A3: Prepare(n=2) A2-->>P2: Promise(n=2) A3-->>P2: Promise(n=2) Note over P2: Has majority promises for n=2
A2 broke promise to P1 P1->>A1: Accept(n=1, v="x=5") P1->>A2: Accept(n=1, v="x=5") A1-->>P1: Accepted(n=1) A2-->>P1: REJECTED (promised n=2) Note over P1: Only 1 accept
No majority — FAILED P2->>A2: Accept(n=2, v="x=7") P2->>A3: Accept(n=2, v="x=7") A2-->>P2: Accepted(n=2) A3-->>P2: Accepted(n=2) Note over P2: Majority accepted
Value "x=7" is CHOSEN

Dueling proposers. When two proposers compete, the higher proposal number wins. Acceptor 2’s newer promise to Proposer 2 supersedes the one it gave Proposer 1, so it rejects Proposer 1’s Accept. This can cause livelock: each proposer keeps preempting the other with an ever-higher number, and no value is ever chosen. Multi-Paxos avoids this by funneling all proposals through a stable leader.
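The rejection in the diagram falls directly out of the acceptor's promise check. A stripped-down sketch (booleans only, no stored values) replaying Acceptor 2's view of the race:

```python
class Acceptor:
    """Minimal acceptor: tracks only the highest promised proposal number."""
    def __init__(self):
        self.promised_n = 0

    def prepare(self, n: int) -> bool:
        if n > self.promised_n:
            self.promised_n = n
            return True   # promise granted
        return False

    def accept(self, n: int) -> bool:
        return n >= self.promised_n  # rejected if a higher prepare arrived

a2 = Acceptor()
assert a2.prepare(1)       # promise to Proposer 1 (n=1)
assert a2.prepare(2)       # higher number: newer promise to Proposer 2
assert not a2.accept(1)    # P1's Accept arrives too late -- rejected
assert a2.accept(2)        # P2's Accept succeeds; "x=7" is chosen
```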


10. Multi-Paxos Optimization

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% sequenceDiagram participant C as Client participant L as Stable Leader
(Distinguished Proposer) participant A1 as Acceptor 1 participant A2 as Acceptor 2 participant A3 as Acceptor 3 Note over L,A3: Phase 1 — One-time Prepare (leadership) L->>A1: Prepare(n=1) L->>A2: Prepare(n=1) L->>A3: Prepare(n=1) A1-->>L: Promise(n=1) A2-->>L: Promise(n=1) A3-->>L: Promise(n=1) Note over L: Leader established
Skip Phase 1 for subsequent rounds rect rgb(30, 60, 30) Note over C,A3: Round 1 — Accept only (Phase 2) C->>L: SET x=5 L->>A1: Accept(n=1, slot=1, v="x=5") L->>A2: Accept(n=1, slot=1, v="x=5") L->>A3: Accept(n=1, slot=1, v="x=5") A1-->>L: Accepted A2-->>L: Accepted L-->>C: OK: x=5 committed end rect rgb(30, 30, 60) Note over C,A3: Round 2 — Accept only (Phase 2) C->>L: SET y=10 L->>A1: Accept(n=1, slot=2, v="y=10") L->>A2: Accept(n=1, slot=2, v="y=10") L->>A3: Accept(n=1, slot=2, v="y=10") A1-->>L: Accepted A2-->>L: Accepted L-->>C: OK: y=10 committed end rect rgb(60, 30, 30) Note over C,A3: Round 3 — Accept only (Phase 2) C->>L: SET z=15 L->>A1: Accept(n=1, slot=3, v="z=15") L->>A2: Accept(n=1, slot=3, v="z=15") L->>A3: Accept(n=1, slot=3, v="z=15") A1-->>L: Accepted A2-->>L: Accepted L-->>C: OK: z=15 committed end

Multi-Paxos. A stable leader runs Phase 1 once, then handles all subsequent client requests with Phase 2 only — one round-trip instead of two. This is how Paxos works in practice. Google’s Chubby and Spanner both use Multi-Paxos variants.
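A back-of-envelope model shows why the optimization matters. This counts proposer round-trips only, ignoring retries and leader changes; the function is an illustration, not anyone's API.

```python
def round_trips(values: int, stable_leader: bool) -> int:
    """Proposer round-trips needed to commit `values` log entries."""
    if stable_leader:
        return 1 + values   # one Prepare to claim leadership, then Accept-only
    return 2 * values       # Basic Paxos: Prepare + Accept for every value

# Committing 100 values:
assert round_trips(100, stable_leader=False) == 200
assert round_trips(100, stable_leader=True) == 101
```

As the log grows, the one-time Prepare amortizes to nothing and Multi-Paxos converges on one round-trip per committed value, the same steady-state cost as Raft's AppendEntries.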


Level 4 — Advanced Consensus

11. Raft vs Paxos Architecture

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TB subgraph Raft["Raft — Designed for Understandability"] direction TB RL["Strong Leader
handles ALL operations"] RL --> RLR["Log Replication
leader → followers"] RL --> RLE["Leader Election
randomized timeouts"] RL --> RSM["Safety
election restriction +
log matching"] end subgraph Paxos["Paxos — Designed for Correctness"] direction TB PB["Basic Paxos
single value agreement"] PB --> PMP["Multi-Paxos
log of values"] PB --> PFP["Flexible Paxos
non-intersecting quorums"] PB --> PEP["EPaxos
leaderless, commutative"] end style RL fill:#4a9eff,stroke:#2d7ed8,color:#fff style RLR fill:#4a9eff,stroke:#2d7ed8,color:#fff style RLE fill:#4a9eff,stroke:#2d7ed8,color:#fff style RSM fill:#4a9eff,stroke:#2d7ed8,color:#fff style PB fill:#a29bfe,stroke:#6c5ce7,color:#fff style PMP fill:#a29bfe,stroke:#6c5ce7,color:#fff style PFP fill:#a29bfe,stroke:#6c5ce7,color:#fff style PEP fill:#a29bfe,stroke:#6c5ce7,color:#fff

| Aspect | Raft | Paxos |
|---|---|---|
| Design goal | Understandability | Formal correctness |
| Leader | Required, strong | Optional (Multi-Paxos uses one) |
| Log structure | Contiguous, no gaps | Slots can have gaps |
| Membership change | Joint consensus | Reconfiguration protocol |
| Implementations | etcd, Consul, TiKV | Chubby, Spanner, Megastore |
| Learning curve | Moderate | Very steep |

12. Consensus in Practice

%%{init: {'theme':'dark', 'themeVariables': {'primaryTextColor':'#e5e7eb','secondaryTextColor':'#e5e7eb','tertiaryTextColor':'#e5e7eb','textColor':'#e5e7eb','nodeTextColor':'#e5e7eb','edgeLabelText':'#e5e7eb','clusterTextColor':'#e5e7eb','actorTextColor':'#e5e7eb'}}}%% flowchart TB subgraph Raft_Systems["Raft-Based Systems"] ETCD["etcd
Kubernetes config store"] CONSUL["Consul
Service discovery"] TIKV["TiKV
Distributed KV store"] CRDB["CockroachDB
Distributed SQL"] end subgraph Paxos_Systems["Paxos-Based Systems"] CHUBBY["Chubby
Google's lock service"] SPANNER["Spanner
Google's global DB"] MEGA["Megastore
Google's entity store"] end subgraph ZAB_Systems["ZAB-Based Systems"] ZK["ZooKeeper
Distributed coordination"] KAFKA_META["Kafka (KRaft)
Metadata management"] end subgraph Use_Cases["What Gets Decided"] direction LR UC1["Leader election"] UC2["Configuration changes"] UC3["Distributed locks"] UC4["Transaction commit"] end Raft_Systems --> Use_Cases Paxos_Systems --> Use_Cases ZAB_Systems --> Use_Cases style ETCD fill:#4a9eff,stroke:#2d7ed8,color:#fff style CONSUL fill:#4a9eff,stroke:#2d7ed8,color:#fff style TIKV fill:#4a9eff,stroke:#2d7ed8,color:#fff style CRDB fill:#4a9eff,stroke:#2d7ed8,color:#fff style CHUBBY fill:#a29bfe,stroke:#6c5ce7,color:#fff style SPANNER fill:#a29bfe,stroke:#6c5ce7,color:#fff style MEGA fill:#a29bfe,stroke:#6c5ce7,color:#fff style ZK fill:#27ae60,stroke:#1e8449,color:#fff style KAFKA_META fill:#27ae60,stroke:#1e8449,color:#fff style UC1 fill:#ffd43b,stroke:#f59f00,color:#333 style UC2 fill:#ffd43b,stroke:#f59f00,color:#333 style UC3 fill:#ffd43b,stroke:#f59f00,color:#333 style UC4 fill:#ffd43b,stroke:#f59f00,color:#333

Consensus everywhere. Every distributed database, coordination service, and leader-election mechanism is built on consensus. Raft dominates open-source systems because it is simpler to implement and debug; Paxos dominates Google-scale systems because it is more flexible and leaves more room for optimization.


TL;DR — When to Use What

| Scenario | Algorithm | Why |
|---|---|---|
| New distributed system | Raft | Well-documented, many libraries, easier to debug |
| Google-scale global DB | Multi-Paxos | Skip Phase 1 optimization, flexible quorums |
| ZooKeeper-based infra | ZAB | Purpose-built for ordered broadcast |
| Leaderless / low-latency | EPaxos | No leader bottleneck, but complex |
| Learning consensus | Raft | Designed specifically to be understandable |
quadrantChart title Consensus Algorithm Trade-offs x-axis "Simple" --> "Complex" y-axis "Rigid" --> "Flexible" quadrant-1 "Paxos Family" quadrant-2 "Emerging" quadrant-3 "Practical Choice" quadrant-4 "Specialized" "Raft": [0.30, 0.35] "Basic Paxos": [0.65, 0.60] "Multi-Paxos": [0.75, 0.80] "EPaxos": [0.85, 0.90] "ZAB": [0.50, 0.25] "Viewstamped Replication": [0.55, 0.45] "Flexible Paxos": [0.80, 0.95]

  • Building a new system? → Start with Raft. Libraries exist for every language.
  • Need sub-millisecond consensus? → Look at Multi-Paxos or EPaxos.
  • Already using ZooKeeper? → ZAB handles it. Don’t reinvent the wheel.
  • Learning distributed systems? → Read the Raft paper. It was literally designed to be taught.