## How to Read This Post
Each scenario shows a diagram first, then a short note on why the pattern matters. Complexity increases as you scroll.
| Algorithm | Approach | Strength | Trade-off |
|---|---|---|---|
| Raft | Leader-based | Understandable, widely adopted | Leader bottleneck |
| Paxos | Proposer-based | Formally proven, flexible | Notoriously hard to implement |
| Multi-Paxos | Optimized Paxos | Fewer round-trips with stable leader | Complex recovery |
| ZAB | Leader-based | Ordered broadcast (ZooKeeper) | Tightly coupled to ZK |
## Level 1 — Why Consensus?

### 1. The Agreement Problem

[Diagram: a coordinator sends SET x=5 to Nodes A, B, and C. Nodes A and B store x=5; Node C has crashed. With consensus, all nodes still agree on x=5 despite the failure; without it, state diverges and split brain becomes possible.]
The core problem. Distributed nodes must agree on a single value even when some nodes crash or messages are delayed. Without consensus, you get split-brain — two parts of the system believing different things.
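Raft and Paxos both prevent split brain the same way: decisions require a majority, and any two majorities of the same cluster must overlap. A minimal sketch (function names are illustrative):

```python
def majority(cluster_size: int) -> int:
    """Smallest quorum that forms a strict majority."""
    return cluster_size // 2 + 1

def quorums_can_be_disjoint(cluster_size: int) -> bool:
    """Could two majority quorums exist with no node in common?"""
    return 2 * majority(cluster_size) <= cluster_size

# In a 5-node cluster a quorum is 3 nodes; two disjoint quorums would
# need 6 nodes, so two "leaders" can never both commit: no split brain.
```

The overlap guarantee is the whole trick: whatever one majority decided, at least one member of any later majority remembers it.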
### 2. The FLP Impossibility

[Diagram: "pick any two" of Asynchronous Network, Fault Tolerance, and Deterministic Termination. Raft and Paxos pick async + fault-tolerant (may not terminate; timeouts in practice). 2PC picks async + deterministic (blocks on failure; a coordinator crash leaves it stuck). Synchronous protocols pick fault-tolerant + deterministic (require a synchronous network, unrealistic in practice).]
Fischer, Lynch, and Paterson (1985). No deterministic algorithm can guarantee consensus in an asynchronous system with even one faulty process. Practical systems like Raft use randomized timeouts to work around this theoretical limit.
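Raft's workaround can be stated in one function: never retry elections on a fixed schedule. A sketch (the 150-300 ms range comes from the Raft paper; the function name is mine):

```python
import random

def election_timeout_ms(rng: random.Random) -> float:
    """Pick a fresh randomized timeout before every election attempt.

    Randomness is the practical escape hatch from FLP: a deterministic
    retry schedule could keep two candidates perfectly synchronized
    forever, while random timeouts make endless ties vanishingly
    unlikely.
    """
    return rng.uniform(150, 300)
```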
### 3. Leader vs Leaderless Consensus
| Aspect | Leader-Based (Raft, Paxos) | Leaderless (Gossip, CRDT) |
|---|---|---|
| Ordering | Strong, total order | Eventual, partial order |
| Write path | Single leader | Any node |
| Failure handling | Leader election | Automatic, no single point |
| Use case | Metadata, coordination | AP systems, counters |
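To make the leaderless column concrete, here is a minimal G-Counter CRDT sketch: every node increments its own slot, and merge is an element-wise max, so replicas converge without any leader (at the cost of only eventual, partial ordering). Function names are illustrative:

```python
def g_increment(counter: dict, node: str, amount: int = 1) -> dict:
    """Each node only ever increments its own slot."""
    updated = dict(counter)
    updated[node] = updated.get(node, 0) + amount
    return updated

def g_merge(a: dict, b: dict) -> dict:
    """Element-wise max: commutative, associative, idempotent."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def g_value(counter: dict) -> int:
    """The observed count is the sum over all nodes' slots."""
    return sum(counter.values())
```

Because merge is order-insensitive, any node can accept writes and replicas can sync in any order; that is exactly the trade against the strong total order a leader provides.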
## Level 2 — Raft Deep Dive

### 4. Raft Leader Election

[Sequence diagram: five servers in term 1 after a leader crash. Server 3's randomized election timeout (150-300 ms) fires first; it increments its term to 2, votes for itself, and sends RequestVote(term=2) to the other four servers. Servers 1, 2, and 4 grant their votes, giving Server 3 four of five (a majority), so it becomes leader and establishes authority with AppendEntries heartbeats. Term 2: Server 3 is leader.]
Raft leader election. When a follower’s election timer expires (randomized 150-300ms), it becomes a candidate and requests votes. First candidate to get a majority wins. Randomized timeouts prevent perpetual split votes.
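The follower side of that exchange fits in a few lines. A simplified sketch (real Raft also requires the candidate's log to be at least as up-to-date as the voter's, omitted here; the dict-based state is illustrative):

```python
def handle_request_vote(state: dict, candidate: str, term: int) -> bool:
    """Grant at most one vote per term, never to a stale candidate."""
    if term < state["current_term"]:
        return False                       # candidate is behind; reject
    if term > state["current_term"]:
        state["current_term"] = term       # newer term resets our vote
        state["voted_for"] = None
    if state["voted_for"] in (None, candidate):
        state["voted_for"] = candidate     # first request in this term wins
        return True
    return False                           # already voted for someone else
```

The one-vote-per-term rule is what makes two leaders in the same term impossible: two majorities would have to share at least one voter, and that voter only votes once.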
### 5. Raft Log Replication

[Sequence diagram: a client sends SET user=alice; the leader appends Log[8] = {term:3, SET user=alice} and sends AppendEntries(index=8, term=3) to four followers. Three acknowledge (the fourth is slow or partitioned); with the leader that is 4/5, so the entry is committed, applied to the leader's state machine, and acknowledged to the client. Subsequent AppendEntries carry commitIndex=8, and the caught-up followers apply Log[8].]
Raft log replication. The leader appends to its local log, replicates to followers, and commits once a majority acknowledges. Committed entries are guaranteed durable — even if the leader crashes, the new leader will have this entry.
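The leader's commit rule is a one-liner over its replication state. A sketch (matchIndex tracks the highest log index known replicated on each follower; real Raft additionally only commits entries from the leader's current term):

```python
def commit_index(match_index: dict, leader_last_index: int,
                 cluster_size: int) -> int:
    """Highest log index stored on a majority (the leader counts itself)."""
    acked = sorted(list(match_index.values()) + [leader_last_index],
                   reverse=True)
    # The index at position cluster_size // 2 is acked by a majority:
    # it and everything above it appear on at least (majority) nodes.
    return acked[cluster_size // 2]
```

For the scenario above, three followers at index 8 plus the leader yields commit index 8 even though the fourth follower is stuck at 3.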
### 6. Raft Term Progression

[State diagram: Term 1 ends when the leader fails. Term 2 produces a split vote with no majority and times out. Term 3 runs a new election, which S3 wins, followed by normal heartbeat operation. Term 4 begins with a network partition and another election (S5 elected).]
Raft terms. Each term has at most one leader. Terms act as a logical clock — any server that sees a higher term immediately steps down. Split votes cause empty terms, resolved by re-election with new randomized timeouts.
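The "higher term wins" rule is essentially the whole implementation of stepping down. A sketch (dict-based state is illustrative):

```python
def observe_term(state: dict, rpc_term: int) -> None:
    """Apply to every RPC sent or received: a higher term forces follower."""
    if rpc_term > state["current_term"]:
        state["current_term"] = rpc_term
        state["role"] = "follower"     # even a current leader steps down
        state["voted_for"] = None      # votes are per-term; reset
```

Because every message carries the sender's term, a deposed leader learns it is stale from the very first reply it gets after rejoining.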
### 7. Raft Network Partition

[Diagram, three phases. Before: S1 is leader (term 3) with followers S2-S5. During the partition: S1 retains a majority (S2, S3) and keeps committing; in the minority, S4 campaigns at term 4 but cannot reach a majority with only S5. After the partition heals: S1's committed entries are preserved, S4 steps down and reverts its uncommitted entries, and S5 follows S1.]
Raft partition safety. Only the partition with a majority can elect a leader and commit entries. The minority partition’s candidate can never get enough votes. When the partition heals, stale leaders discover the higher term and step down, reverting any uncommitted entries.
## Level 3 — Paxos

### 8. Basic Paxos (Single Decree)

[Sequence diagram: a proposer sends Prepare(n=1) to three acceptors and receives promises from a majority; none report a previously accepted value, so the proposer is free to propose any value. In Phase 2 it sends Accept(n=1, v="x=5"); all three acceptors accept, so "x=5" is chosen, and the acceptors notify the learner.]
Basic Paxos. Two phases — Prepare gets promises from a majority of acceptors, then Accept proposes a value. If a majority accepts, the value is chosen. The key invariant: a promise means “I won’t accept any proposal numbered less than n.”
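An acceptor's entire obligation is two comparisons against its promised number. A minimal sketch (the dict-based state and return shapes are illustrative):

```python
def prepare(acc: dict, n: int):
    """Phase 1: promise to ignore anything numbered below n."""
    if n > acc["promised"]:
        acc["promised"] = n
        # Report any previously accepted proposal; the proposer must
        # adopt that value instead of choosing a new one.
        return ("promise", acc["accepted_n"], acc["accepted_v"])
    return ("reject", None, None)

def accept(acc: dict, n: int, value):
    """Phase 2: accept unless a higher-numbered promise exists."""
    if n >= acc["promised"]:
        acc["promised"] = n
        acc["accepted_n"], acc["accepted_v"] = n, value
        return "accepted"
    return "rejected"
```

Note the asymmetry: prepare uses a strict greater-than while accept uses greater-or-equal, so an acceptor still honors the exact proposal it just promised.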
### 9. Paxos with Competing Proposers

[Sequence diagram: Proposer 1 completes Prepare(n=1), but before its Accept round, Proposer 2 runs Prepare(n=2) and wins promises from Acceptors 2 and 3. Proposer 1's Accept(n=1, v="x=5") is accepted by Acceptor 1 but rejected by Acceptor 2 (which promised n=2), so it fails with no majority. Proposer 2's Accept(n=2, v="x=7") reaches a majority, and "x=7" is chosen.]
Dueling proposers. When two proposers compete, the higher proposal number wins. Acceptor 2 breaks its promise to Proposer 1 when it promises Proposer 2’s higher-numbered proposal. This can cause livelock — each proposer keeps preempting the other. Multi-Paxos solves this with a stable leader.
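Short of a stable leader, the standard livelock mitigation is randomized backoff before re-proposing, the same symmetry-breaking trick as Raft's election timeouts. A sketch (the base delay and function name are illustrative):

```python
import random

def backoff_before_retry(attempt: int, base_ms: float = 10.0,
                         rng=random) -> float:
    """Randomized exponential backoff (ms) before re-proposing.

    Jitter desynchronizes dueling proposers, so one of them gets a
    window long enough to finish both phases before being preempted.
    """
    return rng.uniform(0, base_ms * (2 ** attempt))
```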
### 10. Multi-Paxos Optimization

[Sequence diagram: the distinguished proposer (leader) runs a one-time Prepare(n=1)/Promise exchange with all three acceptors to establish leadership. Every subsequent client command needs only Phase 2: Accept(n=1, slot=1, v="x=5"), then slot=2 for y=10, then slot=3 for z=15, each committed after a majority of Accepted replies.]
Multi-Paxos. A stable leader runs Phase 1 once, then handles all subsequent client requests with Phase 2 only — one round-trip instead of two. This is how Paxos works in practice. Google’s Chubby and Spanner both use Multi-Paxos variants.
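The saving is easy to quantify. A sketch counting proposer-to-acceptor round trips (ignoring learner notification and retries):

```python
def round_trips(num_commands: int, stable_leader: bool) -> int:
    """Proposer-to-acceptor round trips to commit num_commands values."""
    if stable_leader:
        # Multi-Paxos: one Prepare for leadership, then one Accept
        # round per command.
        return 1 + num_commands
    # Basic Paxos runs Prepare + Accept for every single command.
    return 2 * num_commands
```

For the three commands in the diagram that is 4 round trips instead of 6; over thousands of commands, the one-time Prepare amortizes to nearly zero.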
## Level 4 — Advanced Consensus

### 11. Raft vs Paxos Architecture

[Diagram. Raft (designed for understandability): a strong leader handles all operations, supported by log replication (leader to followers), leader election (randomized timeouts), and safety (election restriction plus log matching). Paxos (designed for correctness): Basic Paxos agrees on a single value and extends to Multi-Paxos (a log of values), Flexible Paxos (non-intersecting quorums), and EPaxos (leaderless, exploits commutativity).]
| Aspect | Raft | Paxos |
|---|---|---|
| Design goal | Understandability | Formal correctness |
| Leader | Required, strong | Optional (Multi-Paxos uses one) |
| Log structure | Contiguous, no gaps | Slots can have gaps |
| Membership change | Joint consensus | Reconfiguration protocol |
| Implementations | etcd, Consul, TiKV | Chubby, Spanner, Megastore |
| Learning curve | Moderate | Very steep |
### 12. Consensus in Practice

[Diagram. Raft-based systems: etcd (Kubernetes config store), Consul (service discovery), TiKV (distributed KV store), CockroachDB (distributed SQL). Paxos-based systems: Chubby (Google's lock service), Spanner (Google's global DB), Megastore (Google's entity store). ZAB-based systems: ZooKeeper (distributed coordination), historically including Kafka's metadata (now replaced by the Raft-based KRaft). All feed the same use cases: leader election, configuration changes, distributed locks, and transaction commit.]
Consensus everywhere. Every distributed database, every coordination service, every leader election mechanism is built on consensus. Raft dominates open-source (simpler to implement and debug). Paxos dominates Google-scale systems (more flexible, more optimizable).
## TL;DR — When to Use What
| Scenario | Algorithm | Why |
|---|---|---|
| New distributed system | Raft | Well-documented, many libraries, easier to debug |
| Google-scale global DB | Multi-Paxos | Skip Phase 1 optimization, flexible quorums |
| ZooKeeper-based infra | ZAB | Purpose-built for ordered broadcast |
| Leaderless / low-latency | EPaxos | No leader bottleneck, but complex |
| Learning consensus | Raft | Designed specifically to be understandable |
- Building a new system? → Start with Raft. Libraries exist for every language.
- Need sub-millisecond consensus? → Look at Multi-Paxos or EPaxos.
- Already using ZooKeeper? → ZAB handles it. Don’t reinvent the wheel.
- Learning distributed systems? → Read the Raft paper. It was literally designed to be taught.