The Bully Election Algorithm
The Bully Algorithm, proposed by Hector Garcia-Molina in 1982, is a classic leader election algorithm for distributed systems. It’s called “bully” because the highest-numbered process always wins and “bullies” the others into accepting it as leader.
The Scenario
A distributed system needs a coordinator:
- N nodes in a network
- Each node has a unique ID (priority)
- One node must be elected as leader
- When the leader fails, a new leader must be elected
- Rule: The node with the highest ID wins
The protocol:
- Any node can start an election when it detects leader failure
- Node sends “ELECTION” message to all higher-ID nodes
- If no response → declare victory
- If response → wait for “COORDINATOR” announcement
- New leader broadcasts “COORDINATOR” message to all
The Challenge
sequenceDiagram
participant N1 as Node 1
participant N2 as Node 2
participant N3 as Node 3
participant N4 as Node 4
participant N5 as Node 5 (Leader)
Note over N5: Leader Crashes!
N5--xN1: ❌
Note over N2: Detects failure,
starts election N2->>N3: ELECTION N2->>N4: ELECTION N2->>N5: ELECTION (no response) N3->>N4: ELECTION N3->>N5: ELECTION (no response) N4->>N5: ELECTION (no response) Note over N4: Highest alive,
becomes leader N4->>N1: COORDINATOR N4->>N2: COORDINATOR N4->>N3: COORDINATOR
starts election N2->>N3: ELECTION N2->>N4: ELECTION N2->>N5: ELECTION (no response) N3->>N4: ELECTION N3->>N5: ELECTION (no response) N4->>N5: ELECTION (no response) Note over N4: Highest alive,
becomes leader N4->>N1: COORDINATOR N4->>N2: COORDINATOR N4->>N3: COORDINATOR
Real-World Applications
This pattern appears in many distributed systems:
- Apache ZooKeeper: Leader election for distributed coordination
- etcd: Leader election in distributed key-value store
- MongoDB Replica Sets: Primary election
- RabbitMQ Clustering: Master node election
- Kubernetes: Controller leader election
- Elasticsearch: Master node election
Implementation in Go
package main
import (
"context"
"fmt"
"math/rand"
"sync"
"time"
)
// NodeState represents the current state of a node
type NodeState int
const (
Follower NodeState = iota
Candidate
Leader
)
func (s NodeState) String() string {
return []string{"Follower", "Candidate", "Leader"}[s]
}
// Message types
type MessageType string
const (
Election MessageType = "ELECTION"
OK MessageType = "OK"
Coordinator MessageType = "COORDINATOR"
Heartbeat MessageType = "HEARTBEAT"
)
// Message sent between nodes
type Message struct {
Type MessageType
SenderID int
Term int // Election term for detecting stale messages
}
// Node represents a process in the distributed system
type Node struct {
ID int
State NodeState
CurrentLeader int
IsAlive bool
Term int // Current election term
Peers []*Node
IncomingMessages chan Message
HeartbeatReceived time.Time
mu sync.Mutex
ctx context.Context
cancel context.CancelFunc
wg sync.WaitGroup
}
// BullySystem coordinates the nodes
type BullySystem struct {
Nodes []*Node
mu sync.Mutex
Elections int // Track number of elections
}
func NewBullySystem(numNodes int) *BullySystem {
nodes := make([]*Node, numNodes)
for i := 0; i < numNodes; i++ {
ctx, cancel := context.WithCancel(context.Background())
nodes[i] = &Node{
ID: i,
State: Follower,
CurrentLeader: -1,
IsAlive: true,
IncomingMessages: make(chan Message, 100),
HeartbeatReceived: time.Now(),
ctx: ctx,
cancel: cancel,
}
}
// Set up peer connections
for _, node := range nodes {
node.Peers = nodes
}
return &BullySystem{
Nodes: nodes,
}
}
// Start begins node operations
func (n *Node) Start() {
n.wg.Add(1)
go n.run()
}
// Stop gracefully stops the node
func (n *Node) Stop() {
n.cancel()
n.wg.Wait()
}
// run is the main event loop for the node
func (n *Node) run() {
defer n.wg.Done()
heartbeatTicker := time.NewTicker(500 * time.Millisecond)
defer heartbeatTicker.Stop()
electionTimeout := time.NewTicker(2 * time.Second)
defer electionTimeout.Stop()
for {
select {
case <-n.ctx.Done():
return
case msg := <-n.IncomingMessages:
n.handleMessage(msg)
case <-heartbeatTicker.C:
n.sendHeartbeat()
case <-electionTimeout.C:
n.checkLeader()
}
}
}
// handleMessage processes incoming messages
func (n *Node) handleMessage(msg Message) {
n.mu.Lock()
defer n.mu.Unlock()
if !n.IsAlive {
return
}
switch msg.Type {
case Election:
// Received election from lower-ID node
fmt.Printf("[Node %d] 📨 Received ELECTION from Node %d\n", n.ID, msg.SenderID)
if n.ID > msg.SenderID {
// Send OK response
n.sendTo(msg.SenderID, Message{
Type: OK,
SenderID: n.ID,
Term: n.Term,
})
fmt.Printf("[Node %d] ✉️ Sent OK to Node %d\n", n.ID, msg.SenderID)
// Start own election
go n.StartElection()
}
case OK:
// Higher-ID node responded
fmt.Printf("[Node %d] ✅ Received OK from Node %d (higher priority)\n", n.ID, msg.SenderID)
n.State = Follower
case Coordinator:
// New leader announcement
fmt.Printf("[Node %d] 👑 Node %d is the new COORDINATOR\n", n.ID, msg.SenderID)
n.CurrentLeader = msg.SenderID
n.State = Follower
n.Term = msg.Term
n.HeartbeatReceived = time.Now()
case Heartbeat:
// Leader heartbeat
if msg.SenderID == n.CurrentLeader {
n.HeartbeatReceived = time.Now()
}
}
}
// StartElection initiates an election
func (n *Node) StartElection() {
n.mu.Lock()
if !n.IsAlive || n.State == Candidate {
n.mu.Unlock()
return
}
n.State = Candidate
n.Term++
term := n.Term
n.mu.Unlock()
fmt.Printf("\n[Node %d] 🗳️ Starting ELECTION (term %d)\n", n.ID, term)
// Send ELECTION to all higher-ID nodes
higherNodes := n.getHigherNodes()
if len(higherNodes) == 0 {
// No higher nodes, become leader
n.becomeLeader()
return
}
// Send election messages
for _, node := range higherNodes {
n.sendTo(node.ID, Message{
Type: Election,
SenderID: n.ID,
Term: term,
})
}
// Wait for OK responses (with timeout)
time.Sleep(1 * time.Second)
n.mu.Lock()
defer n.mu.Unlock()
// If still candidate, no higher node responded
if n.State == Candidate {
n.mu.Unlock()
n.becomeLeader()
n.mu.Lock()
}
}
// becomeLeader transitions node to leader
func (n *Node) becomeLeader() {
n.mu.Lock()
n.State = Leader
n.CurrentLeader = n.ID
term := n.Term
n.mu.Unlock()
fmt.Printf("\n[Node %d] 🎖️ Became LEADER (term %d)\n", n.ID, term)
// Broadcast coordinator message
for _, peer := range n.Peers {
if peer.ID != n.ID {
n.sendTo(peer.ID, Message{
Type: Coordinator,
SenderID: n.ID,
Term: term,
})
}
}
}
// sendHeartbeat sends periodic heartbeats if leader
func (n *Node) sendHeartbeat() {
n.mu.Lock()
defer n.mu.Unlock()
if !n.IsAlive || n.State != Leader {
return
}
for _, peer := range n.Peers {
if peer.ID != n.ID && peer.IsAlive {
n.sendTo(peer.ID, Message{
Type: Heartbeat,
SenderID: n.ID,
Term: n.Term,
})
}
}
}
// checkLeader checks if leader is still alive
func (n *Node) checkLeader() {
n.mu.Lock()
defer n.mu.Unlock()
if !n.IsAlive || n.State == Leader {
return
}
// Check if heartbeat timeout
if time.Since(n.HeartbeatReceived) > 3*time.Second && n.CurrentLeader != -1 {
fmt.Printf("\n[Node %d] ⚠️ Leader timeout detected!\n", n.ID)
n.mu.Unlock()
n.StartElection()
n.mu.Lock()
}
}
// getHigherNodes returns all nodes with higher IDs
func (n *Node) getHigherNodes() []*Node {
var higher []*Node
for _, peer := range n.Peers {
peer.mu.Lock()
if peer.ID > n.ID && peer.IsAlive {
higher = append(higher, peer)
}
peer.mu.Unlock()
}
return higher
}
// sendTo sends a message to a specific node
func (n *Node) sendTo(targetID int, msg Message) {
if targetID < 0 || targetID >= len(n.Peers) {
return
}
target := n.Peers[targetID]
target.mu.Lock()
defer target.mu.Unlock()
if target.IsAlive {
select {
case target.IncomingMessages <- msg:
case <-time.After(100 * time.Millisecond):
// Message dropped
}
}
}
// Crash simulates node failure
func (n *Node) Crash() {
n.mu.Lock()
defer n.mu.Unlock()
if !n.IsAlive {
return
}
fmt.Printf("\n[Node %d] 💥 CRASHED\n", n.ID)
n.IsAlive = false
n.State = Follower
}
// Recover simulates node recovery
func (n *Node) Recover() {
n.mu.Lock()
defer n.mu.Unlock()
if n.IsAlive {
return
}
fmt.Printf("\n[Node %d] 🔄 RECOVERED\n", n.ID)
n.IsAlive = true
n.CurrentLeader = -1
n.HeartbeatReceived = time.Now()
// Start election after recovery
go func() {
time.Sleep(500 * time.Millisecond)
n.StartElection()
}()
}
// GetLeader returns current leader info
func (bs *BullySystem) GetLeader() (leaderID int, found bool) {
bs.mu.Lock()
defer bs.mu.Unlock()
for _, node := range bs.Nodes {
node.mu.Lock()
if node.IsAlive && node.State == Leader {
node.mu.Unlock()
return node.ID, true
}
node.mu.Unlock()
}
return -1, false
}
// PrintStatus prints system status
func (bs *BullySystem) PrintStatus() {
fmt.Println("\n📊 System Status:")
fmt.Println(" Node | State | Leader | Alive")
fmt.Println(" -----|-----------|--------|------")
for _, node := range bs.Nodes {
node.mu.Lock()
fmt.Printf(" %4d | %-9s | %6d | %v\n",
node.ID, node.State, node.CurrentLeader, node.IsAlive)
node.mu.Unlock()
}
}
func main() {
rand.Seed(42)
fmt.Println("=== Bully Election Algorithm ===\n")
const numNodes = 5
system := NewBullySystem(numNodes)
// Start all nodes
fmt.Println("🚀 Starting all nodes...")
for _, node := range system.Nodes {
node.Start()
}
// Wait for initial election
time.Sleep(1 * time.Second)
system.PrintStatus()
// Node 4 (highest) should be leader
if leader, found := system.GetLeader(); found {
fmt.Printf("\n✅ Initial leader: Node %d\n", leader)
}
// Scenario 1: Leader crashes
fmt.Println("\n\n" + "=== Scenario 1: Leader Crash ===" + "\n")
time.Sleep(1 * time.Second)
system.Nodes[4].Crash()
time.Sleep(3 * time.Second)
system.PrintStatus()
if leader, found := system.GetLeader(); found {
fmt.Printf("\n✅ New leader after crash: Node %d\n", leader)
}
// Scenario 2: Original leader recovers
fmt.Println("\n\n" + "=== Scenario 2: Leader Recovery ===" + "\n")
time.Sleep(1 * time.Second)
system.Nodes[4].Recover()
time.Sleep(3 * time.Second)
system.PrintStatus()
if leader, found := system.GetLeader(); found {
fmt.Printf("\n✅ Leader after recovery: Node %d\n", leader)
}
// Scenario 3: Multiple simultaneous failures
fmt.Println("\n\n" + "=== Scenario 3: Multiple Failures ===" + "\n")
time.Sleep(1 * time.Second)
system.Nodes[4].Crash()
system.Nodes[3].Crash()
time.Sleep(3 * time.Second)
system.PrintStatus()
if leader, found := system.GetLeader(); found {
fmt.Printf("\n✅ Leader after multiple failures: Node %d\n", leader)
}
// Cleanup
fmt.Println("\n\n🛑 Shutting down...")
for _, node := range system.Nodes {
node.Stop()
}
fmt.Println("\n✓ Simulation complete!")
}
How It Works
1. Election Initiation
graph TD
A[Node detects
leader failure] --> B{Am I highest
alive node?} B -->|Yes| C[Become leader] B -->|No| D[Send ELECTION
to higher nodes] D --> E{Received OK?} E -->|Yes| F[Wait for
COORDINATOR] E -->|No| C
leader failure] --> B{Am I highest
alive node?} B -->|Yes| C[Become leader] B -->|No| D[Send ELECTION
to higher nodes] D --> E{Received OK?} E -->|Yes| F[Wait for
COORDINATOR] E -->|No| C
2. Message Flow
// 1. Detect failure
if time.Since(lastHeartbeat) > timeout {
startElection()
}
// 2. Send ELECTION to higher nodes
for node := range higherNodes {
send(node, ELECTION)
}
// 3. If OK received → wait
// 4. If timeout → become leader
if !receivedOK {
becomeLeader()
}
3. Handling Simultaneous Elections
When multiple nodes start elections simultaneously:
// Node 2 starts election
// Node 3 starts election
// Both send to higher nodes
// Node 3 receives Node 2's ELECTION
// Responds with OK
// Node 2 defers to Node 3
// Node 4 receives both elections
// Responds OK to both
// Starts own election
// Becomes leader (highest)
Advanced: Ring-Based Bully Election
Optimize message complexity using a ring topology:
package main
import (
"fmt"
"sync"
)
// RingNode represents a node in a ring topology
type RingNode struct {
ID int
IsAlive bool
Next *RingNode
LeaderID int
mu sync.Mutex
}
type RingElection struct {
Nodes []*RingNode
}
func NewRingElection(numNodes int) *RingElection {
nodes := make([]*RingNode, numNodes)
// Create nodes
for i := 0; i < numNodes; i++ {
nodes[i] = &RingNode{
ID: i,
IsAlive: true,
LeaderID: -1,
}
}
// Create ring
for i := 0; i < numNodes; i++ {
nodes[i].Next = nodes[(i+1)%numNodes]
}
return &RingElection{Nodes: nodes}
}
// StartElection initiates ring-based election
func (rn *RingNode) StartElection() int {
fmt.Printf("\n[Node %d] 🗳️ Starting ring election\n", rn.ID)
// Pass token with own ID
participantIDs := []int{rn.ID}
// Forward around ring
current := rn.Next
for current.ID != rn.ID {
current.mu.Lock()
if current.IsAlive {
participantIDs = append(participantIDs, current.ID)
fmt.Printf("[Node %d] → Adding self to election\n", current.ID)
}
current.mu.Unlock()
current = current.Next
}
// Find highest ID
maxID := rn.ID
for _, id := range participantIDs {
if id > maxID {
maxID = id
}
}
// Announce leader to all
current = rn
for {
current.mu.Lock()
current.LeaderID = maxID
current.mu.Unlock()
current = current.Next
if current.ID == rn.ID {
break
}
}
fmt.Printf("\n[Ring] 👑 Node %d elected leader\n", maxID)
return maxID
}
func RunRingElection() {
fmt.Println("\n=== Ring-Based Election ===\n")
re := NewRingElection(5)
// Simulate some node failures
re.Nodes[4].IsAlive = false // Highest node down
re.Nodes[3].IsAlive = false
// Start election from node 1
leader := re.Nodes[1].StartElection()
fmt.Printf("\n✅ Elected leader: Node %d\n", leader)
fmt.Println("\nNode States:")
for _, node := range re.Nodes {
fmt.Printf(" Node %d: Alive=%v, Leader=%d\n",
node.ID, node.IsAlive, node.LeaderID)
}
}
func main() {
RunRingElection()
}
Key Concepts
1. Bully Property
The highest-ID process always wins:
- Simple and deterministic
- No ties possible
- Predictable leader
2. Message Complexity
Worst case (Node 0 starts election):
- O(n²) messages total
- Node i sends to (n-i) higher nodes
- Total: n(n-1)/2
Best case (highest node starts):
- O(n) messages
- Single broadcast to all nodes
3. Failure Detection
// Heartbeat-based detection
ticker := time.NewTicker(heartbeatInterval)
for {
select {
case <-ticker.C:
if leader {
sendHeartbeat()
} else {
checkHeartbeat()
}
}
}
Real-World Example: Database Primary Election
package main
import (
"fmt"
"sync"
"time"
)
type DatabaseReplica struct {
ID int
IsPrimary bool
IsOnline bool
DataLag time.Duration // Replication lag
Priority int // Configured priority
mu sync.Mutex
}
type DatabaseCluster struct {
Replicas []*DatabaseReplica
mu sync.Mutex
}
func (dc *DatabaseCluster) ElectPrimary() *DatabaseReplica {
dc.mu.Lock()
defer dc.mu.Unlock()
fmt.Println("\n🗳️ Electing database primary...")
// Find eligible candidates
var candidates []*DatabaseReplica
for _, replica := range dc.Replicas {
replica.mu.Lock()
if replica.IsOnline && replica.DataLag < 5*time.Second {
candidates = append(candidates, replica)
fmt.Printf(" Candidate: Replica %d (lag: %v, priority: %d)\n",
replica.ID, replica.DataLag, replica.Priority)
}
replica.mu.Unlock()
}
if len(candidates) == 0 {
fmt.Println("❌ No eligible candidates!")
return nil
}
// Elect based on: 1) Priority, 2) Lowest lag, 3) Highest ID
winner := candidates[0]
for _, candidate := range candidates[1:] {
if candidate.Priority > winner.Priority {
winner = candidate
} else if candidate.Priority == winner.Priority {
if candidate.DataLag < winner.DataLag {
winner = candidate
} else if candidate.DataLag == winner.DataLag && candidate.ID > winner.ID {
winner = candidate
}
}
}
// Set as primary
for _, replica := range dc.Replicas {
replica.mu.Lock()
replica.IsPrimary = (replica.ID == winner.ID)
replica.mu.Unlock()
}
winner.mu.Lock()
winner.IsPrimary = true
winner.mu.Unlock()
fmt.Printf("\n👑 Replica %d elected as PRIMARY\n", winner.ID)
return winner
}
func RunDatabaseElection() {
fmt.Println("=== Database Primary Election ===")
cluster := &DatabaseCluster{
Replicas: []*DatabaseReplica{
{ID: 0, IsOnline: true, DataLag: 100 * time.Millisecond, Priority: 1},
{ID: 1, IsOnline: true, DataLag: 50 * time.Millisecond, Priority: 2},
{ID: 2, IsOnline: true, DataLag: 200 * time.Millisecond, Priority: 1},
{ID: 3, IsOnline: false, DataLag: 0, Priority: 3}, // Offline
},
}
primary := cluster.ElectPrimary()
fmt.Println("\nCluster State:")
for _, replica := range cluster.Replicas {
replica.mu.Lock()
role := "Secondary"
if replica.IsPrimary {
role = "PRIMARY"
}
fmt.Printf(" Replica %d: %s (online=%v, lag=%v, priority=%d)\n",
replica.ID, role, replica.IsOnline, replica.DataLag, replica.Priority)
replica.mu.Unlock()
}
fmt.Printf("\n✅ Primary: Replica %d\n", primary.ID)
}
func main() {
RunDatabaseElection()
}
Performance Characteristics
| Metric | Value |
|---|---|
| Messages | O(n²) worst case |
| Latency | 2 RTT (election + announcement) |
| Elections | O(n) per failure |
| Scalability | Poor for large n |
| Determinism | Perfect (highest always wins) |
Optimizations
1. Fast Path Election
// If node knows it's highest, immediately announce
if n.ID == maxKnownID {
n.becomeLeader()
return
}
2. Election Suppression
// Ignore duplicate elections during ongoing election
if n.electionInProgress {
return
}
3. Priority-Based IDs
// Assign IDs based on:
// - Hardware capability
// - Network location
// - Data freshness
priority := (hardware * 1000) + (location * 100) + ID
When to Use
✓ Use when:
- Small to medium cluster size
- Clear priority ordering exists
- Network is reliable
- Deterministic leader selection needed
- Simplicity is valued
✗ Avoid when:
- Large clusters (O(n²) messages)
- Network is unreliable
- Frequent failures occur
- Need partition tolerance
- Use Raft/Paxos instead for production
Try It Yourself
- Add priorities - Not just ID-based election
- Implement ring version - Reduce message complexity
- Add network delays - Simulate real networks
- Test split-brain - Handle network partitions
- Compare with Raft - Understand trade-offs
This is part 18 of “Golang Experiments: Classic Concurrency Problems”
Further Reading
- Original Paper: “Elections in a Distributed Computing System” - Garcia-Molina (1982)
- Raft Consensus: More robust alternative
- ZooKeeper: Production leader election
- Chubby: Google’s distributed lock service