The Problem: Computing Pi by Throwing Darts

Imagine a square dartboard with a circle inscribed inside it. Throw random darts at the square. The ratio of darts landing inside the circle to total darts thrown approaches π/4.

Why? Mathematics:

  • Square side length: 2 (from -1 to 1)
  • Square area: 4
  • Circle radius: 1
  • Circle area: π × 1² = π
  • Ratio: π/4

Throw 1 million darts, multiply by 4, and you’ve estimated π. More darts = better estimate. This is Monte Carlo simulation: using randomness to solve deterministic problems.

The beauty? Each dart throw is independent. Embarrassingly parallel—the holy grail of concurrent programming.

The Sequential Implementation

package main

import (
    "fmt"
    "math"
    "math/rand"
    "time"
)

func estimatePi(samples int) float64 {
    inside := 0

    for i := 0; i < samples; i++ {
        x := rand.Float64()*2 - 1 // Random x in [-1, 1]
        y := rand.Float64()*2 - 1 // Random y in [-1, 1]

        // Check if point is inside unit circle
        if x*x+y*y <= 1 {
            inside++
        }
    }

    return 4 * float64(inside) / float64(samples)
}

func main() {
    rand.Seed(time.Now().UnixNano())

    samples := 1000000
    start := time.Now()

    pi := estimatePi(samples)

    duration := time.Since(start)
    diff := math.Abs(pi - math.Pi) // avoid shadowing the built-in error type

    fmt.Printf("Estimated π: %f\n", pi)
    fmt.Printf("Actual π:    %f\n", math.Pi)
    fmt.Printf("Error:       %f\n", diff)
    fmt.Printf("Time:        %v\n", duration)
}

Output:

Estimated π: 3.141273
Actual π:    3.141593
Error:       0.000320
Time:        12.5ms

Not bad! But we can do better with parallelism.

The Naive Parallel Implementation

import "sync"

func estimatePiParallel(samples, workers int) float64 {
    samplesPerWorker := samples / workers
    results := make(chan int, workers)
    var wg sync.WaitGroup

    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()

            inside := 0
            for j := 0; j < samplesPerWorker; j++ {
                x := rand.Float64()*2 - 1
                y := rand.Float64()*2 - 1
                if x*x+y*y <= 1 {
                    inside++
                }
            }

            results <- inside
        }()
    }

    go func() {
        wg.Wait()
        close(results)
    }()

    totalInside := 0
    for inside := range results {
        totalInside += inside
    }

    return 4 * float64(totalInside) / float64(samples)
}

Benchmark:

func BenchmarkSequential(b *testing.B) {
    rand.Seed(1)
    for i := 0; i < b.N; i++ {
        estimatePi(1000000)
    }
}

func BenchmarkParallel(b *testing.B) {
    rand.Seed(1)
    for i := 0; i < b.N; i++ {
        estimatePiParallel(1000000, 8)
    }
}

Results (8 cores):

BenchmarkSequential-8    100    12.5 ms/op
BenchmarkParallel-8      300     4.2 ms/op

3x speedup! But wait… there’s a problem.

The Hidden Bug: Random Number Correlation

Run multiple times:

for i := 0; i < 5; i++ {
    pi := estimatePiParallel(1000000, 8)
    fmt.Printf("Run %d: %f\n", i, pi)
}

Output:

Run 0: 3.141200
Run 1: 3.141200
Run 2: 3.141200
Run 3: 3.141200
Run 4: 3.141200

Identical results! The random numbers aren’t random—they’re correlated.

Why? Global Random State

rand.Float64() draws from a single global generator guarded by a mutex. All goroutines share that one stream: every call is serialized through the lock, and the values each goroutine sees depend on how the scheduler interleaves the workers, so the per-goroutine sequences are coupled rather than independent.

Proof:

func TestRandomCorrelation(t *testing.T) {
    rand.Seed(1)

    var wg sync.WaitGroup
    results := make([][]float64, 4)

    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            results[id] = make([]float64, 10)
            for j := 0; j < 10; j++ {
                results[id][j] = rand.Float64()
            }
        }(i)
    }

    wg.Wait()

    // Check for identical sequences
    for i := 1; i < len(results); i++ {
        if reflect.DeepEqual(results[0], results[i]) {
            t.Errorf("Goroutine 0 and %d have identical sequences!", i)
        }
    }
}

This test fails! Goroutines get correlated random numbers.

Solution 1: Per-Goroutine Random Source

func estimatePiParallelFixed(samples, workers int) float64 {
    samplesPerWorker := samples / workers
    results := make(chan int, workers)
    var wg sync.WaitGroup

    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func(seed int64) {
            defer wg.Done()

            // Each goroutine has its own random source
            rng := rand.New(rand.NewSource(seed))

            inside := 0
            for j := 0; j < samplesPerWorker; j++ {
                x := rng.Float64()*2 - 1
                y := rng.Float64()*2 - 1
                if x*x+y*y <= 1 {
                    inside++
                }
            }

            results <- inside
        }(time.Now().UnixNano() + int64(i))
    }

    go func() {
        wg.Wait()
        close(results)
    }()

    totalInside := 0
    for inside := range results {
        totalInside += inside
    }

    return 4 * float64(totalInside) / float64(samples)
}

Key change: rand.New(rand.NewSource(seed)) creates an independent random generator for each goroutine.

Now runs produce different results:

Run 0: 3.141592
Run 1: 3.142104
Run 2: 3.141088
Run 3: 3.141796
Run 4: 3.140912

Much better!

Solution 2: Worker Pool Pattern

Instead of tying each goroutine to one fixed slice of the work, use a worker pool that pulls tasks from a channel:

type task struct {
    samples int
    seed    int64
}

type result struct {
    inside int
}

func estimatePiWorkerPool(samples, workers int) float64 {
    tasks := make(chan task, workers)
    results := make(chan result, workers)

    // Start workers
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go worker(tasks, results, &wg)
    }

    // Send tasks
    samplesPerWorker := samples / workers
    go func() {
        for i := 0; i < workers; i++ {
            tasks <- task{
                samples: samplesPerWorker,
                seed:    time.Now().UnixNano() + int64(i),
            }
        }
        close(tasks)
    }()

    // Close results when workers done
    go func() {
        wg.Wait()
        close(results)
    }()

    // Collect results
    totalInside := 0
    for r := range results {
        totalInside += r.inside
    }

    return 4 * float64(totalInside) / float64(samples)
}

func worker(tasks <-chan task, results chan<- result, wg *sync.WaitGroup) {
    defer wg.Done()

    for t := range tasks {
        rng := rand.New(rand.NewSource(t.seed))

        inside := 0
        for i := 0; i < t.samples; i++ {
            x := rng.Float64()*2 - 1
            y := rng.Float64()*2 - 1
            if x*x+y*y <= 1 {
                inside++
            }
        }

        results <- result{inside: inside}
    }
}

This pattern is more flexible—you can dynamically add work without creating more goroutines.
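
For example, you can split the same total into many small tasks and let idle workers pull the next one, so faster workers simply process more chunks. Below is a minimal sketch reusing the task, result, and worker definitions above; the chunks parameter and the estimatePiManyTasks name are mine, chosen for illustration.

// Sketch: same worker pool, but the work is split into many small tasks
// rather than exactly one task per worker. Idle workers keep pulling tasks
// from the channel, so the load balances itself.
func estimatePiManyTasks(samples, workers, chunks int) float64 {
    tasks := make(chan task, chunks)
    results := make(chan result, chunks)

    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go worker(tasks, results, &wg)
    }

    samplesPerChunk := samples / chunks
    for i := 0; i < chunks; i++ {
        tasks <- task{
            samples: samplesPerChunk,
            seed:    time.Now().UnixNano() + int64(i),
        }
    }
    close(tasks)

    go func() {
        wg.Wait()
        close(results)
    }()

    totalInside := 0
    for r := range results {
        totalInside += r.inside
    }

    return 4 * float64(totalInside) / float64(samplesPerChunk*chunks)
}

With something like chunks = 8*workers, a worker that gets descheduled for a while no longer stalls a whole eighth of the job.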

Optimization: Batch Communication

If each worker reported every sample's outcome over the channel, the per-send overhead would dominate the actual work. Instead, accumulate counts locally and send them in batches:

type batchResult struct {
    inside int
    total  int
}

func estimatePiBatched(samples, workers, batchSize int) float64 {
    results := make(chan batchResult, workers)
    var wg sync.WaitGroup

    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func(seed int64) {
            defer wg.Done()

            rng := rand.New(rand.NewSource(seed))
            inside := 0
            total := 0

            samplesPerWorker := samples / workers
            for j := 0; j < samplesPerWorker; j++ {
                x := rng.Float64()*2 - 1
                y := rng.Float64()*2 - 1
                if x*x+y*y <= 1 {
                    inside++
                }
                total++

                // Send batch every batchSize samples
                if total%batchSize == 0 {
                    results <- batchResult{inside, total}
                    inside = 0
                    total = 0
                }
            }

            // Send remaining
            if total > 0 {
                results <- batchResult{inside, total}
            }
        }(time.Now().UnixNano() + int64(i))
    }

    go func() {
        wg.Wait()
        close(results)
    }()

    totalInside := 0
    totalSamples := 0
    for r := range results {
        totalInside += r.inside
        totalSamples += r.total
    }

    return 4 * float64(totalInside) / float64(totalSamples)
}

Benchmark:

BenchmarkParallelFixed-8     300     4.2 ms/op
BenchmarkBatched-8           350     3.8 ms/op

Modest improvement, but batching reduces channel contention.

Convergence Analysis

How does accuracy improve with more samples?

func analyzeConvergence() {
    sampleCounts := []int{100, 1000, 10000, 100000, 1000000, 10000000}

    for _, samples := range sampleCounts {
        errors := make([]float64, 100)

        for i := 0; i < 100; i++ {
            pi := estimatePiParallelFixed(samples, 8)
            errors[i] = math.Abs(pi - math.Pi)
        }

        mean := 0.0
        for _, e := range errors {
            mean += e
        }
        mean /= float64(len(errors))

        variance := 0.0
        for _, e := range errors {
            variance += (e - mean) * (e - mean)
        }
        variance /= float64(len(errors))
        stddev := math.Sqrt(variance)

        fmt.Printf("Samples: %10d | Mean Error: %.6f | Std Dev: %.6f\n",
            samples, mean, stddev)
    }
}

Output:

Samples:        100 | Mean Error: 0.048123 | Std Dev: 0.037821
Samples:       1000 | Mean Error: 0.014562 | Std Dev: 0.011943
Samples:      10000 | Mean Error: 0.004712 | Std Dev: 0.003829
Samples:     100000 | Mean Error: 0.001498 | Std Dev: 0.001201
Samples:    1000000 | Mean Error: 0.000473 | Std Dev: 0.000381
Samples:   10000000 | Mean Error: 0.000149 | Std Dev: 0.000120

Error decreases as 1/√N. To halve the error, you need 4x the samples. This is the fundamental property of Monte Carlo methods.
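
You can check this scaling empirically: quadruple the sample count and the mean error should roughly halve. Below is a minimal sketch reusing estimatePiParallelFixed; the function name checkConvergenceRate and the trial count of 50 are mine, chosen for illustration.

// Sketch: empirical check of the 1/√N rule. For each N, compare the mean
// error at N samples with the mean error at 4N samples; the ratio should
// hover around 2 if the error shrinks like 1/√N.
func checkConvergenceRate() {
    trials := 50
    meanError := func(samples int) float64 {
        sum := 0.0
        for i := 0; i < trials; i++ {
            sum += math.Abs(estimatePiParallelFixed(samples, 8) - math.Pi)
        }
        return sum / float64(trials)
    }

    for _, n := range []int{10000, 40000, 160000} {
        e1 := meanError(n)
        e4 := meanError(4 * n)
        fmt.Printf("N=%7d: %.5f | N=%7d: %.5f | ratio: %.2f\n",
            n, e1, 4*n, e4, e1/e4)
    }
}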

Visualizing the Estimation

func visualizePi(samples int) {
    rng := rand.New(rand.NewSource(time.Now().UnixNano()))

    fmt.Println("Dart positions (first 50):")
    inside := 0
    for i := 0; i < samples; i++ {
        x := rng.Float64()*2 - 1
        y := rng.Float64()*2 - 1

        inCircle := x*x+y*y <= 1
        if inCircle {
            inside++
        }

        if i < 50 {
            marker := "o" // inside
            if !inCircle {
                marker = "x" // outside
            }
            fmt.Printf("(%6.3f, %6.3f) %s\n", x, y, marker)
        }
    }

    pi := 4 * float64(inside) / float64(samples)
    fmt.Printf("\nEstimated π: %f (from %d samples)\n", pi, samples)
}

Real-World Parallel Performance

func scalabilityTest() {
    samples := 100000000 // 100 million
    workerCounts := []int{1, 2, 4, 8, 16}

    fmt.Println("Workers | Time      | Speedup")
    fmt.Println("--------|-----------|--------")

    baseTime := 0.0

    for _, workers := range workerCounts {
        start := time.Now()
        estimatePiParallelFixed(samples, workers)
        duration := time.Since(start).Seconds()

        if workers == 1 {
            baseTime = duration
        }

        speedup := baseTime / duration

        fmt.Printf("%7d | %8.3fs | %.2fx\n", workers, duration, speedup)
    }
}

Output (8-core machine):

Workers | Time      | Speedup
--------|-----------|--------
      1 |    2.450s | 1.00x
      2 |    1.280s | 1.91x
      4 |    0.670s | 3.66x
      8 |    0.390s | 6.28x
     16 |    0.410s | 5.98x

Observations:

  • Near-linear scaling up to core count (8)
  • Beyond physical cores, speedup plateaus (or decreases due to overhead)
  • This is embarrassingly parallel at its finest!

Advanced: Quasi-Random Sequences

Random numbers have clustering. Quasi-random sequences (low-discrepancy sequences) cover the space more uniformly:

// Halton sequence (simplified)
func haltonSequence(index, base int) float64 {
    result := 0.0
    f := 1.0 / float64(base)
    i := index

    for i > 0 {
        result += f * float64(i%base)
        i /= base
        f /= float64(base)
    }

    return result
}

func estimatePiQuasiRandom(samples int) float64 {
    inside := 0

    for i := 0; i < samples; i++ {
        // Use Halton sequence with bases 2 and 3
        x := haltonSequence(i, 2)*2 - 1
        y := haltonSequence(i, 3)*2 - 1

        if x*x+y*y <= 1 {
            inside++
        }
    }

    return 4 * float64(inside) / float64(samples)
}

Comparison (10,000 samples):

Monte Carlo:    π ≈ 3.1372 (error: 0.0044)
Quasi-Random:   π ≈ 3.1420 (error: 0.0004)

Quasi-random sequences converge much faster, but they are trickier to parallelize: there is one shared, ordered index sequence rather than independent random streams, so workers have to partition the index space instead of simply seeding their own generators.
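
One way to do that, sketched below and reusing haltonSequence from above, is to give each worker a contiguous block of indices; the function name estimatePiQuasiRandomParallel is mine.

// Sketch: parallel Halton estimator. Each worker evaluates a contiguous
// block of indices; together the blocks cover the same index range as the
// sequential loop (any remainder is dropped for brevity), so the points
// evaluated are the same low-discrepancy points.
func estimatePiQuasiRandomParallel(samples, workers int) float64 {
    per := samples / workers
    results := make(chan int, workers)

    for w := 0; w < workers; w++ {
        go func(start int) {
            inside := 0
            for i := start; i < start+per; i++ {
                x := haltonSequence(i, 2)*2 - 1
                y := haltonSequence(i, 3)*2 - 1
                if x*x+y*y <= 1 {
                    inside++
                }
            }
            results <- inside
        }(w * per)
    }

    totalInside := 0
    for w := 0; w < workers; w++ {
        totalInside += <-results
    }

    return 4 * float64(totalInside) / float64(per*workers)
}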

Testing Strategy

func TestPiEstimationAccuracy(t *testing.T) {
    samples := 1000000
    trials := 100
    tolerance := 0.01 // 1% error tolerance

    errors := make([]float64, trials)

    for i := 0; i < trials; i++ {
        pi := estimatePiParallelFixed(samples, 8)
        errors[i] = math.Abs(pi - math.Pi)
    }

    // Calculate mean error
    mean := 0.0
    for _, e := range errors {
        mean += e
    }
    mean /= float64(trials)

    if mean > tolerance {
        t.Errorf("Mean error %.4f exceeds tolerance %.4f", mean, tolerance)
    }

    // Check that most trials are within tolerance
    withinTolerance := 0
    for _, e := range errors {
        if e <= tolerance {
            withinTolerance++
        }
    }

    if float64(withinTolerance)/float64(trials) < 0.95 {
        t.Errorf("Only %d%% of trials within tolerance (expected >95%%)",
            withinTolerance*100/trials)
    }
}

Real-World Applications

  1. Physics Simulations: Particle physics, radiation transport
  2. Finance: Option pricing, risk assessment, portfolio optimization
  3. Computer Graphics: Global illumination, path tracing
  4. Machine Learning: Reinforcement learning, Bayesian inference
  5. Operations Research: Queueing theory, inventory management

Common Pitfalls

  1. Using global random generator: Causes correlation
  2. Forgetting to seed: Results are deterministic
  3. Insufficient samples: Monte Carlo is slow to converge
  4. Over-parallelizing: Beyond core count, overhead dominates
  5. Ignoring variance: One run isn’t enough; measure distribution

Best Practices

  1. Per-goroutine random sources: Avoid correlation
  2. Batch communication: Reduce channel overhead
  3. Match workers to cores: Diminishing returns beyond physical cores (see the sketch after this list)
  4. Measure convergence: Plot error vs. samples
  5. Run multiple trials: Monte Carlo is probabilistic
  6. Consider quasi-random: For faster convergence (if applicable)
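
For item 3, a minimal sketch of choosing the worker count from the machine; the helper name defaultWorkers is mine.

import "runtime"

// defaultWorkers matches the worker count to the machine. runtime.NumCPU
// reports the number of logical CPUs available to the process; for CPU-bound
// work like this, going far beyond it mostly adds scheduling overhead.
func defaultWorkers() int {
    return runtime.NumCPU()
}

// Usage:
//   pi := estimatePiParallelFixed(100000000, defaultWorkers())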

Conclusion

Monte Carlo Pi estimation is a perfect introduction to parallel programming:

  • Embarrassingly parallel: No synchronization needed
  • Easy to understand: Throw darts, count hits
  • Measurable improvement: Clear speedup with more cores
  • Surprising depth: Random number correlation, convergence analysis

The key lessons:

  • Independence is gold: No shared state = perfect parallelism
  • Random isn’t always random: Goroutines can get correlated sequences
  • Convergence is slow: Monte Carlo needs lots of samples
  • Parallelism has limits: Beyond core count, overhead dominates

Monte Carlo methods power everything from financial models to video game graphics. Understanding how to parallelize them efficiently is a fundamental skill in modern computing.

And remember: You can estimate π by throwing darts, but Go makes it fast.


← Login Counter | Series Overview | Sieve of Eratosthenes →