The Problem: Rendering Fractals in Parallel
The Mandelbrot set is defined by a simple iterative formula:
- Start with z = 0
- Repeatedly compute z = z² + c
- If |z| exceeds 2, the point escapes (not in the set)
- Color each pixel by iteration count
The beauty: Each pixel is completely independent. Perfect for parallelism!
The challenge: Some pixels escape in 5 iterations, others take 1000+. This creates load imbalance—some workers finish instantly while others grind away.
How do you distribute work when you don’t know how hard each piece is?
The Sequential Implementation
package main

import (
    "image"
    "image/color"
    "image/png"
    "math/cmplx"
    "os"
)

const (
    width   = 2048
    height  = 2048
    maxIter = 256

    xmin, xmax = -2.0, 1.0
    ymin, ymax = -1.5, 1.5
)

func mandelbrot(c complex128) int {
    z := complex(0, 0)
    for n := 0; n < maxIter; n++ {
        if cmplx.Abs(z) > 2 {
            return n
        }
        z = z*z + c
    }
    return maxIter
}

func renderSequential() *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    for py := 0; py < height; py++ {
        y := float64(py)/height*(ymax-ymin) + ymin
        for px := 0; px < width; px++ {
            x := float64(px)/width*(xmax-xmin) + xmin
            c := complex(x, y)
            iter := mandelbrot(c)
            img.Set(px, py, iterToColor(iter))
        }
    }
    return img
}

func iterToColor(iter int) color.Color {
    if iter == maxIter {
        return color.Black // In the set
    }
    // Color gradient based on iteration count
    v := uint8(iter * 255 / maxIter)
    return color.RGBA{v, v, 255 - v, 255}
}

func main() {
    img := renderSequential()
    f, _ := os.Create("mandelbrot.png")
    defer f.Close()
    png.Encode(f, img)
}
Time: ~850ms (for 2048×2048 image)
Each pixel requires computing the Mandelbrot iteration, completely independent of other pixels. Let’s parallelize!
Naive Parallel: Row-Based Partitioning
Divide rows among workers:
import "sync"

func renderParallelRows(workers int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    rowsPerWorker := height / workers

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)

        startRow := w * rowsPerWorker
        endRow := startRow + rowsPerWorker
        if w == workers-1 {
            endRow = height // Last worker takes remainder
        }

        go func(startRow, endRow int) {
            defer wg.Done()
            for py := startRow; py < endRow; py++ {
                y := float64(py)/height*(ymax-ymin) + ymin
                for px := 0; px < width; px++ {
                    x := float64(px)/width*(xmax-xmin) + xmin
                    c := complex(x, y)
                    iter := mandelbrot(c)
                    img.Set(px, py, iterToColor(iter))
                }
            }
        }(startRow, endRow)
    }

    wg.Wait()
    return img
}
Benchmark (8 workers):
Sequential: 850ms
Parallel Rows: 420ms
Speedup: 2.0x
Only 2x speedup on 8 cores? We expected ~8x! What’s wrong?
The Load Imbalance Problem
Let’s measure time per row:
import (
    "fmt"
    "time"
)

func measureRowTimes() {
    times := make([]time.Duration, height)

    for py := 0; py < height; py++ {
        start := time.Now()
        y := float64(py)/height*(ymax-ymin) + ymin
        for px := 0; px < width; px++ {
            x := float64(px)/width*(xmax-xmin) + xmin
            c := complex(x, y)
            mandelbrot(c)
        }
        times[py] = time.Since(start)
    }

    // Find min, max, avg
    min, max, sum := times[0], times[0], time.Duration(0)
    for _, t := range times {
        if t < min {
            min = t
        }
        if t > max {
            max = t
        }
        sum += t
    }
    avg := sum / time.Duration(len(times))

    fmt.Printf("Min: %v\n", min)
    fmt.Printf("Max: %v\n", max)
    fmt.Printf("Avg: %v\n", avg)
    fmt.Printf("Ratio: %.2fx\n", float64(max)/float64(min))
}
Output:
Min: 82µs
Max: 3.2ms
Avg: 415µs
Ratio: 39.02x
The slowest row takes 39x longer than the fastest! Some rows are mostly black (high iteration count), others escape quickly.
When we partition by rows, one worker might get all the slow rows while another gets all the fast rows. Load imbalance.
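To make the imbalance concrete, here is a small instrumented sketch (measureWorkerTimes is not part of the renderer above; it reuses mandelbrot and the package constants, and needs "fmt", "sync", and "time" imported) that times each worker's statically assigned band of rows:

// measureWorkerTimes mimics the static row partitioning and reports how long
// each worker's band of rows takes to compute.
func measureWorkerTimes(workers int) {
    rowsPerWorker := height / workers
    times := make([]time.Duration, workers)

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        startRow := w * rowsPerWorker
        endRow := startRow + rowsPerWorker
        if w == workers-1 {
            endRow = height
        }
        go func(w, startRow, endRow int) {
            defer wg.Done()
            start := time.Now()
            for py := startRow; py < endRow; py++ {
                y := float64(py)/height*(ymax-ymin) + ymin
                for px := 0; px < width; px++ {
                    x := float64(px)/width*(xmax-xmin) + xmin
                    mandelbrot(complex(x, y))
                }
            }
            times[w] = time.Since(start) // each goroutine writes only its own slot
        }(w, startRow, endRow)
    }
    wg.Wait()

    for w, t := range times {
        fmt.Printf("worker %d: %v\n", w, t)
    }
}

With static bands, the workers covering the middle of the image, where most points never escape, should take several times longer than the ones covering the top and bottom edges.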
Solution 1: Fine-Grained Work Distribution
Instead of static partitioning, use a work queue:
type pixel struct {
    x, y int
}

func renderWorkQueue(workers int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))

    // Channel of pixels to process
    work := make(chan pixel, 1000)

    var wg sync.WaitGroup

    // Start workers
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for p := range work {
                x := float64(p.x)/width*(xmax-xmin) + xmin
                y := float64(p.y)/height*(ymax-ymin) + ymin
                c := complex(x, y)
                iter := mandelbrot(c)
                img.Set(p.x, p.y, iterToColor(iter))
            }
        }()
    }

    // Send all pixels as work
    for py := 0; py < height; py++ {
        for px := 0; px < width; px++ {
            work <- pixel{px, py}
        }
    }
    close(work)

    wg.Wait()
    return img
}
Benchmark:
Sequential: 850ms
Parallel Rows: 420ms
Work Queue: 180ms
Speedup: 4.7x
Much better! By distributing pixels dynamically, workers stay busy. No worker sits idle while others grind.
But: We’re sending 4 million pixels through a channel. That’s overhead.
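How expensive is a channel send? A rough micro-benchmark sketch (BenchmarkPixelChannel is a hypothetical name; it goes in a _test.go file with "testing" imported) gives a ballpark for the per-item cost we pay roughly four million times:

// BenchmarkPixelChannel measures the cost of pushing one work item through a
// buffered channel to a single draining consumer. A rough sketch, not a model
// of the real renderer's contention with 8 receivers.
func BenchmarkPixelChannel(b *testing.B) {
    work := make(chan pixel, 1000)
    done := make(chan struct{})
    go func() {
        for range work {
        }
        close(done)
    }()

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        work <- pixel{i % width, (i / width) % height}
    }
    close(work)
    <-done
}

Whatever the exact number on your machine, multiplying it by four million work items is why pixel-level granularity gives back some of the speedup.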
Solution 2: Row-Based Work Queue
Balance between row partitioning and pixel-level:
func renderRowQueue(workers int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    rows := make(chan int, height)

    var wg sync.WaitGroup

    // Start workers
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for py := range rows {
                y := float64(py)/height*(ymax-ymin) + ymin
                for px := 0; px < width; px++ {
                    x := float64(px)/width*(xmax-xmin) + xmin
                    c := complex(x, y)
                    iter := mandelbrot(c)
                    img.Set(px, py, iterToColor(iter))
                }
            }
        }()
    }

    // Send rows as work
    for py := 0; py < height; py++ {
        rows <- py
    }
    close(rows)

    wg.Wait()
    return img
}
Benchmark:
Work Queue (pixels): 180ms
Work Queue (rows): 145ms
Speedup: 5.9x
Even better! Fewer channel sends (2048 vs. 4 million), better cache locality.
Solution 3: Tile-Based Partitioning
Divide image into tiles, distribute tiles dynamically:
type tile struct {
    x, y, width, height int
}

func renderTiles(workers, tileSize int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    tiles := make(chan tile, 100)

    var wg sync.WaitGroup

    // Start workers
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for t := range tiles {
                for py := t.y; py < t.y+t.height && py < height; py++ {
                    y := float64(py)/height*(ymax-ymin) + ymin
                    for px := t.x; px < t.x+t.width && px < width; px++ {
                        x := float64(px)/width*(xmax-xmin) + xmin
                        c := complex(x, y)
                        iter := mandelbrot(c)
                        img.Set(px, py, iterToColor(iter))
                    }
                }
            }
        }()
    }

    // Generate tiles
    for y := 0; y < height; y += tileSize {
        for x := 0; x < width; x += tileSize {
            tiles <- tile{x, y, tileSize, tileSize}
        }
    }
    close(tiles)

    wg.Wait()
    return img
}
Benchmark (64×64 tiles):
Row Queue: 145ms
Tiles: 130ms
Speedup: 6.5x
Tiles provide better load balancing than rows and better cache locality than individual pixels.
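The trade-off is easy to quantify. Here is a quick sketch (printGranularity is illustrative only; it reuses the package constants and needs "fmt" imported) that counts work items per granularity for this 2048×2048 image:

// printGranularity counts work items, and items per worker, for each strategy.
func printGranularity(tileSize, workers int) {
    pixels := width * height // 4,194,304
    rows := height           // 2,048
    tilesX := (width + tileSize - 1) / tileSize
    tilesY := (height + tileSize - 1) / tileSize
    tiles := tilesX * tilesY // 1,024 for 64×64 tiles

    fmt.Printf("pixels: %8d items, %7d per worker\n", pixels, pixels/workers)
    fmt.Printf("rows:   %8d items, %7d per worker\n", rows, rows/workers)
    fmt.Printf("tiles:  %8d items, %7d per worker\n", tiles, tiles/workers)
}

With 64×64 tiles there are 1,024 work items, about 128 per worker with 8 workers: plenty to smooth out the slow regions, yet only half the channel traffic of the row queue and a tiny fraction of the pixel queue's.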
Advanced: Adaptive Tile Sizing
For zoomed-in views, some tiles are all-black (in the set), others are detailed. Adaptive sizing helps:
func shouldSubdivide(t tile, img *image.RGBA, threshold int) bool {
    // Sample corners
    samples := []pixel{
        {t.x, t.y},
        {t.x + t.width - 1, t.y},
        {t.x, t.y + t.height - 1},
        {t.x + t.width - 1, t.y + t.height - 1},
    }

    iters := make([]int, len(samples))
    for i, p := range samples {
        x := float64(p.x)/width*(xmax-xmin) + xmin
        y := float64(p.y)/height*(ymax-ymin) + ymin
        c := complex(x, y)
        iters[i] = mandelbrot(c)
    }

    // If corners have different iterations, subdivide
    min, max := iters[0], iters[0]
    for _, iter := range iters {
        if iter < min {
            min = iter
        }
        if iter > max {
            max = iter
        }
    }
    return max-min > threshold
}
This is complex but can improve performance on detailed regions.
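One way to wire shouldSubdivide into the tile generator is a recursive split: uniform-looking tiles go straight to the workers, mixed tiles get quartered down to a minimum size. The sketch below is an illustration only; enqueueAdaptive is not part of the benchmarked code.

// enqueueAdaptive splits tiles whose corner samples disagree by more than
// threshold, down to minSize, then sends them to the worker channel.
func enqueueAdaptive(tiles chan<- tile, img *image.RGBA, t tile, threshold, minSize int) {
    if t.width <= minSize || t.height <= minSize || !shouldSubdivide(t, img, threshold) {
        tiles <- t
        return
    }
    hw, hh := t.width/2, t.height/2
    enqueueAdaptive(tiles, img, tile{t.x, t.y, hw, hh}, threshold, minSize)
    enqueueAdaptive(tiles, img, tile{t.x + hw, t.y, t.width - hw, hh}, threshold, minSize)
    enqueueAdaptive(tiles, img, tile{t.x, t.y + hh, hw, t.height - hh}, threshold, minSize)
    enqueueAdaptive(tiles, img, tile{t.x + hw, t.y + hh, t.width - hw, t.height - hh}, threshold, minSize)
}

The tile generator in renderTiles would call enqueueAdaptive for each top-level tile instead of sending it directly. Note that the corner sampling itself recomputes mandelbrot, so the threshold and minimum size need tuning before this pays for itself.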
Race Condition: Concurrent Image Writes
Wait—we’re calling img.Set(x, y, color) from multiple goroutines. Is that safe?
Answer: Yes, if different goroutines write to different pixels. image.RGBA is backed by a byte slice, and different (x, y) map to different slice indices.
But: If goroutines accidentally write to the same pixel, it’s a data race.
Proof:
func TestConcurrentWrites(t *testing.T) {
    img := image.NewRGBA(image.Rect(0, 0, 10, 10))

    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            img.Set(5, 5, color.White) // All write to same pixel
        }()
    }
    wg.Wait()
}
Run with: go test -race
Output:
WARNING: DATA RACE
Write at 0x00c0000b0000 by goroutine 7:
...
Data race! Multiple goroutines writing to the same memory.
In our Mandelbrot code, each goroutine writes to different pixels (controlled by work distribution), so it’s safe.
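The safety argument comes down to the pixel-to-byte mapping, which *image.RGBA exposes directly through its PixOffset method. Writing the four bytes yourself (a sketch; setPixelRGBA is not in the code above) also skips the color.Color interface conversion that Set performs:

// setPixelRGBA writes one pixel's bytes directly. PixOffset maps (x, y) to an
// index into img.Pix, so goroutines touching different pixels touch disjoint
// bytes, which is the same property that makes the concurrent img.Set calls safe.
func setPixelRGBA(img *image.RGBA, x, y int, c color.RGBA) {
    i := img.PixOffset(x, y)
    img.Pix[i+0] = c.R
    img.Pix[i+1] = c.G
    img.Pix[i+2] = c.B
    img.Pix[i+3] = c.A
}

The standard library's img.SetRGBA(x, y, c) does the same job; either way, distinct coordinates map to distinct slice indices, which is why the renderers above are race-free.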
Performance Comparison
func benchmark() {
    implementations := []struct {
        name string
        fn   func() *image.RGBA
    }{
        {"Sequential", renderSequential},
        {"Parallel Rows (8)", func() *image.RGBA { return renderParallelRows(8) }},
        {"Work Queue Pixels (8)", func() *image.RGBA { return renderWorkQueue(8) }},
        {"Work Queue Rows (8)", func() *image.RGBA { return renderRowQueue(8) }},
        {"Tiles 64×64 (8)", func() *image.RGBA { return renderTiles(8, 64) }},
        {"Tiles 32×32 (8)", func() *image.RGBA { return renderTiles(8, 32) }},
    }

    for _, impl := range implementations {
        start := time.Now()
        img := impl.fn()
        duration := time.Since(start)
        fmt.Printf("%-25s: %8v\n", impl.name, duration)

        // Save image
        f, _ := os.Create(impl.name + ".png")
        png.Encode(f, img)
        f.Close()
    }
}
Output (2048×2048, 8 cores):
Sequential : 850ms
Parallel Rows (8) : 420ms (2.0x)
Work Queue Pixels (8) : 180ms (4.7x)
Work Queue Rows (8) : 145ms (5.9x)
Tiles 64×64 (8) : 130ms (6.5x)
Tiles 32×32 (8) : 135ms (6.3x)
Best: Tiles (64×64) with 6.5x speedup on 8 cores.
Scalability: More Workers
Does adding more workers help beyond core count?
func scalability() {
    workerCounts := []int{1, 2, 4, 8, 16, 32}

    fmt.Println("Workers | Time | Speedup")
    fmt.Println("--------|---------|--------")

    baseline := 0.0
    for _, workers := range workerCounts {
        start := time.Now()
        renderTiles(workers, 64)
        duration := time.Since(start).Seconds()

        if workers == 1 {
            baseline = duration
        }
        speedup := baseline / duration
        fmt.Printf("%7d | %6.3fs | %.2fx\n", workers, duration, speedup)
    }
}
Output (8-core machine):
Workers | Time | Speedup
--------|---------|--------
1 | 0.850s | 1.00x
2 | 0.440s | 1.93x
4 | 0.230s | 3.70x
8 | 0.130s | 6.54x
16 | 0.135s | 6.30x
32 | 0.140s | 6.07x
Diminishing returns beyond core count. More workers don’t help because we’re CPU-bound with limited cores.
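Rather than hard-coding 8 workers, it is safer to size the pool from the machine. A minimal sketch using the standard runtime package:

import "runtime"

// workerCount returns the number of OS threads Go will schedule goroutines on.
// GOMAXPROCS(0) queries the current setting without changing it; by default it
// equals runtime.NumCPU().
func workerCount() int {
    return runtime.GOMAXPROCS(0)
}

// Usage: img := renderTiles(workerCount(), 64)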
Memory Optimization: Shared Color Palette
Instead of creating new color.RGBA for each pixel:
var colorPalette [maxIter + 1]color.Color

func initPalette() {
    for i := 0; i <= maxIter; i++ {
        if i == maxIter {
            colorPalette[i] = color.Black
        } else {
            v := uint8(i * 255 / maxIter)
            colorPalette[i] = color.RGBA{v, v, 255 - v, 255}
        }
    }
}

func iterToColorFast(iter int) color.Color {
    return colorPalette[iter]
}
Benchmark:
Before: 130ms
After: 122ms
Small improvement, but reduces allocations.
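To confirm the allocation difference, a pair of benchmark sketches (hypothetical names, in a _test.go file with "testing" imported) can compare the two color functions with b.ReportAllocs:

var sinkColor color.Color // package-level sink so the calls are not optimized away

func BenchmarkIterToColor(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        sinkColor = iterToColor(i % maxIter)
    }
}

func BenchmarkIterToColorFast(b *testing.B) {
    initPalette()
    b.ResetTimer()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        sinkColor = iterToColorFast(i % maxIter)
    }
}

Running go test -bench=IterToColor -benchmem should show the boxed color.RGBA allocation in the original version and none in the palette version, which reuses colors that were boxed once at startup.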
Rendering Different Regions
Zooming into interesting regions:
func renderRegion(xmin, xmax, ymin, ymax float64, workers int) *image.RGBA {
    // Same code, but use custom bounds instead of constants
}

// Zoom into spiral region
renderRegion(-0.8, -0.7, -0.1, 0.0, 8)

// Zoom into "elephant valley"
renderRegion(0.25, 0.35, 0.0, 0.1, 8)
Different regions have different complexity. Some zoom levels are mostly in the set (slow), others escape quickly (fast).
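For completeness, here is a minimal renderRegion sketch that reuses the row-queue pattern, with the function parameters shadowing the package-level bounds. It is an illustration of "same code, custom bounds", not necessarily the exact code behind the zooms above:

func renderRegion(xmin, xmax, ymin, ymax float64, workers int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    rows := make(chan int, height)

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for py := range rows {
                // The parameters shadow the package constants, so these
                // formulas now map pixels into the requested region.
                y := float64(py)/height*(ymax-ymin) + ymin
                for px := 0; px < width; px++ {
                    x := float64(px)/width*(xmax-xmin) + xmin
                    img.Set(px, py, iterToColor(mandelbrot(complex(x, y))))
                }
            }
        }()
    }

    for py := 0; py < height; py++ {
        rows <- py
    }
    close(rows)

    wg.Wait()
    return img
}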
Progress Reporting
For long renders, show progress:
func renderWithProgress(workers int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    rows := make(chan int, height)
    progress := make(chan int, height)

    var wg sync.WaitGroup

    // Workers
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for py := range rows {
                y := float64(py)/height*(ymax-ymin) + ymin
                for px := 0; px < width; px++ {
                    x := float64(px)/width*(xmax-xmin) + xmin
                    c := complex(x, y)
                    iter := mandelbrot(c)
                    img.Set(px, py, iterToColor(iter))
                }
                progress <- 1
            }
        }()
    }

    // Progress reporter
    go func() {
        completed := 0
        for range progress {
            completed++
            if completed%100 == 0 {
                pct := 100 * completed / height
                fmt.Printf("\rProgress: %d%%", pct)
            }
        }
        fmt.Println()
    }()

    // Send work
    for py := 0; py < height; py++ {
        rows <- py
    }
    close(rows)

    wg.Wait()
    close(progress)
    return img
}
Output:
Progress: 24%
Progress: 49%
Progress: 74%
Progress: 99%
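An alternative sketch (not from the original code) replaces the progress channel with an atomic counter polled by a ticker, which avoids one channel operation per row and the draining goroutine. It assumes "fmt", "sync/atomic", and "time" are imported:

// Workers increment the counter after finishing a row: completed.Add(1)
// instead of progress <- 1.
var completed atomic.Int64

func reportProgress(done <-chan struct{}) {
    ticker := time.NewTicker(200 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            fmt.Printf("\rProgress: %d%%", 100*completed.Load()/int64(height))
        case <-done:
            fmt.Printf("\rProgress: 100%%\n")
            return
        }
    }
}

Start it with go reportProgress(done) before sending work, and close(done) after wg.Wait().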
Testing Visual Correctness
func TestVisualCorrectness(t *testing.T) {
    seq := renderSequential()
    par := renderTiles(8, 64)

    // Compare pixel-by-pixel
    for y := 0; y < height; y++ {
        for x := 0; x < width; x++ {
            c1 := seq.At(x, y)
            c2 := par.At(x, y)
            if c1 != c2 {
                t.Errorf("Pixel (%d, %d) differs: %v vs %v", x, y, c1, c2)
                return
            }
        }
    }
}
This ensures parallel version produces identical output.
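Comparing 4 million pixels through the color.Color interface is slow. Since both results are *image.RGBA values with the same bounds, a byte-level variant (hypothetical test name; needs "bytes" and "testing" imported) does the same check much faster:

func TestVisualCorrectnessFast(t *testing.T) {
    seq := renderSequential()
    par := renderTiles(8, 64)

    // Same geometry means the Pix slices line up byte for byte.
    if !seq.Rect.Eq(par.Rect) || seq.Stride != par.Stride {
        t.Fatalf("image layouts differ")
    }
    if !bytes.Equal(seq.Pix, par.Pix) {
        t.Errorf("rendered pixels differ from the sequential reference")
    }
}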
Real-World Applications
- Fractal Rendering: Mandelbrot, Julia sets, 3D fractals
- Scientific Visualization: Simulations, data plotting
- Ray Tracing: Rendering 3D scenes
- Image Processing: Filters, convolutions, transformations
- Video Encoding: Frame processing, compression
Common Pitfalls
- Static partitioning: Causes load imbalance
- Too fine-grained: Pixel-level has too much overhead
- Too coarse-grained: Rows/columns have load imbalance
- Assuming linear speedup: Overhead limits scaling
- Ignoring cache: Tiles improve locality
Best Practices
- Dynamic work distribution: Use work queues
- Balance granularity: Tiles/rows, not individual pixels
- Measure load imbalance: Profile to identify bottlenecks
- Match workers to cores: More isn’t always better
- Pre-allocate: Avoid allocations in hot loops
- Test correctness: Parallel must match sequential
Conclusion
The Mandelbrot set teaches us about load balancing in parallel computing:
- Independent work is easy to parallelize: Each pixel is separate
- Uniform work is rare: Some pixels take 100x longer
- Static partitioning fails: Row-based distribution has imbalance
- Dynamic distribution wins: Work queues adapt to varying load
- Granularity matters: Tiles trade off per-task overhead against load balance
The key insight: Parallelism is only as good as your load balancing.
For embarrassingly parallel problems with uniform work (Monte Carlo), static partitioning is fine. For non-uniform work (Mandelbrot), you need dynamic distribution.
Techniques we used:
- Work queues (rows, tiles)
- Dynamic load balancing (idle workers pull the next task from a shared channel)
- Granularity tuning (pixels → rows → tiles)
- Progress reporting
- Performance measurement
These apply far beyond fractals—any time work items have variable cost, load balancing is critical.
← Sieve of Eratosthenes | Series Overview | Collatz Explorer →