The Problem: Rendering Fractals in Parallel
The Mandelbrot set is defined by a simple iterative formula:
- Start with z = 0
- Repeatedly compute z = z² + c
- If |z| exceeds 2, the point escapes (not in the set)
- Color each pixel by iteration count
The beauty: Each pixel is completely independent. Perfect for parallelism!
The challenge: Some pixels escape in 5 iterations, others take 1000+. This creates load imbalance—some workers finish instantly while others grind away.
How do you distribute work when you don’t know how hard each piece is?
The Sequential Implementation
package main

import (
    "image"
    "image/color"
    "image/png"
    "math/cmplx"
    "os"
)

const (
    width   = 2048
    height  = 2048
    maxIter = 256

    xmin, xmax = -2.0, 1.0
    ymin, ymax = -1.5, 1.5
)

func mandelbrot(c complex128) int {
    z := complex(0, 0)
    for n := 0; n < maxIter; n++ {
        if cmplx.Abs(z) > 2 {
            return n
        }
        z = z*z + c
    }
    return maxIter
}

func renderSequential() *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    for py := 0; py < height; py++ {
        y := float64(py)/height*(ymax-ymin) + ymin
        for px := 0; px < width; px++ {
            x := float64(px)/width*(xmax-xmin) + xmin
            c := complex(x, y)
            iter := mandelbrot(c)
            img.Set(px, py, iterToColor(iter))
        }
    }
    return img
}

func iterToColor(iter int) color.Color {
    if iter == maxIter {
        return color.Black // In the set
    }
    // Color gradient based on iteration count
    v := uint8(iter * 255 / maxIter)
    return color.RGBA{v, v, 255 - v, 255}
}

func main() {
    img := renderSequential()
    f, _ := os.Create("mandelbrot.png")
    defer f.Close()
    png.Encode(f, img)
}
Time: ~850ms (for 2048×2048 image)
Each pixel requires computing the Mandelbrot iteration, completely independent of other pixels. Let’s parallelize!
Naive Parallel: Row-Based Partitioning
Divide rows among workers:
import "sync"

func renderParallelRows(workers int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    rowsPerWorker := height / workers

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)

        startRow := w * rowsPerWorker
        endRow := startRow + rowsPerWorker
        if w == workers-1 {
            endRow = height // Last worker takes remainder
        }

        go func(startRow, endRow int) {
            defer wg.Done()
            for py := startRow; py < endRow; py++ {
                y := float64(py)/height*(ymax-ymin) + ymin
                for px := 0; px < width; px++ {
                    x := float64(px)/width*(xmax-xmin) + xmin
                    c := complex(x, y)
                    iter := mandelbrot(c)
                    img.Set(px, py, iterToColor(iter))
                }
            }
        }(startRow, endRow)
    }

    wg.Wait()
    return img
}
Benchmark (8 workers):
Sequential: 850ms
Parallel Rows: 420ms
Speedup: 2.0x
Only 2x speedup on 8 cores? We expected ~8x! What’s wrong?
The Load Imbalance Problem
Let’s measure time per row:
import (
    "fmt"
    "time"
)

func measureRowTimes() {
    times := make([]time.Duration, height)

    for py := 0; py < height; py++ {
        start := time.Now()
        y := float64(py)/height*(ymax-ymin) + ymin
        for px := 0; px < width; px++ {
            x := float64(px)/width*(xmax-xmin) + xmin
            c := complex(x, y)
            mandelbrot(c)
        }
        times[py] = time.Since(start)
    }

    // Find min, max, avg
    min, max, sum := times[0], times[0], time.Duration(0)
    for _, t := range times {
        if t < min {
            min = t
        }
        if t > max {
            max = t
        }
        sum += t
    }
    avg := sum / time.Duration(len(times))

    fmt.Printf("Min: %v\n", min)
    fmt.Printf("Max: %v\n", max)
    fmt.Printf("Avg: %v\n", avg)
    fmt.Printf("Ratio: %.2fx\n", float64(max)/float64(min))
}
Output:
Min: 82µs
Max: 3.2ms
Avg: 415µs
Ratio: 39.02x
The slowest row takes 39x longer than the fastest! Some rows are mostly black (high iteration count), others escape quickly.
When we partition by rows, one worker might get all the slow rows while another gets all the fast rows. Load imbalance.
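To make the imbalance concrete, here is a small instrumented sketch (measureWorkerTimes is not part of the renderer above; it reuses mandelbrot and the package constants, and needs "fmt", "sync", and "time" imported) that times each worker's statically assigned band of rows:

// measureWorkerTimes mimics the static row partitioning and reports how long
// each worker's band of rows takes to compute.
func measureWorkerTimes(workers int) {
    rowsPerWorker := height / workers
    times := make([]time.Duration, workers)

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        startRow := w * rowsPerWorker
        endRow := startRow + rowsPerWorker
        if w == workers-1 {
            endRow = height
        }
        go func(w, startRow, endRow int) {
            defer wg.Done()
            start := time.Now()
            for py := startRow; py < endRow; py++ {
                y := float64(py)/height*(ymax-ymin) + ymin
                for px := 0; px < width; px++ {
                    x := float64(px)/width*(xmax-xmin) + xmin
                    mandelbrot(complex(x, y))
                }
            }
            times[w] = time.Since(start) // each goroutine writes only its own slot
        }(w, startRow, endRow)
    }
    wg.Wait()

    for w, t := range times {
        fmt.Printf("worker %d: %v\n", w, t)
    }
}

With static bands, the workers covering the middle of the image, where most points never escape, should take several times longer than the ones covering the top and bottom edges.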
Solution 1: Fine-Grained Work Distribution
Instead of static partitioning, use a work queue:
type pixel struct {
    x, y int
}

func renderWorkQueue(workers int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))

    // Channel of pixels to process
    work := make(chan pixel, 1000)

    var wg sync.WaitGroup

    // Start workers
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for p := range work {
                x := float64(p.x)/width*(xmax-xmin) + xmin
                y := float64(p.y)/height*(ymax-ymin) + ymin
                c := complex(x, y)
                iter := mandelbrot(c)
                img.Set(p.x, p.y, iterToColor(iter))
            }
        }()
    }

    // Send all pixels as work
    for py := 0; py < height; py++ {
        for px := 0; px < width; px++ {
            work <- pixel{px, py}
        }
    }
    close(work)

    wg.Wait()
    return img
}
Benchmark:
Sequential: 850ms
Parallel Rows: 420ms
Work Queue: 180ms
Speedup: 4.7x
Much better! By distributing pixels dynamically, workers stay busy. No worker sits idle while others grind.
But: We’re sending 4 million pixels through a channel. That’s overhead.
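How expensive is a channel send? A rough micro-benchmark sketch (BenchmarkPixelChannel is a hypothetical name; it goes in a _test.go file with "testing" imported) gives a ballpark for the per-item cost we pay roughly four million times:

// BenchmarkPixelChannel measures the cost of pushing one work item through a
// buffered channel to a single draining consumer. A rough sketch, not a model
// of the real renderer's contention with 8 receivers.
func BenchmarkPixelChannel(b *testing.B) {
    work := make(chan pixel, 1000)
    done := make(chan struct{})
    go func() {
        for range work {
        }
        close(done)
    }()

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        work <- pixel{i % width, (i / width) % height}
    }
    close(work)
    <-done
}

Whatever the exact number on your machine, multiplying it by four million work items is why pixel-level granularity gives back some of the speedup.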
Solution 2: Row-Based Work Queue
Balance between row partitioning and pixel-level:
func renderRowQueue(workers int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    rows := make(chan int, height)

    var wg sync.WaitGroup

    // Start workers
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for py := range rows {
                y := float64(py)/height*(ymax-ymin) + ymin
                for px := 0; px < width; px++ {
                    x := float64(px)/width*(xmax-xmin) + xmin
                    c := complex(x, y)
                    iter := mandelbrot(c)
                    img.Set(px, py, iterToColor(iter))
                }
            }
        }()
    }

    // Send rows as work
    for py := 0; py < height; py++ {
        rows <- py
    }
    close(rows)

    wg.Wait()
    return img
}
Benchmark:
Work Queue (pixels): 180ms
Work Queue (rows): 145ms
Speedup: 5.9x
Even better! Fewer channel sends (2048 vs. 4 million), better cache locality.
Solution 3: Tile-Based Partitioning
Divide image into tiles, distribute tiles dynamically:
type tile struct {
    x, y, width, height int
}

func renderTiles(workers, tileSize int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    tiles := make(chan tile, 100)

    var wg sync.WaitGroup

    // Start workers
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for t := range tiles {
                for py := t.y; py < t.y+t.height && py < height; py++ {
                    y := float64(py)/height*(ymax-ymin) + ymin
                    for px := t.x; px < t.x+t.width && px < width; px++ {
                        x := float64(px)/width*(xmax-xmin) + xmin
                        c := complex(x, y)
                        iter := mandelbrot(c)
                        img.Set(px, py, iterToColor(iter))
                    }
                }
            }
        }()
    }

    // Generate tiles
    for y := 0; y < height; y += tileSize {
        for x := 0; x < width; x += tileSize {
            tiles <- tile{x, y, tileSize, tileSize}
        }
    }
    close(tiles)

    wg.Wait()
    return img
}
Benchmark (64×64 tiles):
Row Queue: 145ms
Tiles: 130ms
Speedup: 6.5x
Tiles provide better load balancing than rows and better cache locality than individual pixels.
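The trade-off is easy to quantify. Here is a quick sketch (printGranularity is illustrative only; it reuses the package constants and needs "fmt" imported) that counts work items per granularity for this 2048×2048 image:

// printGranularity counts work items, and items per worker, for each strategy.
func printGranularity(tileSize, workers int) {
    pixels := width * height // 4,194,304
    rows := height           // 2,048
    tilesX := (width + tileSize - 1) / tileSize
    tilesY := (height + tileSize - 1) / tileSize
    tiles := tilesX * tilesY // 1,024 for 64×64 tiles

    fmt.Printf("pixels: %8d items, %7d per worker\n", pixels, pixels/workers)
    fmt.Printf("rows:   %8d items, %7d per worker\n", rows, rows/workers)
    fmt.Printf("tiles:  %8d items, %7d per worker\n", tiles, tiles/workers)
}

With 64×64 tiles there are 1,024 work items, about 128 per worker with 8 workers: plenty to smooth out the slow regions, yet only half the channel traffic of the row queue and a tiny fraction of the pixel queue's.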
Advanced: Adaptive Tile Sizing
For zoomed-in views, some tiles are all-black (in the set), others are detailed. Adaptive sizing helps:
func shouldSubdivide(t tile, img *image.RGBA, threshold int) bool {
    // Sample corners
    samples := []pixel{
        {t.x, t.y},
        {t.x + t.width - 1, t.y},
        {t.x, t.y + t.height - 1},
        {t.x + t.width - 1, t.y + t.height - 1},
    }

    iters := make([]int, len(samples))
    for i, p := range samples {
        x := float64(p.x)/width*(xmax-xmin) + xmin
        y := float64(p.y)/height*(ymax-ymin) + ymin
        c := complex(x, y)
        iters[i] = mandelbrot(c)
    }

    // If corners have different iterations, subdivide
    min, max := iters[0], iters[0]
    for _, iter := range iters {
        if iter < min {
            min = iter
        }
        if iter > max {
            max = iter
        }
    }
    return max-min > threshold
}
This is complex but can improve performance on detailed regions.
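One way to wire shouldSubdivide into the tile generator is a recursive split: uniform-looking tiles go straight to the workers, mixed tiles get quartered down to a minimum size. The sketch below is an illustration only; enqueueAdaptive is not part of the benchmarked code.

// enqueueAdaptive splits tiles whose corner samples disagree by more than
// threshold, down to minSize, then sends them to the worker channel.
func enqueueAdaptive(tiles chan<- tile, img *image.RGBA, t tile, threshold, minSize int) {
    if t.width <= minSize || t.height <= minSize || !shouldSubdivide(t, img, threshold) {
        tiles <- t
        return
    }
    hw, hh := t.width/2, t.height/2
    enqueueAdaptive(tiles, img, tile{t.x, t.y, hw, hh}, threshold, minSize)
    enqueueAdaptive(tiles, img, tile{t.x + hw, t.y, t.width - hw, hh}, threshold, minSize)
    enqueueAdaptive(tiles, img, tile{t.x, t.y + hh, hw, t.height - hh}, threshold, minSize)
    enqueueAdaptive(tiles, img, tile{t.x + hw, t.y + hh, t.width - hw, t.height - hh}, threshold, minSize)
}

The tile generator in renderTiles would call enqueueAdaptive for each top-level tile instead of sending it directly. Note that the corner sampling itself recomputes mandelbrot, so the threshold and minimum size need tuning before this pays for itself.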
Race Condition: Concurrent Image Writes
Wait—we’re calling img.Set(x, y, color) from multiple goroutines. Is that safe?
Answer: Yes, if different goroutines write to different pixels. image.RGBA is backed by a byte slice, and different (x, y) map to different slice indices.
But: If goroutines accidentally write to the same pixel, it’s a data race.
Proof:
func TestConcurrentWrites(t *testing.T) {
    img := image.NewRGBA(image.Rect(0, 0, 10, 10))

    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            img.Set(5, 5, color.White) // All write to same pixel
        }()
    }
    wg.Wait()
}
Run with: go test -race
Output:
WARNING: DATA RACE
Write at 0x00c0000b0000 by goroutine 7:
...
Data race! Multiple goroutines writing to the same memory.
In our Mandelbrot code, each goroutine writes to different pixels (controlled by work distribution), so it’s safe.
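The safety argument comes down to the pixel-to-byte mapping, which *image.RGBA exposes directly through its PixOffset method. Writing the four bytes yourself (a sketch; setPixelRGBA is not in the code above) also skips the color.Color interface conversion that Set performs:

// setPixelRGBA writes one pixel's bytes directly. PixOffset maps (x, y) to an
// index into img.Pix, so goroutines touching different pixels touch disjoint
// bytes, which is the same property that makes the concurrent img.Set calls safe.
func setPixelRGBA(img *image.RGBA, x, y int, c color.RGBA) {
    i := img.PixOffset(x, y)
    img.Pix[i+0] = c.R
    img.Pix[i+1] = c.G
    img.Pix[i+2] = c.B
    img.Pix[i+3] = c.A
}

The standard library's img.SetRGBA(x, y, c) does the same job; either way, distinct coordinates map to distinct slice indices, which is why the renderers above are race-free.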
Performance Comparison
func benchmark() {
    implementations := []struct {
        name string
        fn   func() *image.RGBA
    }{
        {"Sequential", renderSequential},
        {"Parallel Rows (8)", func() *image.RGBA { return renderParallelRows(8) }},
        {"Work Queue Pixels (8)", func() *image.RGBA { return renderWorkQueue(8) }},
        {"Work Queue Rows (8)", func() *image.RGBA { return renderRowQueue(8) }},
        {"Tiles 64×64 (8)", func() *image.RGBA { return renderTiles(8, 64) }},
        {"Tiles 32×32 (8)", func() *image.RGBA { return renderTiles(8, 32) }},
    }

    for _, impl := range implementations {
        start := time.Now()
        img := impl.fn()
        duration := time.Since(start)
        fmt.Printf("%-25s: %8v\n", impl.name, duration)

        // Save image
        f, _ := os.Create(impl.name + ".png")
        png.Encode(f, img)
        f.Close()
    }
}
Output (2048×2048, 8 cores):
Sequential : 850ms
Parallel Rows (8) : 420ms (2.0x)
Work Queue Pixels (8) : 180ms (4.7x)
Work Queue Rows (8) : 145ms (5.9x)
Tiles 64×64 (8) : 130ms (6.5x)
Tiles 32×32 (8) : 135ms (6.3x)
Best: Tiles (64×64) with 6.5x speedup on 8 cores.
Scalability: More Workers
Does adding more workers help beyond core count?
func scalability() {
    workerCounts := []int{1, 2, 4, 8, 16, 32}

    fmt.Println("Workers | Time | Speedup")
    fmt.Println("--------|---------|--------")

    baseline := 0.0
    for _, workers := range workerCounts {
        start := time.Now()
        renderTiles(workers, 64)
        duration := time.Since(start).Seconds()

        if workers == 1 {
            baseline = duration
        }
        speedup := baseline / duration
        fmt.Printf("%7d | %6.3fs | %.2fx\n", workers, duration, speedup)
    }
}
Output (8-core machine):
Workers | Time | Speedup
--------|---------|--------
1 | 0.850s | 1.00x
2 | 0.440s | 1.93x
4 | 0.230s | 3.70x
8 | 0.130s | 6.54x
16 | 0.135s | 6.30x
32 | 0.140s | 6.07x
Diminishing returns beyond core count. More workers don’t help because we’re CPU-bound with limited cores.
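Rather than hard-coding 8 workers, it is safer to size the pool from the machine. A minimal sketch using the standard runtime package:

import "runtime"

// workerCount returns the number of OS threads Go will schedule goroutines on.
// GOMAXPROCS(0) queries the current setting without changing it; by default it
// equals runtime.NumCPU().
func workerCount() int {
    return runtime.GOMAXPROCS(0)
}

// Usage: img := renderTiles(workerCount(), 64)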
Memory Optimization: Shared Color Palette
Instead of creating new color.RGBA for each pixel:
var colorPalette [maxIter + 1]color.Color

func initPalette() {
    for i := 0; i <= maxIter; i++ {
        if i == maxIter {
            colorPalette[i] = color.Black
        } else {
            v := uint8(i * 255 / maxIter)
            colorPalette[i] = color.RGBA{v, v, 255 - v, 255}
        }
    }
}

func iterToColorFast(iter int) color.Color {
    return colorPalette[iter]
}
Benchmark:
Before: 130ms
After: 122ms
Small improvement, but reduces allocations.
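To confirm the allocation difference, a pair of benchmark sketches (hypothetical names, in a _test.go file with "testing" imported) can compare the two color functions with b.ReportAllocs:

var sinkColor color.Color // package-level sink so the calls are not optimized away

func BenchmarkIterToColor(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        sinkColor = iterToColor(i % maxIter)
    }
}

func BenchmarkIterToColorFast(b *testing.B) {
    initPalette()
    b.ResetTimer()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        sinkColor = iterToColorFast(i % maxIter)
    }
}

Running go test -bench=IterToColor -benchmem should show the boxed color.RGBA allocation in the original version and none in the palette version, which reuses colors that were boxed once at startup.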
Rendering Different Regions
Zooming into interesting regions:
func renderRegion(xmin, xmax, ymin, ymax float64, workers int) *image.RGBA {
    // Same code, but use custom bounds instead of constants
}

// Zoom into spiral region
renderRegion(-0.8, -0.7, -0.1, 0.0, 8)

// Zoom into "elephant valley"
renderRegion(0.25, 0.35, 0.0, 0.1, 8)
Different regions have different complexity. Some zoom levels are mostly in the set (slow), others escape quickly (fast).
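For completeness, here is a minimal renderRegion sketch that reuses the row-queue pattern, with the function parameters shadowing the package-level bounds. It is an illustration of "same code, custom bounds", not necessarily the exact code behind the zooms above:

func renderRegion(xmin, xmax, ymin, ymax float64, workers int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    rows := make(chan int, height)

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for py := range rows {
                // The parameters shadow the package constants, so these
                // formulas now map pixels into the requested region.
                y := float64(py)/height*(ymax-ymin) + ymin
                for px := 0; px < width; px++ {
                    x := float64(px)/width*(xmax-xmin) + xmin
                    img.Set(px, py, iterToColor(mandelbrot(complex(x, y))))
                }
            }
        }()
    }

    for py := 0; py < height; py++ {
        rows <- py
    }
    close(rows)

    wg.Wait()
    return img
}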
Progress Reporting
For long renders, show progress:
func renderWithProgress(workers int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, width, height))
    rows := make(chan int, height)
    progress := make(chan int, height)

    var wg sync.WaitGroup

    // Workers
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for py := range rows {
                y := float64(py)/height*(ymax-ymin) + ymin
                for px := 0; px < width; px++ {
                    x := float64(px)/width*(xmax-xmin) + xmin
                    c := complex(x, y)
                    iter := mandelbrot(c)
                    img.Set(px, py, iterToColor(iter))
                }
                progress <- 1
            }
        }()
    }

    // Progress reporter
    go func() {
        completed := 0
        for range progress {
            completed++
            if completed%100 == 0 {
                pct := 100 * completed / height
                fmt.Printf("\rProgress: %d%%", pct)
            }
        }
        fmt.Println()
    }()

    // Send work
    for py := 0; py < height; py++ {
        rows <- py
    }
    close(rows)

    wg.Wait()
    close(progress)
    return img
}
Output:
Progress: 24%
Progress: 49%
Progress: 74%
Progress: 99%
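An alternative sketch (not from the original code) replaces the progress channel with an atomic counter polled by a ticker, which avoids one channel operation per row and the draining goroutine. It assumes "fmt", "sync/atomic", and "time" are imported:

// Workers increment the counter after finishing a row: completed.Add(1)
// instead of progress <- 1.
var completed atomic.Int64

func reportProgress(done <-chan struct{}) {
    ticker := time.NewTicker(200 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            fmt.Printf("\rProgress: %d%%", 100*completed.Load()/int64(height))
        case <-done:
            fmt.Printf("\rProgress: 100%%\n")
            return
        }
    }
}

Start it with go reportProgress(done) before sending work, and close(done) after wg.Wait().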
Testing Visual Correctness
func TestVisualCorrectness(t *testing.T) {
    seq := renderSequential()
    par := renderTiles(8, 64)

    // Compare pixel-by-pixel
    for y := 0; y < height; y++ {
        for x := 0; x < width; x++ {
            c1 := seq.At(x, y)
            c2 := par.At(x, y)
            if c1 != c2 {
                t.Errorf("Pixel (%d, %d) differs: %v vs %v", x, y, c1, c2)
                return
            }
        }
    }
}
This ensures parallel version produces identical output.
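Comparing 4 million pixels through the color.Color interface is slow. Since both results are *image.RGBA values with the same bounds, a byte-level variant (hypothetical test name; needs "bytes" and "testing" imported) does the same check much faster:

func TestVisualCorrectnessFast(t *testing.T) {
    seq := renderSequential()
    par := renderTiles(8, 64)

    // Same geometry means the Pix slices line up byte for byte.
    if !seq.Rect.Eq(par.Rect) || seq.Stride != par.Stride {
        t.Fatalf("image layouts differ")
    }
    if !bytes.Equal(seq.Pix, par.Pix) {
        t.Errorf("rendered pixels differ from the sequential reference")
    }
}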
Real-World Applications
- Fractal Rendering: Mandelbrot, Julia sets, 3D fractals
- Scientific Visualization: Simulations, data plotting
- Ray Tracing: Rendering 3D scenes
- Image Processing: Filters, convolutions, transformations
- Video Encoding: Frame processing, compression
Common Pitfalls
- Static partitioning: Causes load imbalance
- Too fine-grained: Pixel-level has too much overhead
- Too coarse-grained: Rows/columns have load imbalance
- Assuming linear speedup: Overhead limits scaling
- Ignoring cache: Tiles improve locality
Best Practices
- Dynamic work distribution: Use work queues
- Balance granularity: Tiles/rows, not individual pixels
- Measure load imbalance: Profile to identify bottlenecks
- Match workers to cores: More isn’t always better
- Pre-allocate: Avoid allocations in hot loops
- Test correctness: Parallel must match sequential
Conclusion
The Mandelbrot set teaches us about load balancing in parallel computing:
- Independent work is easy to parallelize: Each pixel is separate
- Uniform work is rare: Some pixels take 100x longer
- Static partitioning fails: Row-based distribution has imbalance
- Dynamic distribution wins: Work queues adapt to varying load
- Granularity matters: Tiles trade off per-task overhead against load balance
The key insight: Parallelism is only as good as your load balancing.
For embarrassingly parallel problems with uniform work (Monte Carlo), static partitioning is fine. For non-uniform work (Mandelbrot), you need dynamic distribution.
Techniques we used:
- Work queues (rows, tiles)
- Dynamic load balancing (idle workers pull the next task from a shared channel)
- Granularity tuning (pixels → rows → tiles)
- Progress reporting
- Performance measurement
These apply far beyond fractals—any time work items have variable cost, load balancing is critical.
← Sieve of Eratosthenes | Series Overview | Collatz Explorer →