An example of CPU cache false sharing in Go.

A simple example in which two integer variables are incremented concurrently.
The baseline version suffers from false sharing because both values share the same cache line:

type IntVars struct {
	i1 int64 // i1 and i2 are adjacent in memory, so they typically share a cache line
	i2 int64
}

The optimized version eliminates false sharing by introducing padding between the two fields:

import "golang.org/x/sys/cpu"

type IntVars struct {
	i1 int64
	_  cpu.CacheLinePad // padding pushes i2 onto a separate cache line
	i2 int64
}
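
The benchmark code itself is not reproduced here. The function name below matches the results further down, but the body is only a sketch of how the parallel case could be written; the repository's actual benchmark may differ:

package main

import (
	"sync"
	"testing"
)

// Sketch only: two goroutines each increment their own field of a shared
// IntVars value b.N times. With the unpadded layout every write invalidates
// the other core's copy of the shared cache line; with the padded layout the
// two fields live on separate lines and the writes do not interfere.
func BenchmarkIncrement2ValuesInParallel(b *testing.B) {
	var v IntVars
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for n := 0; n < b.N; n++ {
			v.i1++
		}
	}()
	go func() {
		defer wg.Done()
		for n := 0; n < b.N; n++ {
			v.i2++
		}
	}()
	wg.Wait()
}

The single-value benchmark (Increment1Value) only ever touches one field, so padding cannot help it; its delta stays within noise in the results below.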

Reproducing results

  1. Install benchstat:
go install golang.org/x/perf/cmd/[email protected]

  2. Run benchmarks for simple increments (a++):

▶ make bench

name                          old time/op  new time/op  delta
Increment1Value-2             1.66ns ± 8%  1.71ns ± 7%     ~     (p=0.421 n=5+5)
Increment2ValuesInParallel-2  2.34ns ± 5%  1.59ns ± 3%  -32.23%  (p=0.008 n=5+5)

  3. Run benchmarks for atomic increments with atomic.AddInt64(addr, 1); a sketch of the atomic variant follows these results:

▶ make bench-atomic

name                          old time/op  new time/op  delta
Increment1Value-2             5.65ns ± 5%  5.85ns ± 6%     ~     (p=0.310 n=5+5)
Increment2ValuesInParallel-2  41.6ns ±10%   5.4ns ± 8%  -87.12%  (p=0.008 n=5+5)
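
For the atomic runs, the same benchmark shape applies with the plain increments replaced by atomic.AddInt64. The repository appears to reuse the same benchmark names for the atomic run (the results above show the same Increment2ValuesInParallel label), so the separate function name here is purely for illustration:

package main

import (
	"sync"
	"sync/atomic"
	"testing"
)

// Sketch only: identical to the plain-increment benchmark except that each
// goroutine performs an atomic read-modify-write. Atomic operations need
// exclusive ownership of the cache line on every increment, so contention on
// a shared line is far more expensive; that is why the unpadded layout is so
// much slower here and why padding removes almost all of that cost.
func BenchmarkIncrement2ValuesInParallelAtomic(b *testing.B) {
	var v IntVars
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for n := 0; n < b.N; n++ {
			atomic.AddInt64(&v.i1, 1)
		}
	}()
	go func() {
		defer wg.Done()
		for n := 0; n < b.N; n++ {
			atomic.AddInt64(&v.i2, 1)
		}
	}()
	wg.Wait()
}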

CPU cache miss stats

On Linux, perf can be used to measure L1 data-cache load misses and demonstrate the effect of false sharing.

  1. Build executables for both the original and optimized versions:

▶ make build

GOOS=linux GOARCH=amd64 go build -o test
GOOS=linux GOARCH=amd64 go build -tags padded -o test-padded

  2. Run perf for both executables and compare the numbers; a sketch of the measured program follows the output:

▶ perf stat -B -e L1-dcache-load-misses ./test

 Performance counter stats for './test':

         8,954,010      L1-dcache-load-misses

▶ perf stat -B -e L1-dcache-load-misses ./test-padded

 Performance counter stats for './test-padded':

           204,287      L1-dcache-load-misses
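
The program measured above is presumably just a tight loop over both fields. The following is a sketch of what ./test might do; the file layout, iteration count, and build-tag wiring are assumptions based on the -tags padded build command above, not the repository's actual main.go:

package main

import "sync"

// Sketch only. The default build would compile the unpadded IntVars, while
// `go build -tags padded` would compile a file guarded by `//go:build padded`
// that defines the version with cpu.CacheLinePad between the fields.
func main() {
	const iterations = 100_000_000 // assumed iteration count
	var v IntVars
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for n := 0; n < iterations; n++ {
			v.i1++
		}
	}()
	go func() {
		defer wg.Done()
		for n := 0; n < iterations; n++ {
			v.i2++
		}
	}()
	wg.Wait()
}

With this kind of workload, the padded build shows roughly 40x fewer L1 data-cache load misses in the perf output above, which is the false-sharing traffic being eliminated.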
