Hashrocket.com / blog

Large tree

Switch vs. Map: Which is the Better Way to Branch in Go?

posted on and written by in

Image 100x100 jack christensen

While working on pgx, a PostgreSQL driver for Go, I have run into multiple occasions where it is necessary to choose between 20+ branches. Typically, I have done this with a switch statement. A more readable alternative could be a map of functions. However, I would assume that the simple branch of a switch statement would be faster that a map lookup and function call. Performance in a database driver is of paramount concern, so it was prudent to investigate the performance implications before making any changes.

TL;DR

Micro-benchmarks show substantial performance differences. But the ultimate answer is it will very rarely matter to the program as a whole. But if you want to understand the research and testing behind this assertion, read on.

Initial Research

Searching the internet did not reveal much. The few posts I found suggested that maps may be faster if there are enough branches. A discussion of possible switch optimizations from 2012 included Ken Thompson's opinion that there was not much room for improvement. I decided to write a benchmark measure the performance in modern Go.

A Basic Benchmark

The following results were produced from a Intel i7-4790K running go1.5.1 linux/amd64 on Ubuntu 14.04. The benchmark source code and results are on Github.

The most basic benchmark of a switch is as follows:

func BenchmarkSwitch(b *testing.B) {
  var n int

  for i := 0; i < b.N; i++ {
    switch i % 4 {
    case 0:
      n += f0(i)
    case 1:
      n += f1(i)
    case 2:
      n += f2(i)
    case 3:
      n += f3(i)
    }
  }

  // n will never be < 0, but checking n should ensure that the entire benchmark loop can't be optimized away.
  if n < 0 {
    b.Fatal("can't happen")
  }

Benchmarks, in particular micro-benchmarks like this are notoriously hard to get right. For example, if a block of code has no side effects an optimizer may omit the code entirely. The purposeful side effects to n are so the entire benchmark cannot be optimized away. There are several other pitfalls I will cover during this article.

The basic benchmark of a map of function is as follows:

func BenchmarkMapFunc4(b *testing.B) {
  var n int

  for i := 0; i < b.N; i++ {
    n += Funcs[i%4](i)
  }

  // n will never be < 0, but checking n should ensure that the entire benchmark loop can't be optimized away.
  if n < 0 {
    b.Fatal("can't happen")
  }
}

A Ruby erb template is used to create benchmarks for 4, 8, 16, 32, 64, 128, 256, and 512 branches. Initial results showed that the map version was about 25% slower than the switch version for 4 branches, roughly equivalent at 8 branches, and continued to gain until it reached over 50% faster at 512 branches.

Accounting for Inlineable Functions

The previous benchmark gave some initial results, but it is not sufficient to stop here. There are multiple factors that have not been accounted for yet. First of these is whether the function can be inlined or not. A function call may be inlined in a switch statement, but that will never happen in a map of functions. It is worth determining how much of an impact that has on performance.

The following function does a trivial amount of work to ensure that the entire call cannot be optimized away, but the function as a whole can be inlined by the Go compiler.

func f0(n int) int {
  if n%2 == 0 {
    return n
  } else {
    return 0
  }
}

At time of writing, the Go compiler cannot inline functions that can panic. The following function includes a panic call that will never be hit making it impossible for the function to be inlined.

func noInline0(n int) int {
  if n < 0 {
    panic("can't happen - but should ensure this function is not inlined")
  } else if n%2 == 0 {
    return n
  } else {
    return 0
  }
}

When the function cannot be inlined, performance changes substantially. The map version is ~30% faster at only 4 branches and over 300% faster at 512 branches than the switch version.

Compute or Lookup Branch Destination

The above benchmarks choose which branch to take based on the index of the benchmark loop.

  for i := 0; i < b.N; i++ {
    switch i % 4 {
      // ...
    }
  }

While this lets us measure just the branch performance, in a real-world scenario the branch discriminator would typically incur a memory read. To simulate this, I introduced a simple lookup to find the branch.

var ascInputs []int

func TestMain(m *testing.M) {
  for i := 0; i < 4096; i++ {
    ascInputs = append(ascInputs, i)
  }

  os.Exit(m.Run())
}

func BenchmarkSwitch(b *testing.B) {
  // ...
  for i := 0; i < b.N; i++ {
    switch ascInputs[i%len(ascInputs)] % 4 {
    // ...
  }
  // ...
}

This change drastically reduced performance. The best performing branch benchmark went from 1.99 ns/op to 8.18 ns/op. The best performing map benchmark went from 2.39 ns/op to 10.6 ns/op. The exact numbers varied somewhat for the different benchmarks, but in general the lookup added about 7 ns/op.

Unpredictable Branch Order

Astute readers may have noticed that the benchmark is still unrealistically predictable. It always follows branch 0, then 1, then 2, etc. To fix that, the slice of branch choices is randomized.

var randInputs []int

func TestMain(m *testing.M) {
  for i := 0; i < 4096; i++ {
    randInputs = append(randInputs, rand.Int())
  }

  os.Exit(m.Run())
}

func BenchmarkSwitch(b *testing.B) {
  // ...
  for i := 0; i < b.N; i++ {
    switch randInputs[i%len(ascInputs)] % 4 {
    // ...
  }
  // ...
}

This change further hinders performance. At 32 branches, the map lookup goes from 11 ns to 22 ns. Exact numbers vary based on number of branches and whether inlineable functions are used but performance is typically cut in half.

Diving Way Deeper

While the loss of performance from computed to looked up branch destination is somewhat expected due to the additional memory read, the switch from predictable to random order of visiting branch destinations is more surprising. To discover the underlying reason we turn to the Linux perf tool. It can give CPU-level statistics like cache misses and branch-prediction misses.

To avoid profiling the compilation of the test, the test binary can be precompiled.

go test -c

Then we can ask the perf tool to give us statistics on one of our predictable lookup benchmarks.

$ perf stat -e cache-references,cache-misses,branches,branch-misses ./go_map_vs_switch.test -test.bench=PredictableLookupMapNoInlineFunc512 -test.benchtime=5s

The interesting part of the output is the branch prediction statistics:

9,919,244,177      branches
10,675,162    branch-misses  # 0.11% of all branches

So branch prediction is working great when the order is predictable.

But something different happens when we run the unpredictable branch benchmark.

$ perf stat -e cache-references,cache-misses,branches,branch-misses ./go_map_vs_switch.test -test.bench=UnpredictableLookupMapNoInlineFunc512 -test.benchtime=5s

Relevant output:

3,618,549,427 branches                                                    
  451,154,480 branch-misses  # 12.47% of all branches        

The branch miss rate goes up by over 100 times.

Conclusion

The original question I wanted to answer was if there would be a performance impact in converting a switch statement to a map lookup. I assumed that switches would be faster. I was wrong. Maps are usually faster, often several times faster.

Does this mean we should prefer using maps over switches? No. While the percentage changes are substantial, the absolute time difference is small. It will only matter if the branch is executed millions of times per second and the actual work performed is minimal. Additionally, even if that is the case, the memory access pattern and branch prediction success rate have a bigger impact on performance than the choice of a switch or map.

The choice of switch or map should be made based on whichever produces the clearest code.

Posted in Development and tagged with Go