Go Performance Observations
In the course of optimizing the pgx PostgreSQL driver, I observed a number of performance characteristics of Go that I hope you will find useful.
Measure First
"Premature optimization is the root of all evil" -- Donald Knuth
Go has two tools that are invaluable in performance tuning: a profiler and a benchmarking tool. The profiler helps find the trouble spots, and benchmarks show the results of an optimization. See How to write benchmarks in Go by Dave Cheney and Profiling Go Programs by Russ Cox for introductions to these tools. Below are several specific techniques I found with benchmarks and the profiler. Source code for the benchmarks is on GitHub.
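For reference, a profiled benchmark run looks something like this: go test leaves the compiled .test binary in the current directory when profiling is enabled, and go tool pprof inspects the profile against it.

    go test -test.bench=. -cpuprofile=cpu.out
    go tool pprof go_pgx_perf_observations.test cpu.out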
Reuse Memory
Every allocation of memory has several potential costs. The Go runtime must ensure the memory is initialized to its zero value. The garbage collector must track references to the value and eventually clean it up. Additional memory usage also makes CPU cache hits less likely.
This simple example fills a slice of up to 1024 bytes with 1s.
func BenchmarkNewBuffers(b *testing.B) {
    for i := 0; i < b.N; i++ {
        n := rand.Intn(1024)
        buf := make([]byte, n)

        // Do something with buffer
        for j := 0; j < n; j++ {
            buf[j] = 1
        }
    }
}

func BenchmarkReuseBuffers(b *testing.B) {
    sharedBuf := make([]byte, 1024)

    for i := 0; i < b.N; i++ {
        n := rand.Intn(1024)
        buf := sharedBuf[0:n]

        // Do something with buffer
        for j := 0; j < n; j++ {
            buf[j] = 1
        }
    }
}
Note the -test.benchmem flag for measuring memory allocations.
jack@hk-47~/dev/go/src/github.com/jackc/go_pgx_perf_observations$ go test -test.bench=Buffers -test.benchmem
testing: warning: no tests to run
PASS
BenchmarkNewBuffers 2000000 1033 ns/op 540 B/op 0 allocs/op
BenchmarkReuseBuffers 5000000 436 ns/op 0 B/op 0 allocs/op
ok github.com/jackc/go_pgx_perf_observations 5.704s
Allocating a new buffer each iteration is substantially slower. Obviously, the more work done on the buffer, the less relative impact eliminating the allocation has. Surprisingly, both versions show 0 allocs/op. How can that be? Let's rerun the test with the -gcflags=-m option to ask Go for the details.
jack@hk-47~/dev/go/src/github.com/jackc/go_pgx_perf_observations$ go test -gcflags=-m -test.bench=Buffers -test.benchmem
# github.com/jackc/go_pgx_perf_observations_test
<snip/>
./bench_test.go:15: BenchmarkNewBuffers b does not escape
./bench_test.go:18: BenchmarkNewBuffers make([]byte, n) does not escape
./bench_test.go:27: BenchmarkReuseBuffers b does not escape
./bench_test.go:28: BenchmarkReuseBuffers make([]byte, 1024) does not escape
<snip/>
The Go compiler performs escape analysis. If an allocation does not escape the function, it can be stored on the stack and avoid the garbage collector entirely.
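As a rough illustration (these two functions are mine, not from the benchmark file), a value that is returned must escape to the heap, while one used only inside the function can stay on the stack:

    // escapes to the heap: the caller keeps a reference to the slice
    func newBuffer() []byte {
        return make([]byte, 1024)
    }

    // does not escape: the slice never leaves the function
    func fillLocal() byte {
        buf := make([]byte, 1024)
        for j := range buf {
            buf[j] = 1
        }
        return buf[0]
    }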
So in a real-world system, where many allocations do escape to the heap, reducing allocations can have an even bigger impact.
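When a buffer must escape anyway, for example because it outlives a single call or is shared between goroutines, sync.Pool is one way to reuse memory. A minimal sketch (the bufPool and process names are illustrative, not from the benchmark code):

    var bufPool = sync.Pool{
        New: func() interface{} { return make([]byte, 1024) },
    }

    func process(n int) {
        buf := bufPool.Get().([]byte) // reuse a previously allocated buffer when available
        defer bufPool.Put(buf)        // return it for the next caller
        if n > len(buf) {
            n = len(buf)
        }
        for j := 0; j < n; j++ {
            buf[j] = 1
        }
    }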
Buffered IO
Go does not buffer IO by default. The bufio package provides buffered IO. This can make a massive difference in performance.
func BenchmarkUnbufferedFileWrite(b *testing.B) {
    file, err := os.Create("unbuffered.test")
    if err != nil {
        b.Fatalf("Unable to create file: %v", err)
    }
    defer func() {
        file.Close()
        os.Remove(file.Name())
    }()

    for i := 0; i < b.N; i++ {
        fmt.Fprintln(file, "Hello world")
    }
}

func BenchmarkBufferedFileWrite(b *testing.B) {
    file, err := os.Create("buffered.test")
    if err != nil {
        b.Fatalf("Unable to create file: %v", err)
    }
    defer func() {
        file.Close()
        os.Remove(file.Name())
    }()

    writer := bufio.NewWriter(file)
    defer writer.Flush()

    for i := 0; i < b.N; i++ {
        fmt.Fprintln(writer, "Hello world")
    }
}
jack@hk-47~/dev/go/src/github.com/jackc/go_pgx_perf_observations$ go test -test.bench=Write
testing: warning: no tests to run
PASS
BenchmarkUnbufferedFileWrite 1000000 2588 ns/op
BenchmarkBufferedFileWrite 10000000 271 ns/op
ok github.com/jackc/go_pgx_perf_observations 5.626s
A simple test of writing "Hello world" repeatedly to a file shows a greater than 9x performance improvement from using a buffered writer.
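The same technique applies to network connections, where each unbuffered write can turn into a separate system call. A sketch (sendGreetings is a made-up example, assuming conn is an established net.Conn):

    func sendGreetings(conn net.Conn) error {
        writer := bufio.NewWriter(conn) // coalesce many small writes into fewer syscalls
        for i := 0; i < 100; i++ {
            fmt.Fprintln(writer, "Hello world")
        }
        return writer.Flush() // nothing is guaranteed to reach the wire until Flush
    }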
Binary vs. Text Formats
PostgreSQL allows data to be transmitted in either binary or text format. The binary format is far faster to process than the text format, because the only processing typically needed is a conversion from network byte order. The binary format should also be more efficient for the PostgreSQL server, and it may be a more compact transmission format. Here, however, we will isolate our benchmarks to the parsing of int32 and time.Time values.
func BenchmarkParseInt32Text(b *testing.B) {
    s := "12345678"
    expected := int32(12345678)
    for i := 0; i < b.N; i++ {
        n, err := strconv.ParseInt(s, 10, 32)
        if err != nil {
            b.Fatalf("strconv.ParseInt failed: %v", err)
        }
        if int32(n) != expected {
            b.Fatalf("strconv.ParseInt decoded %v instead of %v", n, expected)
        }
    }
}

func BenchmarkParseInt32Binary(b *testing.B) {
    buf := make([]byte, 4)
    binary.BigEndian.PutUint32(buf, 12345678)
    expected := int32(12345678)
    for i := 0; i < b.N; i++ {
        n := int32(binary.BigEndian.Uint32(buf))
        if n != expected {
            b.Fatalf("Got %v instead of %v", n, expected)
        }
    }
}
func BenchmarkParseTimeText(b *testing.B) {
    s := "2011-10-25 09:12:34.345921-05"
    expected, _ := time.Parse("2006-01-02 15:04:05.999999-07", s)
    for i := 0; i < b.N; i++ {
        t, err := time.Parse("2006-01-02 15:04:05.999999-07", s)
        if err != nil {
            b.Fatalf("time.Parse failed: %v", err)
        }
        if t != expected {
            b.Fatalf("time.Parse decoded %v instead of %v", t, expected)
        }
    }
}

// PostgreSQL binary format is an int64 of the number of microseconds since Y2K
func BenchmarkParseTimeBinary(b *testing.B) {
    microsecFromUnixEpochToY2K := int64(946684800 * 1000000)
    s := "2011-10-25 09:12:34.345921-05"
    expected, _ := time.Parse("2006-01-02 15:04:05.999999-07", s)

    microsecSinceUnixEpoch := expected.Unix()*1000000 + int64(expected.Nanosecond())/1000
    microsecSinceY2K := microsecSinceUnixEpoch - microsecFromUnixEpochToY2K

    buf := make([]byte, 8)
    binary.BigEndian.PutUint64(buf, uint64(microsecSinceY2K))

    for i := 0; i < b.N; i++ {
        microsecSinceY2K := int64(binary.BigEndian.Uint64(buf))
        microsecSinceUnixEpoch := microsecFromUnixEpochToY2K + microsecSinceY2K
        t := time.Unix(microsecSinceUnixEpoch/1000000, (microsecSinceUnixEpoch%1000000)*1000)
        if t != expected {
            b.Fatalf("Got %v instead of %v", t, expected)
        }
    }
}
jack@hk-47~/dev/go/src/github.com/jackc/go_pgx_perf_observations$ go test -test.bench=Parse
testing: warning: no tests to run
PASS
BenchmarkParseInt32Text 50000000 62.8 ns/op
BenchmarkParseInt32Binary 500000000 3.40 ns/op
BenchmarkParseTimeText 2000000 775 ns/op
BenchmarkParseTimeBinary 100000000 15.4 ns/op
ok github.com/jackc/go_pgx_perf_observations 9.159s
Parsing an int32 from text takes over 18x longer than simply reading it in binary. Parsing a time from text takes roughly 50x longer. The absolute numbers are small, but they add up. In general, binary protocols are vastly faster than text protocols.
More Binary Tricks
When reading or writing a binary stream, binary.Read with an io.Reader or binary.Write with an io.Writer is very convenient. But working directly with a []byte via binary.BigEndian.Uint* and binary.BigEndian.PutUint* is more efficient.
func BenchmarkBinaryWrite(b *testing.B) {
    buf := &bytes.Buffer{}
    for i := 0; i < b.N; i++ {
        buf.Reset()
        for j := 0; j < 10; j++ {
            binary.Write(buf, binary.BigEndian, int32(j))
        }
    }
}

func BenchmarkBinaryPut(b *testing.B) {
    var writebuf [1024]byte
    for i := 0; i < b.N; i++ {
        buf := writebuf[0:0]
        for j := 0; j < 10; j++ {
            tmp := make([]byte, 4)
            binary.BigEndian.PutUint32(tmp, uint32(j))
            buf = append(buf, tmp...)
        }
    }
}
jack@hk-47~/dev/go/src/github.com/jackc/go_pgx_perf_observations$ go test -test.bench=BenchmarkBinary -test.benchmem
testing: warning: no tests to run
PASS
BenchmarkBinaryWrite 1000000 1075 ns/op 80 B/op 5 allocs/op
BenchmarkBinaryPut 20000000 113 ns/op 0 B/op 0 allocs/op
ok github.com/jackc/go_pgx_perf_observations 3.485s
Not only is binary.Write much slower, it also incurs additional memory allocations. Just this change made a substantial improvement to pgx performance.
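The temporary four byte slice in BenchmarkBinaryPut is not even necessary: a variation of the inner loop grows the destination slice first and writes directly into its tail.

    for j := 0; j < 10; j++ {
        buf = append(buf, 0, 0, 0, 0)                           // grow buf by four bytes
        binary.BigEndian.PutUint32(buf[len(buf)-4:], uint32(j)) // encode into the tail
    }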
Measure Last
Let me close with another warning to measure before committing to optimizations. One use case I wanted to optimize was a web API that served JSON produced directly by PostgreSQL. The normal way to do this is to read the JSON into a string, then write that string to the HTTP io.Writer. But wouldn't it be so much faster to copy directly from the PostgreSQL io.Reader to the HTTP io.Writer? It seems obvious that it should be faster, but that intuition turned out to be wrong: benchmarks revealed the copy approach was actually slower in the vast majority of cases, and only marginally faster in the best cases.
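For illustration, the two approaches look roughly like this (serveJSON and serveJSONCopy are hypothetical names, with io.Reader standing in for the PostgreSQL result and http.ResponseWriter for the HTTP side):

    func serveJSON(w http.ResponseWriter, jsonReader io.Reader) error {
        // The normal way: read the JSON into memory, then write it out.
        body, err := ioutil.ReadAll(jsonReader)
        if err != nil {
            return err
        }
        _, err = w.Write(body)
        return err
    }

    func serveJSONCopy(w http.ResponseWriter, jsonReader io.Reader) error {
        // The "obvious" optimization: stream straight through.
        // Benchmarks showed this was usually slower, not faster.
        _, err := io.Copy(w, jsonReader)
        return err
    }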
So once again: measure first and measure last.