Benchmarking with Go

The authors of the Go language have invested effort to ensure that benchmarking and profiling are well supported by the language's tooling. This puts Go ahead of many other languages, but the tools provided still fall short of being sufficient: the standard tools don't handle measurement uncertainty, and so cannot (on their own) correctly answer basic performance questions.

There is some irony in the fact that, while computers are designed to be very deterministic in their calculations, modern machines show quite a bit of uncertainty in how quickly they complete those calculations. The reasons for these variations in speed are not really in scope for this post; they are a consequence of the complexity of modern CPUs and the surrounding infrastructure. Further, measurements should ideally be made on the same hardware that will be used in production, which can mean measuring on shared machines in the cloud, where many of the standard recommendations for minimizing measurement noise cannot be applied.

To handle measurement uncertainty, the first step is to measure the measurement uncertainty itself. This is done by repeating the measurements to obtain multiple samples, from which you can extract a mean time and a measure of its spread (the standard deviation). Selecting the correct number of samples is a whole field on its own, but just start with N=5. If you find that more precision is required, you can increase the count.
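
For concreteness, the statistics in question are just the sample mean and the sample standard deviation. A minimal Go sketch (the sample values here are invented purely for illustration):

package main

import (
    "fmt"
    "math"
)

// meanStddev returns the sample mean and the sample standard deviation
// (n-1 denominator) of a slice of timing measurements.
func meanStddev(samples []float64) (mean, stddev float64) {
    n := float64(len(samples))
    for _, s := range samples {
        mean += s
    }
    mean /= n

    var sumSq float64
    for _, s := range samples {
        d := s - mean
        sumSq += d * d
    }
    stddev = math.Sqrt(sumSq / (n - 1))
    return mean, stddev
}

func main() {
    // Hypothetical ns/op results from five repeated benchmark runs.
    samples := []float64{65.2, 67.1, 63.9, 66.4, 64.8}
    mean, stddev := meanStddev(samples)
    fmt.Printf("mean %.1f ns/op, stddev %.1f ns/op\n", mean, stddev)
}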

go test -bench=. -count=5  # don't do this!

This leads to the next problem with the tooling. Using -count=N, it is easy to get multiple samples for your benchmarks, but this approach can lead to systematic errors. Imagine that you start running the benchmarks with a cold CPU: it opens up at full throttle, and as it heats up, it begins to throttle its speed. This can make later benchmarks look slower than earlier ones. Because the benchmarks were run in a fixed order, the measurements are confounded with other time-correlated effects. Ideally, the order of the benchmarks should be randomized; a simpler arrangement, though, is to alternate between the versions being compared.

Once the data has been collected, there is a very good tool for loading the measurements and extracting the necessary statistics: benchstat. A word of caution, however: p-values are notoriously difficult to interpret correctly. See the section on p-values below.
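
If benchstat is not already installed, one way to get it is with the standard tooling (this fetches it from the upstream golang.org/x/perf module):

go install golang.org/x/perf/cmd/benchstat@latest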

Test Script

The following script assumes that you are trying to compare benchmarks between uncommitted code in the working directory and the previous commit. Many readers will be familiar enough with git to see how the script could be modified to benchmark different branches or commits instead.

# Configure the benchmarks
BENCH=.
BENCHTIME=2s

# Initial run.  This creates or truncates the data files
go test -bench=${BENCH} -benchtime=${BENCHTIME} > new.stat
git stash > /dev/null
DESCRIBE=$(git describe)
echo "Reference: ${DESCRIBE}"
go test -bench=${BENCH} -benchtime=${BENCHTIME} > ${DESCRIBE}.stat
git stash pop > /dev/null

# Repeat the benchmarks
for i in 2 3 4 5
do
    echo "Repeat $i"
    go test -bench=${BENCH} -benchtime=${BENCHTIME} >> new.stat
    git stash > /dev/null
    go test -bench=${BENCH} -benchtime=${BENCHTIME} >> ${DESCRIBE}.stat
    git stash pop > /dev/null
done

# Display results
$GOPATH/bin/benchstat ${DESCRIBE}.stat new.stat

The above script works by running the selected benchmarks on the working files and saving the results in a file (new.stat). The working directory is then stashed and the benchmarks are run again, this time against the last commit, with the results saved to a file named after the output of git describe. The stash is then popped to restore the working files. On this first pass, the data files are created or truncated; the reason to truncate is, again, to avoid carrying over stale results and the time-correlated confounding they would introduce.

The above sequence is then repeated, but the benchmark results are now appended to the data files. In this manner we build up our measurement samples while also spreading them evenly across the entire run.
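
Each .stat file simply accumulates the raw go test output, one benchmark line per repetition. The numbers below are invented purely to show the shape of the data; benchstat ignores the surrounding goos/goarch/PASS lines:

BenchmarkWriteString-2   35881636   65.9 ns/op
BenchmarkWriteString-2   36014538   66.4 ns/op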

Finally, benchstat is run to analyze the data files, and present the results.

Starting with Go version 1.17, tests and benchmarks can be shuffled using -shuffle=on. This would be a worthwhile addition to the above script.
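
For example, each go test invocation in the script could gain the flag (shown here for the first run; the other invocations would change in the same way):

go test -bench=${BENCH} -benchtime=${BENCHTIME} -shuffle=on > new.stat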

Example

Consider the following diff, where a call to io.WriteString is replaced with an unchecked type assertion to io.StringWriter followed by a direct call to WriteString.

func BenchmarkWriteString(b *testing.B) {
        w := ioutil.Discard
 
        for i := 0; i < b.N; i++ {
-           // This is the original version (committed)
-           io.WriteString(w, "This is a test.")
+           // And this version is in the working directory
+           w.(io.StringWriter).WriteString("This is a test.")
        }
}
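
For reference, a self-contained version of the working-directory side of that diff might look like the following (the package name is arbitrary, and the file must end in _test.go):

package bench

import (
    "io"
    "io/ioutil"
    "testing"
)

func BenchmarkWriteString(b *testing.B) {
    w := ioutil.Discard

    for i := 0; i < b.N; i++ {
        // Assert to io.StringWriter and call WriteString directly,
        // skipping the interface check that io.WriteString performs.
        w.(io.StringWriter).WriteString("This is a test.")
    }
}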

With a benchtime of 1 second and 5 samples, the results of running the above script are:

name           old time/op  new time/op  delta
WriteString-2  65.9ns ± 7%  51.6ns ± 4%  -21.75%  (p=0.008 n=5+5)

For the old time and the new time, benchstat also reports the error (presumably a 95% confidence interval). Unfortunately, benchstat does not report a confidence interval for the delta. Given that the old time and new time are roughly equal, the error for the delta is roughly 11%, so the confidence interval is approximately (-33%, -11%). We can therefore conclude that the code change did indeed decrease the time/op. Had the interval included zero, it would have indicated that there wasn't enough precision to determine whether the change improved or degraded performance. Even when the interval excludes zero, the reported value of -21.75% should be viewed as only an approximation.
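
Spelled out, the rough arithmetic behind that interval, treating the two relative errors as simply additive (a conservative estimate), is:

error(delta) ≈ 7% + 4% ≈ 11%
delta        ≈ -21.75% ± 11%  →  approximately (-33%, -11%)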

If the number of repetitions is increased from 5 to 10, the results are as follows:

name           old time/op  new time/op  delta
WriteString-2  61.8ns ± 2%  51.0ns ± 4%  -17.51%  (p=0.000 n=9+10)

Here, we can see that the confidence interval for the old time/op has decreased. Surprisingly, this did not happen for the new time/op. However, the reported error is itself only an estimate, subject to its own measurement error, and so will vary from run to run. The measurement of the delta is now refined to (-24%, -12%).

The improvement in the confidence interval might seem paltry. Unfortunately, because the standard error shrinks only with the square root of the number of samples, halving the error requires quadrupling the number of samples.

p-values

As stated earlier, p-values are notoriously difficult to interpret correctly. People try to use p-values to evaluate whether or not some hypothesis is true, but they actually work the other way around: a p-value assumes that some hypothesis is true, and calculates the probability of observing the measured data. One technique to avoid getting trapped is to remember that p-values can disprove something, but they can't prove anything.

In the context of benchmarking, the hypothesis that the average times are equal is a very strange one. Unless you are comparing a program to itself, it is virtually guaranteed that two different approaches will have different speeds. A priori, the average times are different; the question is by how much. In this context, a very small p-value disproves the hypothesis that the two approaches are equally fast on average, which was already known. Conversely, larger p-values are not evidence of anything; they indicate only that the tests lack sufficient precision, and that the sample size needs to be increased.

Confidence intervals for the delta should be preferred over p-values, but unfortunately p-values persist.

Summary

Performance benchmarking is a useful technique. Improving your code's speed requires careful measurement, and, in particular, attention to measurement uncertainty. Fortunately, with a little additional scripting, the measurement uncertainty can also be measured.
