urbach.dev/realistic-benchmarks.md at e710216e5ea7be63e9521eb4d40732bac0e52d6a

2025-02-24 08:27:22 +01:00


| title | tags | created | published |
| --- | --- | --- | --- |
| Realistic Benchmarks | article software benchmarks | 2025-02-24T05:49:06Z | true |

When people talk about realistic software benchmarks, it's important to realize that absolute numbers do not matter to anyone. Nobody cares whether our algorithm took 2 nanoseconds or 3 nanoseconds to compute. That figure depends on the power and configuration of our machine and doesn't convey any useful meaning on its own.

What we really care about is how different algorithms perform relative to each other.

By itself, a number like 2 nanoseconds is meaningless. However, once we add a competing algorithm to the mix, it becomes interesting. When algorithm A takes 2 ns to compute and algorithm B takes 4 ns, we can see a relationship between the two: A is twice as fast as B.

time(A) = 0.5 * time(B)
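As a minimal sketch of how such a ratio could be obtained in practice, the snippet below times two functions and divides the results. The functions `algorithm_a` and `algorithm_b`, the helper names, and the use of Python's `timeit` module are all assumptions made for illustration; they are not part of the original text.

```python
import timeit


def algorithm_a(n: int = 1_000) -> int:
    """Hypothetical algorithm A: sum the integers below n with the builtin."""
    return sum(range(n))


def algorithm_b(n: int = 1_000) -> int:
    """Hypothetical algorithm B: the same sum with an explicit loop."""
    total = 0
    for i in range(n):
        total += i
    return total


def measure(func, number: int = 200) -> float:
    """Average time in seconds for one call of func over `number` calls."""
    return timeit.timeit(func, number=number) / number


def relative_time(time_a: float, time_b: float) -> float:
    """Return the factor in time(A) = factor * time(B)."""
    if time_b <= 0:
        raise ValueError("time_b must be positive")
    return time_a / time_b


# The numbers from the text: A takes 2 ns, B takes 4 ns.
print(relative_time(2.0, 4.0))  # 0.5

# The same ratio from real measurements; the value depends on the machine.
print(relative_time(measure(algorithm_a), measure(algorithm_b)))
```

The absolute values returned by `measure` are discarded right away; only their quotient is kept.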

The goal of a realistic benchmark is not to reproduce the timing of 2 ns and 4 ns.

The goal of a realistic benchmark is to approximate the time(A) = 0.5 * time(B) relationship as well as possible.

Sometimes, a different machine will change not only the absolute numbers, but also the relationship between the algorithms. With this realization, we can no longer treat the factor 0.5 between A and B as a fixed number. Instead, it is machine dependent, and we should generalize it as a scalar s.

time(A) = s * time(B)
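To make the machine dependence of s concrete, here is a small sketch that computes s separately for each machine. The machine names and the timings for `machine_2` are invented for illustration; only the 2 ns and 4 ns figures come from the text.

```python
# Hypothetical timings in nanoseconds; only machine_1 uses the numbers from the text.
timings_ns = {
    "machine_1": {"A": 2.0, "B": 4.0},
    "machine_2": {"A": 3.0, "B": 5.0},
}


def s_per_machine(timings: dict[str, dict[str, float]]) -> dict[str, float]:
    """Compute s = time(A) / time(B) separately for every machine."""
    return {machine: t["A"] / t["B"] for machine, t in timings.items()}


print(s_per_machine(timings_ns))  # {'machine_1': 0.5, 'machine_2': 0.6}
```

In this made-up example both the absolute numbers and s differ between the machines, so a statement about s is always a statement about one particular machine.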

Now we get a better definition of what a realistic benchmark is:

Two benchmarks can be compared in their quality by looking at how well they approximate s on the same machine.
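One way to read this definition is as a ranking by approximation error. The sketch below assumes that a trustworthy reference value for s is available on the machine in question, for example from a very long measurement run. That reference, the benchmark names, and the estimates are all hypothetical.

```python
def approximation_error(estimate: float, reference_s: float) -> float:
    """Relative error of a benchmark's estimate against the reference s."""
    return abs(estimate - reference_s) / abs(reference_s)


def better_benchmark(estimates: dict[str, float], reference_s: float) -> str:
    """Name of the benchmark whose estimate of s is closest to the reference."""
    return min(
        estimates,
        key=lambda name: approximation_error(estimates[name], reference_s),
    )


# Hypothetical estimates of s from two benchmarks run on the same machine.
estimates = {"benchmark_x": 0.62, "benchmark_y": 0.53}
print(better_benchmark(estimates, reference_s=0.5))  # benchmark_y
```

Note that neither benchmark is judged by the absolute timings it reports, only by how close its ratio lands to s.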

But how do we actually get closer to the real value of s?

One of the most significant factors is the sample size. To get a more realistic outcome, we need a lot of samples. Ideally we want an infinite number of samples, because the more we have, the closer we get to the real value of s.
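The effect of sample size can be illustrated with a small simulation. This is a sketch under stated assumptions, not a real benchmark: the timings are generated artificially, the noise model (uniform, up to 50% in either direction) is invented, and the true s is fixed at 0.5 so the estimates can be compared against it.

```python
import random
import statistics


def estimate_s(true_s: float, samples: int, rng: random.Random) -> float:
    """Estimate s from simulated noisy timings of algorithms A and B."""
    base_b = 4.0  # hypothetical mean time of B in nanoseconds
    # Assumed noise model: each sample is off by up to +/-50%, uniformly.
    times_a = [true_s * base_b * rng.uniform(0.5, 1.5) for _ in range(samples)]
    times_b = [base_b * rng.uniform(0.5, 1.5) for _ in range(samples)]
    return statistics.fmean(times_a) / statistics.fmean(times_b)


rng = random.Random(42)  # fixed seed so repeated runs print the same values
for n in (10, 1_000, 100_000):
    print(n, round(estimate_s(0.5, n, rng), 4))
```

With only a handful of samples the estimate can land noticeably away from 0.5, while large sample counts pull it close to the true value.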

This means that in the case of a web server, we want as much load as possible from our stress testing tool, because more samples will bring us closer to the real relative performance relationships of the algorithms tested. The less stress we put on the server, the worse our results become.

Some people have the misconception that a realistic benchmark should produce realistic absolute numbers. Realism in absolute numbers, especially when it comes at the cost of a worse approximation of s, is not useful to anyone.

As a benchmark developer, please do not confuse these two types of realism. Realistic absolute numbers are just volatile numbers from somebody's machine. Realistic relationships, in the form of a good approximation of s, are much more stable across different machines, and they are what people truly care about.
