---
title: Realistic Benchmarks
tags: article software benchmarks
created: 2025-02-24T05:49:06Z
published: true
---

When people talk about realistic software benchmarks, it's important to realize that absolute numbers do not matter.
Nobody cares if our algorithm took 2 nanoseconds or 3 nanoseconds to compute.
This depends on the power and configuration of our machine and doesn't convey any useful meaning.

What we really care about is how different algorithms perform relative to each other.

By itself, a number like 2 nanoseconds is not very helpful.
However, once we add a competing algorithm to the mix, we can compare the two.
When algorithm A takes 2 ns to compute and algorithm B takes 4 ns to compute,
we can see a relationship between them: A is twice as fast as B.

```
time(A) = 0.5 * time(B)
```

The goal of a realistic benchmark is not to reproduce the timing of 2 ns and 4 ns.

The goal of a realistic benchmark is to approximate the `time(A) = 0.5 * time(B)` relationship as well as possible.

Sometimes, a different machine will lead to not only a change in absolute numbers,
but also a change in the relative performance of the algorithms.
With this realization we can no longer view the `0.5` relation between A and B as a fixed number.
Instead, it is machine-dependent and we should generalize it as a scalar `s`.

```
time(A) = s * time(B)
```
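
As a minimal sketch, here is how we might estimate `s` on our own machine in Python.
The functions `algorithm_a` and `algorithm_b` are placeholders invented for this example,
and taking the best of several `timeit` runs is just one common way to reduce noise, not the only one.

```python
import timeit


def ratio(time_a: float, time_b: float) -> float:
    """Return s such that time_a = s * time_b."""
    if time_b <= 0:
        raise ValueError("time_b must be positive")
    return time_a / time_b


def measure(func, repeat: int = 5, number: int = 1000) -> float:
    """Best-of-`repeat` wall time in seconds for `number` calls of `func`."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


# Hypothetical stand-ins for the two competing algorithms.
def algorithm_a():
    return sum(range(100))


def algorithm_b():
    return sum(range(200))


s = ratio(measure(algorithm_a), measure(algorithm_b))
print(f"time(A) = {s:.2f} * time(B) on this machine")
```

The printed value will differ from machine to machine, which is exactly why `s` is treated as machine-dependent.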

Now we get a better definition of what a realistic benchmark is:

> Two benchmarks can be compared in their quality by looking at how well they approximate `s` on the same machine.

But how do we actually get closer to the real value of `s`?

One of the most significant factors is the sample size.
In order to get a more realistic outcome, we need a lot of samples.
Ideally we want an infinite number of test samples, because the more we have, the closer our estimate gets to the real value of `s`.
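
To illustrate the effect of sample size, here is a small simulation, not a real benchmark.
It assumes hypothetical true costs of 2 ns and 4 ns (so the real `s` is 0.5) and a made-up noise model
where every measurement is off by a random factor between 0.5 and 1.5.

```python
import random
import statistics

TRUE_A_NS = 2.0  # assumed true cost of algorithm A
TRUE_B_NS = 4.0  # assumed true cost of algorithm B
TRUE_S = TRUE_A_NS / TRUE_B_NS


def noisy_sample(true_ns: float, rng: random.Random) -> float:
    """One simulated measurement with multiplicative noise in [0.5, 1.5]."""
    return true_ns * rng.uniform(0.5, 1.5)


def estimate_s(n: int, rng: random.Random) -> float:
    """Estimate s from the mean of n noisy samples per algorithm."""
    mean_a = statistics.fmean(noisy_sample(TRUE_A_NS, rng) for _ in range(n))
    mean_b = statistics.fmean(noisy_sample(TRUE_B_NS, rng) for _ in range(n))
    return mean_a / mean_b


rng = random.Random(42)  # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    s = estimate_s(n, rng)
    print(f"n={n:>7}: s = {s:.4f} (error {abs(s - TRUE_S):.4f})")
```

With this noise model, the error of the estimate tends to shrink as `n` grows, although any single run with few samples can get lucky or unlucky.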

This means that in the case of a web server, we want as much load as possible from our stress testing tool,
because more samples will bring us closer to the real relative performance relationships of the algorithms tested.
The less stress we put on the server, the worse our results become.

## The woods

Just imagine you're out in the woods with a crossbow.
Now a bear jumps out of nowhere and you are fighting for your life.
You are going to need a couple of bolts to stop the bear running at you, and you need to shoot them as fast as possible.

You bought that crossbow based on a benchmark published by an outlet that focuses on crossbows for beginners.
Because it's focused on beginners, the benchmark has a hard-capped reload speed of 1 bolt per 5 seconds.
Even though there are crossbows that can be reloaded in 3 seconds, the benchmark would not reflect this,
because it is not targeted at experts and assumes that everybody is clumsy.

The benchmark showed two crossbows, A and B, but it concluded that both let you reload 1 bolt every 5 seconds,
even though B can actually be reloaded in 3 seconds. But that result was not published.
So you ended up buying crossbow A instead of B, leading to your death against the bear.
The inaccuracy of the benchmark led to your downfall.

In a life or death situation, you want the fastest crossbow out there.
You want to see the differences in reload speed when the crossbows are operated at the highest level.
Because when shit hits the fan, you want the sharpest tool in the shed.

## Conclusion

As a public benchmark developer, please do not confuse these two types of realism: realistic absolute numbers and a realistic relative relationship.
Getting more "realistic" numbers by throttling your benchmark just ends up hurting people because you focused on absolute numbers.
A realistic approximation of `s` performed under maximum load is a much more stable
result across different machines and this is what people truly care about.
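
The crossbow story can be put into numbers with a short sketch.
It uses the reload times from the example and assumes that crossbow A really needs 5 seconds;
the function name `published_reload_seconds` is made up for this illustration.

```python
def published_reload_seconds(actual_seconds: float, cap_seconds: float = 5.0) -> float:
    """A capped benchmark never reports anything faster than its cap."""
    return max(actual_seconds, cap_seconds)


# Actual reload times: B is taken from the story, A is an assumption.
actual = {"A": 5.0, "B": 3.0}
published = {name: published_reload_seconds(t) for name, t in actual.items()}

s_actual = actual["A"] / actual["B"]           # time(A) = s * time(B)
s_published = published["A"] / published["B"]

print(f"uncapped:  s = {s_actual:.2f}")
print(f"capped:    s = {s_published:.2f}")
```

The capped benchmark reports an `s` of 1, so the two crossbows look identical, while the uncapped numbers show that A takes about 1.67 times as long as B.
A throttled load generator has the same effect on a web server benchmark: once both servers hit the cap, the difference between them disappears from the results.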