---
title: Realistic Benchmarks
tags: article software benchmarks
created: 2025-02-24T05:49:06Z
published: true
---

When people talk about realistic software benchmarks, it's important to realize that absolute numbers do not matter.
Nobody cares whether our algorithm took 2 nanoseconds or 3 nanoseconds to compute.
That number depends on the power and configuration of our machine and doesn't convey any useful meaning on its own.

What we really care about is how different algorithms perform relative to each other.

By itself, a number like 2 nanoseconds is not very helpful.
However, once we add a competing algorithm to the mix, we have something to compare it against.
When algorithm A takes 2 ns to compute and algorithm B takes 4 ns to compute,
we can see a relationship between A and B: A is twice as fast as B.

```
time(A) = 0.5 * time(B)
```

The goal of a realistic benchmark is not to reproduce the timings of 2 ns and 4 ns.

The goal of a realistic benchmark is to approximate the `time(A) = 0.5 * time(B)` relationship as closely as possible.

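To make the point concrete, here is a minimal sketch in Python. The function name `relative_time` and the second set of timings are assumptions made up for illustration: they model a machine that is uniformly three times slower. The absolute numbers change, the ratio does not.

```python
def relative_time(time_a: float, time_b: float) -> float:
    """Return the factor s such that time(A) = s * time(B)."""
    if time_b <= 0:
        raise ValueError("time_b must be positive")
    return time_a / time_b


# Timings in nanoseconds. The first pair is the example from the text,
# the second pair is a hypothetical machine that is uniformly 3x slower.
fast_machine = {"A": 2.0, "B": 4.0}
slow_machine = {"A": 6.0, "B": 12.0}

s_fast = relative_time(fast_machine["A"], fast_machine["B"])
s_slow = relative_time(slow_machine["A"], slow_machine["B"])

print(s_fast, s_slow)  # prints "0.5 0.5"
```

Both machines report completely different timings, yet a reader of either benchmark would draw the same conclusion about A and B.
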
Sometimes, a different machine changes not only the absolute numbers
but also the relative relationship between the algorithms.
With this realization, we can no longer view the `0.5` relation between A and B as a fixed number.
Instead, it is machine-dependent and we should generalize it as a scalar `s`.

```
time(A) = s * time(B)
```

Now we get a better definition of what a realistic benchmark is:

> Two benchmarks can be compared in their quality by looking at how well they approximate `s` on the same machine.

But how do we actually get closer to the real value of `s`?

One of the most significant factors is the sample size.
To get a more realistic outcome, we need a lot of samples.
Ideally, we want an infinite number of samples, because the more we have, the closer our estimate gets to the real value of `s`.

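The effect of the sample size can be sketched with a small simulation. Everything in it is an assumption for illustration: the function `estimate_s`, the Gaussian noise model, and the average timings of 2 ns and 4 ns borrowed from the earlier example, which make the real value of `s` equal to 0.5.

```python
import random
import statistics


def estimate_s(samples: int, rng: random.Random) -> float:
    """Estimate s = time(A) / time(B) from noisy simulated timings."""
    # Assumed model: A averages 2 ns and B averages 4 ns,
    # both with Gaussian measurement noise (standard deviation 0.5 ns).
    times_a = [rng.gauss(2.0, 0.5) for _ in range(samples)]
    times_b = [rng.gauss(4.0, 0.5) for _ in range(samples)]
    return statistics.fmean(times_a) / statistics.fmean(times_b)


rng = random.Random(42)  # fixed seed so repeated runs print the same values

for n in (10, 1_000, 100_000):
    s = estimate_s(n, rng)
    print(f"{n:>7} samples: s = {s:.4f}, error = {abs(s - 0.5):.4f}")
```

With only a handful of samples the estimate can land noticeably off 0.5, while large sample counts tend to settle very close to it.
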
This means that in the case of a web server, we want as much load as possible from our stress testing tool,
because more samples will bring us closer to the real relative performance of the algorithms tested.
The less stress we put on the server, the worse our results become.

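One way to picture what low load does to the result is a deliberately simplified saturation model. It is an assumption of this sketch, not a measurement: a server answers at most as many requests as it receives and at most as many as it can handle. The capacities and the name `measured_throughput` are hypothetical.

```python
def measured_throughput(capacity: float, offered_load: float) -> float:
    """Requests per second a server completes under a given offered load."""
    # Simplified model: the result is limited by whichever is smaller,
    # the load we send or the capacity of the server.
    return min(capacity, offered_load)


# Hypothetical capacities in requests per second: A is twice as fast as B.
capacity_a = 20_000.0
capacity_b = 10_000.0

for offered_load in (1_000.0, 10_000.0, 50_000.0):
    a = measured_throughput(capacity_a, offered_load)
    b = measured_throughput(capacity_b, offered_load)
    # Time per request is the inverse of throughput,
    # so s = time(A) / time(B) = throughput(B) / throughput(A).
    print(f"load {offered_load:>8.0f} req/s: s = {b / a:.2f}")
```

In this model the two lower load levels report `s = 1.00`, as if A and B were equally fast. Only the load that saturates both servers recovers `s = 0.50`.
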
## The woods

Just imagine you're out in the woods with a crossbow.
Now a bear jumps out of nowhere and you are fighting for your life.
You are going to need a couple of bolts to stop the bear running at you, and you need to shoot them as fast as possible.

You bought that crossbow based on a benchmark published by an outlet that focuses on crossbows for beginners.
Because it's focused on beginners, the benchmark hard-caps the reloading speed at 1 bolt per 5 seconds.
Even though there are crossbows that can be reloaded in 3 seconds, the benchmark would not reflect this,
because it is not targeted at experts and assumes that everybody is clumsy.

The benchmark showed two crossbows, A and B, but it concluded that both let you reload 1 bolt every 5 seconds,
even though B can actually be reloaded in 3 seconds.
That result was never published.
So you ended up buying crossbow A instead of B, leading to your death against the bear.
The inaccuracy of the benchmark led to your downfall.

In a life-or-death situation, you want the fastest crossbow out there.
You want to see the differences in reload speed when reloading is performed at the highest level.
Because when shit hits the fan, you want the sharpest tool in the shed.

## Conclusion

As a public benchmark developer, please do not confuse these two types of realism.
Getting more "realistic" numbers by throttling your benchmark just ends up hurting people because you focused on absolute numbers.
A realistic approximation of `s` measured under maximum load is a much more stable
result across different machines, and this is what people truly care about.