> Robust statistics with p-values (not just min/max, compensation for multiple hypotheses, no Gaussian assumptions)
This is not included in hyperfine's core, but we do have scripts to compute "advanced" statistics and to perform t-tests: https://github.com/sharkdp/hyperfine/tree/master/scripts
Please feel free to comment here if you think it should be included in hyperfine itself: https://github.com/sharkdp/hyperfine/issues/523
> Automatic isolation to the greatest extent possible (given appropriate permissions)
This sounds interesting. Please feel free to open a ticket if you have any ideas.
> Interleaved execution, in case something external changes mid-way.
Please see the discussion here: https://github.com/sharkdp/hyperfine/issues/21
> It just… runs things N times and then does a naïve average/min/max?
While there is nothing wrong with computing the average, minimum, and maximum, that is not all hyperfine does. We also compute modified Z-scores to detect outliers, and we issue a warning if we think the mean value is influenced by them. We also warn if the first run of a command took significantly longer than the rest of the runs, and suggest countermeasures.
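For the curious, outlier detection via modified Z-scores can be sketched in a few lines of Python. This is a sketch of the general technique, not hyperfine's exact implementation; the 3.5 cutoff is the commonly used Iglewicz–Hoaglin threshold, which may differ from the one hyperfine uses:

```python
import statistics

def modified_z_scores(times):
    # Robust Z-scores based on the median and the median absolute
    # deviation (MAD) instead of the mean and standard deviation,
    # so the outliers themselves do not distort the detection.
    med = statistics.median(times)
    mad = statistics.median(abs(t - med) for t in times)
    if mad == 0:
        return [0.0] * len(times)
    return [0.6745 * (t - med) / mad for t in times]

def outliers(times, threshold=3.5):
    # 3.5 is the Iglewicz-Hoaglin cutoff commonly used with modified
    # Z-scores; hyperfine's exact threshold may differ.
    zs = modified_z_scores(times)
    return [t for t, z in zip(times, zs) if abs(z) > threshold]
```

For example, in a run like `[0.50, 0.52, 0.51, 0.49, 1.80]`, the last measurement would be flagged, since it inflates the mean well above the median.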
Depending on the benchmark, I tend to look at either the `min` or the `mean`. If I need something more fine-grained, I export the results and use the scripts referenced above.
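As an illustration of that workflow, a JSON export (produced with `--export-json`) can be post-processed in a few lines. The `results`, `command`, and `times` fields are part of hyperfine's actual JSON schema; the summary itself is my own sketch, not one of the linked scripts:

```python
import json
import statistics

def summarize(path):
    # Summarize a hyperfine JSON export: one dict per benchmarked
    # command, with a few robust and non-robust statistics.
    with open(path) as f:
        data = json.load(f)
    summaries = []
    for result in data["results"]:
        times = result["times"]  # individual run times in seconds
        summaries.append({
            "command": result["command"],
            "min": min(times),
            "median": statistics.median(times),
            "mean": statistics.mean(times),
        })
    return summaries
```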
> At that rate, one could just as well use a shell script and eyeball the results.
Statistical analysis (basic as you may consider it) is just one reason why I wrote hyperfine. The other is that I wanted to make benchmarking easy to use. I use warmup runs, preparation commands, and parametrized benchmarks all the time, and I frequently use the Markdown or JSON export to generate graphs or histograms. This is my personal experience. If you are not interested in any of these features, you can obviously "just as well use a shell script".
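For concreteness, these are representative invocations of the features mentioned above. The flags are hyperfine's actual ones; the benchmarked commands are placeholders:

```shell
# Warmup runs fill caches before measurement starts
hyperfine --warmup 3 'grep -R TODO src'

# Preparation command executed before every timing run
hyperfine --prepare 'make clean' 'make'

# Parametrized benchmark: {threads} is substituted with 1..8
hyperfine --parameter-scan threads 1 8 'make -j {threads}'

# Export for tables, graphs, or histograms
hyperfine --export-markdown results.md --export-json results.json 'sleep 0.3'
```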