Beautiful branchless binary search

How does the benchmarking work here? I always find this kind of micro-benchmarking hard. I'd want to see results with and without a preceding cache flush, and with and without clearing the branch predictor state. Other things I find hard: 1) ensuring that the CPU is running at full(ish) speed and hasn't dropped into a slower-clocked power-saving mode for part of the test, 2) code and data alignment can have a significant effect, so I'd want to measure a range of different alignments.

Does gtest (which the author used) help with any of these things? Does anything?
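For concreteness, here is a rough hand-rolled sketch of the cold-vs-warm and alignment comparisons I mean. It is not what the author did and it does not use gtest. Everything in it is an assumption for illustration: the names `evict_caches`, `measure` and `Sample` are made up, `std::lower_bound` stands in for whichever search is being tested, the 64-byte cache line is a guess, and streaming through a big buffer only approximates a real cache flush (the buffer has to be larger than the last-level cache). It also does nothing about CPU frequency, and the "warm" run has a trained branch predictor as well as a warm cache, so the two effects are not separated.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: approximate a cache flush by touching one byte per
// (assumed) 64-byte line of a buffer that should exceed the last-level cache.
inline std::uint64_t evict_caches(std::vector<unsigned char>& junk) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < junk.size(); i += 64) {
        junk[i] = static_cast<unsigned char>(junk[i] + 1);
        sum += junk[i];
    }
    return sum;  // returned so the compiler cannot drop the loop
}

inline double elapsed_ns(Clock::time_point a, Clock::time_point b) {
    return std::chrono::duration<double, std::nano>(b - a).count();
}

inline double median(std::vector<double> v) {
    std::nth_element(v.begin(), v.begin() + v.size() / 2, v.end());
    return v[v.size() / 2];
}

struct Sample {
    double cold_ns;  // median time of a search right after eviction
    double warm_ns;  // median time of the same search repeated immediately
};

// Searches n sorted ints that start offset_elems ints into an over-allocated
// buffer, so varying offset_elems shifts the data alignment in 4-byte steps.
inline Sample measure(std::size_t n, std::size_t offset_elems, int reps,
                      std::vector<unsigned char>& junk) {
    assert(n > 0 && reps > 0);
    std::vector<int> storage(n + offset_elems);
    int* data = storage.data() + offset_elems;
    std::iota(data, data + n, 0);  // sorted values 0..n-1

    volatile std::size_t sink = 0;  // keeps the search results "used"
    std::vector<double> cold, warm;
    for (int r = 0; r < reps; ++r) {
        const int key =
            static_cast<int>((static_cast<std::size_t>(r) * 7919u) % n);
        sink = sink + static_cast<std::size_t>(evict_caches(junk));

        const auto t0 = Clock::now();
        sink = sink + static_cast<std::size_t>(
                          std::lower_bound(data, data + n, key) - data);
        const auto t1 = Clock::now();
        // Same key again: the lines it touched are now likely cached.
        sink = sink + static_cast<std::size_t>(
                          std::lower_bound(data, data + n, key) - data);
        const auto t2 = Clock::now();

        cold.push_back(elapsed_ns(t0, t1));
        warm.push_back(elapsed_ns(t1, t2));
    }
    return Sample{median(cold), median(warm)};
}
```

You would call `measure` in a loop over a few values of `offset_elems` (say 0 to 15) with a `junk` buffer of something like 64 MiB, and compare the medians. Timing a single search with `steady_clock` includes timer overhead, so the absolute numbers are only good for relative comparison. Code alignment is a separate problem that this does not touch.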