Most active commenters
  • josephg(4)

←back to thread

386 points ingve | 22 comments | | HN request time: 1.549s | source | bottom
1. pizza234 ◴[] No.35738299[source]
Some information on the CMOV can be found on the Intel Optimization Reference Manual (https://cdrdv2-public.intel.com/671488/248966-046A-software-...).

Torvalds was famously critical of it (https://yarchive.net/comp/linux/cmov.html); part of the criticism is now moot though, due to low latencies on modern processors (it seems 1 cycle or less, although the instruction consumes internal flags).

His idea seems to be still applicable and consistent with Intel's recommendation: make branches predictable, and only after, use CMOV for the remaining ones. His fundametnal assumption is that "even if you were to know that something is unpredictable, it's going to be very rare.".

replies(3): >>35738345 #>>35741256 #>>35742105 #
2. 2102922286 ◴[] No.35738345[source]
This is one area where Profile-Guided Optimization (PGO) can help a lot! With PGO, you run your program on some sample input and it logs info like how many times each side of a branch was taken. From there, you can recompile your code. If the compiler sees that one side of the branch dominates, it can emit code to prioritize that branch. However if the branch counts are approximately even and the branch is hard to predict (n.b. this is technically distinct from having even branch counts), then the compiler can know that the CPU would have trouble predicting the branch, and can emit a cmov instruction instead.
replies(4): >>35738413 #>>35739365 #>>35740748 #>>35745104 #
3. berkut ◴[] No.35738413[source]
Yeah, it's mildly annoying there's no (at least to my knowledge?) compiler hint like the '__builtin_expect' ones to tell compilers that the values are very unlikely going to be predictable with enough accuracy to allow the branch predictors to be useful in general, and to use a cmov instead of the traditional branching instructions because of this.
replies(2): >>35738478 #>>35739092 #
4. dist1ll ◴[] No.35738478{3}[source]
clang has __builtin_unpredictable [0]

[0] https://clang.llvm.org/docs/LanguageExtensions.html#builtin-...

replies(2): >>35738484 #>>35747016 #
5. berkut ◴[] No.35738484{4}[source]
Oooooh :)

Thanks!

6. gpderetta ◴[] No.35739092{3}[source]
GCC has __builtin_expect_with_probability now. YMMV.
7. josephg ◴[] No.35739365[source]
I’m always surprised by rust’s performance for this reason. The compiler outputs huge binaries, chock full of bounds checks and the like. But performance doesn’t seem to suffer at all from it. On the contrary - I ported some well optimized C to rust and it ran faster.

I can only assume the compiler is marking all the bounds checks as unlikely to fail, and correctly predicted branches must be more or less free in modern CPUs. It’s the only way I can explain it.

replies(3): >>35739914 #>>35748312 #>>35750685 #
8. eska ◴[] No.35739914{3}[source]
The compiler will outright remove bounds checks that are implied by previous bounds checks.

It is a common low level optimization to add a wide boundary check in the beginning to avoid successive smaller bounds checks.

replies(1): >>35740998 #
9. xmcqdpt2 ◴[] No.35740748[source]
If your PGO program ends up in a situation which is the opposite of that you profiled, couldn't you end up with vastly worse performance than a non-optimized program?

I've always felt uncomfortable about PGO for more complex programs because of this.

replies(1): >>35745100 #
10. josephg ◴[] No.35740998{4}[source]
True. But if binary size is anything to go by, an awful lot of bounds checks still end up in the code.
replies(1): >>35741226 #
11. tomsmeding ◴[] No.35741226{5}[source]
Doesn't a rust binary also include a whole lot of standard library? C compilers assume that libc is available, a rust compiler is hardly going to assume that rust's libstd is installed on a random user's machine.
replies(1): >>35751925 #
12. zulu-inuoe ◴[] No.35741256[source]
404 on that Torvalds link
replies(2): >>35741714 #>>35742100 #
13. pizza234 ◴[] No.35741714[source]
The HN renderer is rendering the closing round bracket and semicolon as part of the HTML link, which is unfortunate for this case, but good to know :)
14. msla ◴[] No.35742100[source]
https://yarchive.net/comp/linux/cmov.html
15. msla ◴[] No.35742105[source]
https://yarchive.net/comp/linux/cmov.html
16. eklitzke ◴[] No.35745100{3}[source]
Right, this is why you should use AutoFDO nowadays not PGO. With AutoFDO you occasionally (e.g. in prod this might be something like record for 1s on average every 300s) record what branches were taken using perf-record from prod binaries, and then feed this back to the compiler.
17. eklitzke ◴[] No.35745104[source]
I commented on this elsewhere as well, but nowadays you should reach first for AutoFDO rather than PGO.
18. usefulcat ◴[] No.35747016{4}[source]
It does, but unfortunately it doesn't always work. For example (note the -march=broadwell option):

https://godbolt.org/z/n1Kz4G7GE

19. winstonewert ◴[] No.35748312{3}[source]
Rust's large file sizes are due to relying on static linking, not particularly the extra bounds checking.
20. rocqua ◴[] No.35750685{3}[source]
I believe that in release mode Rust will remove bounds checks.
replies(1): >>35750975 #
21. josephg ◴[] No.35750975{4}[source]
No, this is not true. Rust removes integer overflow checks in release mode, but it keeps array bounds checks.

You can see this pretty easily in compiler explorer. This is a simple array lookup compiled in release mode:

https://rust.godbolt.org/z/dhz34KEvj

22. josephg ◴[] No.35751925{6}[source]
Yeah it does, but that’s not enough to explain how large rust programs tend to get. The equivalent C programs, even statically linked, are usually several times smaller than their equivalents in rust.

And the “compile in the standard library” pattern will probably never be changed in rust due to monomorphization. Lots of types in the standard library need to be specialised based on the types of your structs. Even if rust had an ABI for dynamic linking (it doesn’t), structs and functions which take generic type parameters will probably never be able to be loaded from a shared library. Same as C++.