
386 points ingve | 3 comments
SleepyMyroslav ◴[] No.35738552[source]
If being branchless is an important property of the algorithm, then it is better to enforce it. Or at least test for it. If his GCC version gets an update and stops producing the assembly he wants, no one will ever know.

Which brings us back to the regular discussion: C (and C++) does not match hardware anymore. There is no real control over important properties of generated code. Programmers need tools to control what they write. Plug-and-pray compilation is not a solid engineering approach.

replies(6): >>35738629 #>>35738633 #>>35738677 #>>35738943 #>>35741009 #>>35745436 #
1. dist1ll ◴[] No.35741009[source]
> Which brings us back to the regular discussion: C (and C++) does not match hardware anymore. There is no real control over important properties of generated code. Programmers need tools to control what they write. Plug-and-pray compilation is not a solid engineering approach.

Not sure how one relates to the other. Do you want more fine-grained control over emitted assembly? Or do you want a general-purpose CPU that exposes microarchitecture details (i.e. "matching hardware")?

Only the former can be solved by tooling.

replies(1): >>35741788 #
2. waynecochran ◴[] No.35741788[source]
C++ could add an explicit conditional move, I suppose. The types of `x` and `y` would have to be restricted:

      x = std::cmove(y,flag);
The compiler would be slightly more compelled to use a hardware conditional move than in the following case:

      if (flag) x = y;
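For illustration, here is a minimal sketch of what such a helper could look like today, written with a bit mask instead of an `if`. The name `cmove` and its three-argument signature are assumptions made for this example (no `std::cmove` exists in the standard library), and nothing in it forces the compiler to emit an actual conditional-move instruction; it only removes the branch from the source.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper, NOT part of the C++ standard library.
// Returns y when flag is true and x otherwise, using only arithmetic and
// bitwise operations. Restricted to unsigned integers, where the mask
// arithmetic is well defined.
template <typename T>
constexpr T cmove(T x, T y, bool flag) {
    static_assert(std::is_unsigned<T>::value,
                  "this sketch only handles unsigned integer types");
    // All bits set when flag is true, all bits clear when it is false.
    const T mask = static_cast<T>(T{0} - static_cast<T>(flag));
    // Keep y where the mask is set and x where it is clear.
    return static_cast<T>((y & mask) | (x & static_cast<T>(~mask)));
}
```

Usage would be `x = cmove(x, y, flag);`. Whether the generated code is actually branchless still depends on the compiler and target, so the assembly would need to be checked.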
The other option is to do something more like what CUDA / SIMD kernels do: every line gets executed, but each instruction inside the "false branch" becomes a no-op. Of course, this requires hardware support.
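A rough scalar emulation of that idea, as a sketch only: the body of the "true branch" is computed for every lane, and a per-lane mask decides whether the result is kept. The names `Lanes`, `Mask`, `kLanes`, and `masked_double` are made up for this example, and real SIMD or GPU hardware does the masking per instruction rather than in a loop like this.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 4;  // hypothetical lane count

struct Lanes { std::uint32_t value[kLanes]; };
struct Mask  { bool active[kLanes]; };

// Emulates predicated execution of `if (active) v = v * 2;` across lanes.
// The multiplication runs for every lane; inactive lanes discard the result.
constexpr Lanes masked_double(Lanes v, Mask m) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        // All ones for an active lane, zero for an inactive one.
        const std::uint32_t mask = 0u - static_cast<std::uint32_t>(m.active[i]);
        const std::uint32_t computed = v.value[i] * 2u;  // always executed
        // Blend: take the computed value where active, keep the old one otherwise.
        v.value[i] = (computed & mask) | (v.value[i] & ~mask);
    }
    return v;
}
```

With inputs `{1, 2, 3, 4}` and the mask `{true, false, true, false}`, only the first and third lanes are doubled.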
replies(1): >>35753963 #
3. touisteur ◴[] No.35753963[source]
Pushing this a bit more, one could read about SPMD and ISPC (originally by Matt Pharr) https://pharr.org/matt/blog/2018/04/30/ispc-all

I've used it (sparingly) for vectorizable, branchy code and it's mostly been a simple process, with very efficient binaries produced (often beating hand-written intrinsics by an intermediate-level coder, themselves beating the autovectorizer).

I don't know about using it in prod on multi-generational hardware, though.