←back to thread

120 points misternugget | 1 comments | | HN request time: 0.211s | source
Show context
dmitrygr ◴[] No.42198106[source]

  >  // Parallel bit count intermediates
  >  let a = v - ((v >> 1) & (u64::MAX / 3));
  >  let b = (a & (u64::MAX / 5)) + ((a >> 2) & (u64::MAX / 5));
  >  let c = (b + (b >> 4)) & (u64::MAX / 0x11);
  >  let d = (c + (c >> 8)) & (u64::MAX / 0x101);

That "parallel bit count" is almost certainly slower than using two POPCNT instructions on a modern cpu. Should just call __builtin_popcount() and let the compiler do it the most optimal way. Luckily, people do this sort of thing so often that many modern compilers will try (and often succeed) to detect you trying this insanity and convert it to a POPCOUNT (or a pair of POPCOUNTs as the case may be here)
replies(2): >>42198289 #>>42199186 #
akoboldfrying ◴[] No.42198289[source]
Which compilers support __builtin_popcount()? From memory, it's a gcc extension. If the compiler selects a CPU POPCOUNT instruction for it, are you sure it will work on all machines that you want to run it on?

The above code is completely source- and binary-portable and reasonably fast -- certainly faster than naively looping through the bits, and within a small constant factor of a CPU POPCOUNT instruction.

replies(4): >>42198351 #>>42198414 #>>42198569 #>>42198679 #
1. aseipp ◴[] No.42198679[source]
Everything supports __builtin_popcount or some variant these days (__popcnt for MSVC). It's a complete non-issue, really.

And the compiler is not required to lower it to a single instruction. It will if the target architecture is specified appropriately, but there's nothing that says it has to explode if it can't. In fact, by doing it this way, the compiler is actually more free to generate code in a way that's optimal for the architecture in all cases, because all the implementation details are hidden e.g. loads of large constants may be avoided if the compiler is allowed to choose the exact implementation, while using the portable version may tie its hands more depending on how it's feeling on that day. Here's __builtin_popcount working just fine while targeting a ~20yo architecture without native support for SSE4.2; it can generate this code knowing what the proper instructions and schedules are: https://godbolt.org/z/ra7n5T5f3

The moral here is that the primitives are there for you to use. Just use them and save yourself and your would-be code reviewer's time.