←back to thread

120 points misternugget | 1 comments | | HN request time: 0s | source
Show context
dmitrygr ◴[] No.42198106[source]

  >  // Parallel bit count intermediates
  >  let a = v - ((v >> 1) & (u64::MAX / 3));
  >  let b = (a & (u64::MAX / 5)) + ((a >> 2) & (u64::MAX / 5));
  >  let c = (b + (b >> 4)) & (u64::MAX / 0x11);
  >  let d = (c + (c >> 8)) & (u64::MAX / 0x101);

That "parallel bit count" is almost certainly slower than using two POPCNT instructions on a modern cpu. Should just call __builtin_popcount() and let the compiler do it the most optimal way. Luckily, people do this sort of thing so often that many modern compilers will try (and often succeed) to detect you trying this insanity and convert it to a POPCOUNT (or a pair of POPCOUNTs as the case may be here)
replies(2): >>42198289 #>>42199186 #
akoboldfrying ◴[] No.42198289[source]
Which compilers support __builtin_popcount()? From memory, it's a gcc extension. If the compiler selects a CPU POPCOUNT instruction for it, are you sure it will work on all machines that you want to run it on?

The above code is completely source- and binary-portable and reasonably fast -- certainly faster than naively looping through the bits, and within a small constant factor of a CPU POPCOUNT instruction.

replies(4): >>42198351 #>>42198414 #>>42198569 #>>42198679 #
1. jandrewrogers ◴[] No.42198569[source]
Most vaguely recent compilers will convert naively looping through bits into a native POPCOUNT instruction. The parallel bit count algorithm was not reliably detected until more recently and therefore would sometimes produce unoptimized code, though current versions of gcc/clang/msvc can all detect it now.

Also, pretty much every compiler for a very long time has supported __builtin_popcount or equivalent.