
120 points by misternugget | 9 comments
dmitrygr (No.42198106)

  >  // Parallel bit count intermediates
  >  let a = v - ((v >> 1) & (u64::MAX / 3));
  >  let b = (a & (u64::MAX / 5)) + ((a >> 2) & (u64::MAX / 5));
  >  let c = (b + (b >> 4)) & (u64::MAX / 0x11);
  >  let d = (c + (c >> 8)) & (u64::MAX / 0x101);

That "parallel bit count" is almost certainly slower than using two POPCNT instructions on a modern cpu. Should just call __builtin_popcount() and let the compiler do it the most optimal way. Luckily, people do this sort of thing so often that many modern compilers will try (and often succeed) to detect you trying this insanity and convert it to a POPCOUNT (or a pair of POPCOUNTs as the case may be here)
replies(2): >>42198289, >>42199186
1. akoboldfrying (No.42198289)
Which compilers support __builtin_popcount()? From memory, it's a gcc extension. If the compiler selects a CPU POPCOUNT instruction for it, are you sure it will work on all machines that you want to run it on?

The above code is completely source- and binary-portable and reasonably fast -- certainly faster than naively looping through the bits, and within a small constant factor of a CPU POPCOUNT instruction.
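
For reference, the quoted intermediates only need a couple more folding steps to finish; a sketch of the complete portable version, assuming the same u64 input (the function name is illustrative):

  fn popcount_portable(v: u64) -> u64 {
      let a = v - ((v >> 1) & (u64::MAX / 3));                    // 2-bit sums
      let b = (a & (u64::MAX / 5)) + ((a >> 2) & (u64::MAX / 5)); // 4-bit sums
      let c = (b + (b >> 4)) & (u64::MAX / 0x11);                 // 8-bit sums
      let d = (c + (c >> 8)) & (u64::MAX / 0x101);                // 16-bit sums
      let e = (d + (d >> 16)) & (u64::MAX / 0x10001);             // 32-bit sums
      (e + (e >> 32)) & 0x7f                                      // total, 0..=64
  }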

replies(4): >>42198351, >>42198414, >>42198569, >>42198679
2. woadwarrior01 (No.42198351)
> Which compilers support __builtin_popcount()?

Clang supports __builtin_popcount() too. And MSVC has __popcnt().

3. dmitrygr (No.42198414)
Your compiler will know the best way to popcount; that is the point of that builtin. It'll use the best method, sometimes this one. GCC does this, MSVC does this, Clang does this, and I think even Rust has some way to do it (EDIT: it does: count_ones()). On architectures that lack POPCNT, it will use this method or another, based on knowing the target. On x86 this approach is OK as is. On arm64, for example, it will be suboptimal due to all the literals needed. On ARMv6-M, this method is bad and table lookups are faster.
replies(2): >>42198709, >>42199348
4. jandrewrogers (No.42198569)
Most even vaguely recent compilers will convert a naive loop over the bits into a native POPCNT instruction. The parallel bit count algorithm was not reliably detected until more recently and therefore would sometimes produce unoptimized code, though current versions of GCC/Clang/MSVC can all detect it now.
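
For illustration, the kind of naive loop that optimizers now pattern-match into a POPCNT looks roughly like this in Rust (whether it is actually recognized depends on compiler version and flags):

  fn popcount_naive(mut v: u64) -> u32 {
      let mut n = 0;
      while v != 0 {
          n += (v & 1) as u32; // add the lowest bit
          v >>= 1;             // shift it out
      }
      n
  }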

Also, pretty much every compiler for a very long time has supported __builtin_popcount or equivalent.

5. aseipp (No.42198679)
Everything supports __builtin_popcount or some variant these days (__popcnt for MSVC). It's a complete non-issue, really.

And the compiler is not required to lower it to a single instruction. It will if the target architecture is specified appropriately, but there's nothing that says it has to explode if it can't. In fact, by doing it this way, the compiler is actually freer to generate code that's optimal for the architecture in all cases, because the implementation details are hidden: for example, loads of large constants may be avoided if the compiler is allowed to choose the exact implementation, whereas the hand-written portable version may tie its hands, depending on how it's feeling that day. Here's __builtin_popcount working just fine while targeting a ~20-year-old architecture without native support for SSE4.2; the compiler can generate this code because it knows the proper instructions and schedules: https://godbolt.org/z/ra7n5T5f3

The moral here is that the primitives are there for you to use. Just use them and save both yourself and your would-be code reviewer some time.

6. SkiFire13 (No.42198709)
Note that by default rustc targets x86-64-v1 when compiling for x86-64, and that baseline lacks the POPCNT instruction. You need to raise the target CPU to at least x86-64-v2 or enable the popcnt target feature. This means that even if your CPU is relatively new and you intend to run your code only on relatively new CPUs, rustc will by default still generate the older, slower code for count_ones() using bit shifts and masks. That said, I don't see the point in writing the shifts and masks manually if the compiler can generate them for you.
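
For illustration, a sketch of how one might opt in when building with Cargo (assuming an x86-64 target; the wrapper name is hypothetical):

  // Either flag (set via RUSTFLAGS or .cargo/config.toml) enables POPCNT:
  //   RUSTFLAGS="-C target-feature=+popcnt" cargo build --release
  //   RUSTFLAGS="-C target-cpu=x86-64-v2" cargo build --release
  pub fn ones(v: u64) -> u32 {
      // With either flag: a single POPCNT. Without them, the default
      // x86-64-v1 baseline gets the shift-and-mask fallback.
      v.count_ones()
  }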
replies(1): >>42200220
7. Findecanor (No.42199348)
I once wrote that algorithm, divided into single lines, intending each line to become a single 64-bit ARM instruction. The compiler did idiom detection, transforming it into a "builtin popcount" and then (because 64-bit ARMv8.0 lacks a scalar POPCNT instruction) back into the same algorithm. Except that the emitted code was one instruction longer than mine.

64-bit ARM actually has a very peculiar encoding of immediates for its logical instructions. It supports only recurring bit patterns, such as the masks used by this algorithm. For example, "and x2, x3, #0x3333333333333333" is encoded as a single four-byte instruction.

replies(1): >>42199491
8. stassats (No.42199491)
> because 64-bit ARMv8.0 lacks a POPCNT instruction

It does have this: https://developer.arm.com/documentation/ddi0596/2021-09/SIMD...

And GCC happily uses it: https://godbolt.org/z/dTW46f9Kf

9. vlovich123 (No.42200220)
It's not unreasonable to think that Rust will raise the minimum target version at some point, and you should always override the target CPU anyway when building production binaries with C++-like toolchains (`-C target-cpu` for Rust, `-march=` for Clang/GCC).