←back to thread

1764 points fatihky | 5 comments | | HN request time: 0.656s | source
1. greyman ◴[] No.12702663[source]
I once coded chess playing algorithm for fun, and can confirm that the recruiter was correct on #9: you count bits by using a lookup table and then sum the results. It's the quickest way. But I am not sure if this is possible to figure out immediately without such experience...
replies(3): >>12702773 #>>12703527 #>>12708596 #
2. djmips ◴[] No.12702773[source]
can you really confirm that? I'd like to see your code.
3. gcp ◴[] No.12703527[source]
Ah, the appeal to authority.

It depends, but generally speaking, you are wrong and OP is right that you'd want to benchmark on the actual architecture.

a) First of all, you're probably basing your answer off of experience with 64-bit popcounts. But note the question was about popcounting multiple 16-bit words, not single 64-bit words. This isn't typically what you do in a chessprogram. b) The table has a cache footprint and can be pushed out of L1, which kills that approach in many real programs. c) Modern CPUs have a POPCOUNT instruction. It's slow and limited to one port on most Intel machines, though, so not necessarily always a win either. d) Lacking POPCOUNT, and with cache pressure, the SWAR approaches are good, especially if you can compute multiple results at once. With AVX2 it becomes especially interesting. f) If the the expectation is that many of the numbers are zero, a simple loop will win.

replies(1): >>12712776 #
4. Const-me ◴[] No.12708596[source]
I recently tested different approaches. I’ve been working on some code that downsamples large set of 1 bit voxels to get shades of gray on the edges. For that, I had to counts gigabytes of those bits as fast as possible.

Advanced manually-vectorized SIMD code worked several times faster that lookup, esp. on the hardware that supports SSSE3 or XOP instructions.

And even when the hardware doesn’t have SSE4, doesn’t have SSSE3, doesn’t have XOP — SSE2-only backup plan is still faster than lookup tables. Here’s the code: http://stackoverflow.com/a/17355341/126995

5. bogomipz ◴[] No.12712776[source]
>"Lacking POPCOUNT, and with cache pressure, the SWAR approaches are good, especially if you can compute multiple results at once"

Can you explain what a SWAR approach is?