So… would someone care to explain where I'm apparently going egregiously wrong? Because, at least on my machine with the code I spewed out quickly, the Kernighan way of counting bits is ~6x slower than a lookup table (even with the time to generate the table itself included).
The Kernighan code seems faster for very sparse arrays (i.e. only one bit set per uint16), but slower otherwise.
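For reference, here is a minimal sketch of the two approaches being compared. It is not the code that was benchmarked above; the function names and the 16-bit table size are assumptions for illustration, and Python timings won't match a compiled language. The point is only the shape of the work: the Kernighan loop runs once per set bit, so it is cheap on sparse words and gets more expensive as they fill up, while the table version is a single index no matter how many bits are set.

```python
def popcount_kernighan(x: int) -> int:
    """Count set bits by repeatedly clearing the lowest set bit."""
    count = 0
    while x:
        x &= x - 1  # clears the lowest set bit; one iteration per set bit
        count += 1
    return count


# Hypothetical 16-bit lookup table (65536 entries), built incrementally
# using popcount(i) == popcount(i >> 1) + (i & 1).
TABLE = [0] * (1 << 16)
for i in range(1, 1 << 16):
    TABLE[i] = TABLE[i >> 1] + (i & 1)


def popcount_table(x: int) -> int:
    """Count set bits of a 16-bit value with a single table lookup."""
    return TABLE[x & 0xFFFF]
```

With one bit set per uint16 the loop body runs exactly once, which fits the sparse case looking faster; with random data it runs about eight times per word on average.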
replies(1):