It depends, but generally speaking, you are wrong and OP is right that you'd want to benchmark on the actual architecture.
a) First of all, you're probably basing your answer off of experience with 64-bit popcounts. But note the question was about popcounting multiple 16-bit words, not single 64-bit words. This isn't typically what you do in a chessprogram. b) The table has a cache footprint and can be pushed out of L1, which kills that approach in many real programs. c) Modern CPUs have a POPCOUNT instruction. It's slow and limited to one port on most Intel machines, though, so not necessarily always a win either. d) Lacking POPCOUNT, and with cache pressure, the SWAR approaches are good, especially if you can compute multiple results at once. With AVX2 it becomes especially interesting. f) If the the expectation is that many of the numbers are zero, a simple loop will win.
Advanced manually-vectorized SIMD code worked several times faster that lookup, esp. on the hardware that supports SSSE3 or XOP instructions.
And even when the hardware doesn’t have SSE4, doesn’t have SSSE3, doesn’t have XOP — SSE2-only backup plan is still faster than lookup tables. Here’s the code: http://stackoverflow.com/a/17355341/126995