#9 is especially stupid because the right answer is so context-dependent. SSE4-era x86 CPUs give you a POPCNT instruction, for example, which would easily be the fastest way to do this, if available.
Which is why you ask follow-up questions instead of just giving the optimal solution for UltraSPARC and rejecting what would be the optimal solution for other CPUs.
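
To make the context-dependence concrete, here's a rough sketch in C, assuming the question is about counting set bits in a 32-bit word. The function names `popcount_portable`, `popcount32`, and `popcount_selfcheck` are made up for illustration. `__builtin_popcount` is a real GCC/Clang builtin; whether it compiles down to a single POPCNT instruction depends on the target flags (e.g. `-mpopcnt`), otherwise the compiler falls back to its own software routine.

```c
#include <assert.h>
#include <stdint.h>

/* Portable fallback: clears the lowest set bit on each pass,
   so the loop runs once per set bit. */
static unsigned popcount_portable(uint32_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;  /* drop the lowest set bit */
        n++;
    }
    return n;
}

/* Use the compiler builtin where it exists; it can become a single
   hardware instruction when the target supports one. */
static unsigned popcount32(uint32_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcount(x);
#else
    return popcount_portable(x);
#endif
}

/* Sanity check: both paths must agree on a given input. */
static void popcount_selfcheck(uint32_t x) {
    assert(popcount32(x) == popcount_portable(x));
}
```

Calling `popcount32(0xFFu)` gives 8 on either path. Which path is "optimal" is exactly the thing that depends on the CPU and compiler in front of you.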