It goes back at least to the CDC 6600, and is most often seen as part of a Hamming distance computation, pop(xor(x, y)). But it turns out to be really useful for other things too, such as counting trailing zeros, and it's worth having in hardware, since the software sequence is about a dozen instructions for 64 bits.
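To make that concrete, here is a rough sketch in C of the usual branch-free software fallback, plus the two uses mentioned above. The names `popcount64`, `hamming64`, and `ctz64` are made up for illustration, and this is one common formulation rather than the only one; it works out to about a dozen shift/mask/add operations.

```c
#include <stdint.h>

/* Software population count for 64 bits: sum bits in parallel within
 * progressively wider fields (2, 4, then 8 bits), then add the bytes. */
static inline unsigned popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);           /* 2-bit sums */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);               /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;           /* 8-bit sums */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56); /* add all bytes */
}

/* Hamming distance: the number of bit positions where x and y differ. */
static inline unsigned hamming64(uint64_t x, uint64_t y)
{
    return popcount64(x ^ y);
}

/* Trailing zero count: ~x & (x - 1) keeps only the bits below the
 * lowest set bit, so counting them gives the answer (64 when x == 0). */
static inline unsigned ctz64(uint64_t x)
{
    return popcount64(~x & (x - 1));
}
```

For example, `ctz64(8)` is 3 and `hamming64(0xFF, 0x0F)` is 4. With GCC or Clang you would normally reach for `__builtin_popcountll` instead, which can compile down to the hardware instruction when the target supports it.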