RDTSC or RDTSCP, like CPUID, certainly work very well for slowing down a program.
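However, like integer division, they clobber fixed registers that the program may want to use for other purposes: RDTSC overwrites EDX:EAX, RDTSCP also overwrites ECX, and CPUID overwrites EAX, EBX, ECX and EDX.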
For large slow-downs, when the clobbered registers do not matter, I think that CPUID is the best choice, as it serializes execution and has a long execution time on all CPUs.
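As a rough sketch of the CPUID approach, assuming GCC or Clang on x86-64: the helper name `cpuid_delay` and its iteration count are hypothetical, and the return value exists only so that a caller can check that the instruction really ran. The constraint list tells the compiler that all four registers are overwritten, so it can keep live values out of them.

```c
#include <stdint.h>

/* Hypothetical helper: execute CPUID `n` times as a coarse delay.
 * Returns the EAX result of the last CPUID (leaf 0: highest basic leaf),
 * or 0 if n == 0. Assumes GCC/Clang extended asm on x86-64. */
static inline uint32_t cpuid_delay(unsigned n)
{
    uint32_t last = 0;
    for (unsigned i = 0; i < n; i++) {
        uint32_t eax = 0;   /* input: leaf 0 */
        uint32_t ecx = 0;   /* input: sub-leaf 0 */
        uint32_t ebx, edx;  /* outputs only */
        __asm__ volatile("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
        (void)ebx;
        (void)edx;
        last = eax;
    }
    return last;
}
```

Something like `cpuid_delay(100)` would then burn a fairly large, CPU-dependent amount of time; the exact cost per CPUID has to be measured on the target machine.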
For small slow-downs I think that BSWAP is a good choice, as it modifies only a single register of your choice without affecting the flags. It is also a fairly unusual instruction, so it is unlikely ever to receive special optimizations of the kind that NOP and MOV have received.
However, multiple BSWAPs must be used, enough to occupy all available execution ports. Otherwise, if any execution port is left unoccupied by the rest of the program, the BSWAP may execute concurrently on it and add no extra time at all.
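A minimal sketch of the multiple-BSWAP idea, again assuming GCC or Clang on x86-64. The helper name `bswap_delay` and the choice of four independent registers are assumptions for illustration; the right number depends on how many ports can execute BSWAP on the target CPU. Each register is swapped twice per round, so its value is restored and only time is consumed.

```c
#include <stdint.h>

/* Hypothetical helper: `rounds` iterations of BSWAP on four independent
 * registers. Every register is byte-swapped twice per round, which
 * restores its value, so the function returns `x` unchanged.
 * Assumes GCC/Clang extended asm on x86-64. */
static inline uint64_t bswap_delay(uint64_t x, unsigned rounds)
{
    uint64_t a = x, b = x, c = x, d = x;
    for (unsigned i = 0; i < rounds; i++) {
        __asm__ volatile(
            "bswap %0\n\t"
            "bswap %1\n\t"
            "bswap %2\n\t"
            "bswap %3\n\t"
            "bswap %0\n\t"
            "bswap %1\n\t"
            "bswap %2\n\t"
            "bswap %3"
            : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
    }
    (void)b;
    (void)c;
    (void)d;
    return a;
}
```

Whether four chains are enough to saturate the ports is something I would verify with a benchmark rather than assume.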