
125 points | todsacerdoti | 1 comment
adrian_b ◴[] No.45039915[source]
In my opinion, NOP and MOV, which are recommended in TFA for slowing down, are the worst possible choices.

The authors tested a rather obsolete CPU with the 10-year-old Skylake microarchitecture. More recent Intel/AMD CPUs have special optimizations for both NOP and MOV: they are handled at the renaming stage, well before the normal execution units, so they may appear to execute in zero time.

For slowing down, one could use something really slow, like integer division. If that would interfere with the desired register usage, other reliable choices would be add with carry (ADC) or perhaps complement carry flag (CMC). If modifying the flags is not acceptable, one can use a RORX instruction for multi-bit rotation (available since Haswell, but not in older Atom CPUs), or BSWAP (available since 1989, so it exists in all 64-bit CPUs, including any Atom).
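As a rough illustration of the BSWAP option, here is a minimal sketch in C with GCC/Clang inline assembly. The helper name `bswap_delay` and the `pairs` parameter are made up for this example, and the code assumes an x86-64 target; on any other target it compiles to a no-op. Each BSWAP pair restores the original value, and because BSWAP does not touch the flags, the asm statement needs no "cc" clobber.

```c
#include <stdint.h>

/* Hypothetical helper (not from the article): run `pairs` back-to-back
   BSWAP pairs on one value. Two byte swaps restore the original value,
   so the caller's data comes back unchanged. */
static inline uint64_t bswap_delay(uint64_t v, unsigned pairs)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    for (unsigned i = 0; i < pairs; i++) {
        /* BSWAP writes only its operand register and leaves the flags
           alone, so only the operand is listed as read-write. */
        __asm__ volatile("bswap %0\n\t"
                         "bswap %0"
                         : "+r"(v));
    }
#else
    (void)pairs;  /* non-x86-64 build: no delay is inserted */
#endif
    return v;
}
```

Because every BSWAP here depends on the previous one, the chain is bound by instruction latency. How much time that actually adds depends on the CPU and on what the surrounding code is doing.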

loeg ◴[] No.45040023[source]
RDTSC(P) is pretty slow. I wonder if that would work.
1. adrian_b ◴[] No.45040260[source]
RDTSC and RDTSCP, like CPUID, certainly work very well for slowing down a program.

However, like integer division, they may clobber registers that the program wants to use for other purposes.

For large slow-downs when the clobbered registers do not matter, I think that CPUID is the best choice, as it serializes execution and has a long execution time on all CPUs.
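A minimal sketch of the CPUID approach, again in C with GCC/Clang inline assembly and assuming x86-64. The helper name `cpuid_delay` is invented for illustration. CPUID overwrites EAX, EBX, ECX and EDX, so all four are declared as operands and the compiler keeps the program's own values out of them.

```c
#include <stdint.h>

/* Hypothetical helper (not from the article): execute CPUID leaf 0
   `count` times. Returns EAX from the last execution (the highest
   supported basic leaf), or 0 if no CPUID was executed. */
static inline uint32_t cpuid_delay(unsigned count)
{
    uint32_t eax = 0, ebx = 0, ecx = 0, edx = 0;
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    for (unsigned i = 0; i < count; i++) {
        eax = 0;  /* select leaf 0 */
        ecx = 0;  /* sub-leaf 0 */
        /* CPUID overwrites all four registers; the "memory" clobber
           also stops the compiler from moving memory accesses across it. */
        __asm__ volatile("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    }
#else
    (void)count;  /* non-x86-64 build: no delay is inserted */
#endif
    (void)ebx;
    (void)edx;
    return eax;
}
```

Note that inside a virtual machine CPUID usually traps to the hypervisor, so the delay per call can be far larger there than on bare metal.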

For small slow-downs I think that BSWAP is a good choice, as it modifies only 1 arbitrary register without affecting the flags. It is also a less common instruction, so it is unlikely to ever receive special optimizations the way NOP and MOV have.

However, multiple BSWAPs must be used, so that they occupy all available execution ports. Otherwise, if any execution port is left unoccupied by the rest of the program, the BSWAP may be executed concurrently and add no extra time.
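One way to sketch this is to issue BSWAPs on several independent registers, so that they do not form a single dependency chain and can be dispatched to different ports in the same cycle. The helper name `bswap_wide` and the choice of four registers are assumptions for illustration only; the number of ports that can execute BSWAP differs between microarchitectures, so the width would need tuning per CPU. The code assumes x86-64 with GCC or Clang.

```c
#include <stdint.h>

/* Hypothetical helper (not from the article): each round byte-swaps four
   independent registers and then swaps them back, so the four values
   are unchanged when the function returns. */
static inline void bswap_wide(uint64_t v[4], unsigned rounds)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t a = v[0], b = v[1], c = v[2], d = v[3];
    for (unsigned i = 0; i < rounds; i++) {
        /* The four BSWAPs in each group have no dependencies on one
           another, so the CPU is free to execute them in parallel. */
        __asm__ volatile("bswap %0\n\t"
                         "bswap %1\n\t"
                         "bswap %2\n\t"
                         "bswap %3\n\t"
                         "bswap %0\n\t"
                         "bswap %1\n\t"
                         "bswap %2\n\t"
                         "bswap %3"
                         : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
    }
    v[0] = a; v[1] = b; v[2] = c; v[3] = d;
#else
    (void)v;
    (void)rounds;  /* non-x86-64 build: no delay is inserted */
#endif
}
```

Whether this actually costs time still depends on how busy the ports are with the rest of the program, so the effect should be measured rather than assumed.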