
125 points by todsacerdoti | 16 comments
1. adrian_b ◴[] No.45039915[source]
In my opinion, NOP and MOV, which are recommended in TFA for slowing down, are the worst possible choices.

The authors tested a rather obsolete CPU with the 10-year-old Skylake microarchitecture, but more recent Intel/AMD CPUs have special optimizations for both NOP and MOV, handling them at the register-renaming stage, well before the normal execution units, so they can appear to execute in zero time.

For slowing down, one could use something really slow, like integer division. If that interferes with the desired register usage, other reliable choices would be add-with-carry or perhaps complement-carry-flag. If modifying the flags is undesirable, one can use a RORX instruction for a multi-bit rotation (available since Haswell, but not in older Atom CPUs), or one could execute BSWAP (available since 1989, so it exists in all 64-bit CPUs, including any Atom).

replies(5): >>45040023 #>>45040643 #>>45040766 #>>45040790 #>>45042365 #
2. loeg ◴[] No.45040023[source]
RDTSC(P) is pretty slow. I wonder if that would work.
replies(1): >>45040260 #
3. adrian_b ◴[] No.45040260[source]
RDTSC or RDTSCP, like CPUID, certainly work very well for slowing down a program.

However, like integer division, they may clobber registers that the program wants to use for other purposes.

For great slow-downs when the clobbered registers do not matter, I think that CPUID is the best, as it serializes the execution and it has a long execution time on all CPUs.

For small slow-downs I think that BSWAP is a good choice, as it modifies only 1 arbitrary register without affecting the flags, and it is also an unusual enough instruction that it is unlikely ever to receive special optimizations like NOP and MOV have.

However, multiple BSWAPs must be used, enough to occupy all available execution ports; otherwise, if the rest of the program leaves any execution port free, the BSWAP may execute concurrently and add no extra time.

4. fanf2 ◴[] No.45040643[source]
They care about exact slowdown factors, and they avoid instructions that use the CPU's execution resources, because the program's real work then interferes with the slowdown factor, making it harder to control.

The NOPs in effect use up a small fraction of the instruction decode bandwidth, and if they insert enough NOPs they can reduce the number of real instructions that are issued per cycle and slow down the program with a fine degree of precision.

replies(1): >>45041830 #
5. clausecker ◴[] No.45040766[source]
Skylake has the exact same optimisations.
6. api ◴[] No.45040790[source]
There's a whole niche in cryptography called verifiable delay functions (VDFs) if you want a big rabbit hole to go down.

The idea is that these are unavoidably slow and preferably non-parallelizable to compute one way, but fast or near-instantaneous to verify. Examples include the Wesolowski VDFs based on math similar to RSA, MiMC, and the use of zero-knowledge proofs to prove that a slow computation was performed.

replies(2): >>45041737 #>>45042543 #
7. thfuran ◴[] No.45041737[source]
For this purpose, though, computation should be minimized to avoid spilling registers and causing more complicated performance side effects than just the intended direct delay.
8. adrian_b ◴[] No.45041830[source]
Many recent CPUs eliminate NOPs without executing them, so using NOP may work on one CPU but be ineffective on another.

A double BSWAP is equivalent to a NOP, but no existing CPU does any special optimization for it, and it is very unlikely that any future CPU would do something special with it, as it is mostly obsolete (nowadays MOVBE is preferred instead).

NOP cannot ensure a certain slow-down factor, except on exactly the same model of CPU.

replies(3): >>45041923 #>>45042699 #>>45042758 #
9. ◴[] No.45041923{3}[source]
10. sidewndr46 ◴[] No.45042365[source]
From what I understand, if you really want to mess with performance, throw a stray AVX, or better yet AVX-512, workload into the critical path of all cores. It could be something really simple, like expanding a normal cmov, jmp, or plain integer addition into a needlessly AVX-heavy equivalent. Depending on the CPU, this can downclock the core, change the power delivery, and potentially cause pipeline stalls on other cores on the same die.
11. 01HNNWZ0MV43FF ◴[] No.45042543[source]
Those sound like the space-hard and time-hard hashes, such as those used for password checking
replies(1): >>45043098 #
12. 201984 ◴[] No.45042699{3}[source]
Even modern CPUs still have to read the NOPs to skip them, and that's what he was talking about: decode bandwidth.
13. yorwba ◴[] No.45042758{3}[source]
> Many recent CPUs eliminate NOPs without executing them

The elimination will still take some amount of time, and the smaller this amount is, the better, because it allows dialing in the slowdown more precisely. Of course how many NOPs you need for a particular basic block needs to be measured on the CPU you're profiling on, but that's also the case if you use a more consistently slow instruction, since the other instructions won't necessarily maintain a consistent speed relative to it.

replies(1): >>45043947 #
14. api ◴[] No.45043098{3}[source]
They're a relative of that, but have stronger linearity and asymmetrical verification time guarantees AFAIK.
15. sweetjuly ◴[] No.45043947{4}[source]
That's not necessarily true. Many modern CPUs have uop caches, and it's reasonable to assume implementations will eliminate NOPs before placing lines in the uop cache. A hit on the uop cache loses any slowdown you hoped to achieve since now the NOPs are neither fetched nor decoded.
replies(1): >>45044219 #
16. tux3 ◴[] No.45044219{5}[source]
That's an interesting question, because the uop cache is filled early in the pipeline, but zero-idioms are only detected around the rename stage.

They might be able to skip a plain 0x90, but something like mov rax, rax would still emit a uop for the cache, before being eliminated later in rename. So at best it would be a fairly limited optimization.

It's also nice because rename is a very predictable choke point, no matter what the rest of the frontend and the backend are busy doing.