(snf.github.io)

51 points klaussilveira | 2 comments | 27 Aug 25 14:34 UTC

sesuximo ◴[31 Aug 25 12:36 UTC] No.45082719[source]▶

Why is the atomic version slower? Is it slower on modern x86?

loeg ◴[31 Aug 25 13:48 UTC] No.45083133[source]▶

Agner's instruction manual says "A LOCK prefix typically costs more than a hundred clock cycles," which might be dated but is directionally correct. (The atomic version is LOCK ADD.)

If you go to the CPU-specific tables, LOCK ADD has roughly 8-55 cycles of latency (Zen 3: 8, Zen 2: 20, Bulldozer: 55, lol) vs. the expected 1 cycle for regular ADD. And about 10 cycles on Intel CPUs.

So it can be starkly slower on some older AMD platforms, and merely ~10x slower on modern x86 platforms.
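
For reference, a minimal C++ sketch of the two increments being compared. The function names `count_plain` and `count_atomic` are hypothetical, not from the linked article. On x86, GCC and Clang typically lower a `fetch_add` whose result is discarded to a LOCK-prefixed add even with relaxed ordering, but that is an assumption about codegen worth checking in the disassembly, and an optimizer may fold the plain loop away entirely.

```cpp
#include <atomic>
#include <cstdint>

// Plain increment: an ordinary (non-LOCK) add, if the compiler keeps the loop at all.
std::uint64_t count_plain(std::uint64_t n) {
    std::uint64_t counter = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        counter += 1;
    }
    return counter;
}

// Atomic increment: each iteration is an atomic read-modify-write.
// Relaxed ordering is enough for a pure counter; on x86 this is still
// expected to carry the LOCK prefix.
std::uint64_t count_atomic(std::uint64_t n) {
    std::atomic<std::uint64_t> counter{0};
    for (std::uint64_t i = 0; i < n; ++i) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
    return counter.load(std::memory_order_relaxed);
}
```

Both functions return the same count; only the per-increment cost differs, which is the gap the latency tables describe.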

replies(1): >>45085100 #

1. Tuna-Fish ◴[31 Aug 25 17:36 UTC] No.45085100[source]▶

>>45083133 #

On modern CPUs atomic adds are now reasonably fast, but only when they are uncontended. If the cache line the value is on has to bounce between CPUs, that is usually +100ns (not cycles) or so.

Writing performant parallel code always means absolutely minimizing communication between threads.
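
One way to picture "minimizing communication" is the sketch below, which assumes a simple counting workload; `count_shared` and `count_local` are hypothetical names. The first version has every thread hit the same atomic on each iteration, so its cache line can bounce between cores. The second accumulates into a thread-local variable and touches the shared atomic once per thread.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Contended: every increment is an atomic RMW on one shared cache line.
std::uint64_t count_shared(unsigned threads, std::uint64_t per_thread) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&total, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                total.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}

// Mostly uncontended: each thread counts privately, then publishes once.
std::uint64_t count_local(unsigned threads, std::uint64_t per_thread) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&total, per_thread] {
            std::uint64_t local = 0;  // private to this thread, no sharing
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                local += 1;
            }
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

Both return `threads * per_thread`. The second does one shared write per thread instead of one per increment, so there is far less cache-line traffic between cores. How much time that saves depends on the machine and is not measured here.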

replies(1): >>45086736 #

2. loeg ◴[31 Aug 25 20:22 UTC] No.45086736[source]▶

>>45085100 (TP) #

Sure, but even the uncontended case is ~10x slower than regular ADD.

Shared_ptr<T>: the (not always) atomic reference counted smart pointer (2019)