Why is the atomic version slower? Is it slower on modern x86?
replies(2):
If you look at the per-CPU instruction tables, LOCK ADD has a latency of roughly 8-55 cycles on AMD (Zen 3: 8, Zen 2: 20, Bulldozer: 55, lol) vs the expected 1 cycle for a regular ADD. On Intel CPUs it's about 10 cycles.
So it can be drastically slower on some older AMD platforms (around 55x on Bulldozer), and merely ~10x slower on modern x86 platforms.
Writing performant parallel code always means absolutely minimizing communication between threads.
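
To make that concrete, here's a rough C++ sketch of the same counting job done two ways: every thread hammering one shared atomic, versus each thread counting privately and publishing its result once at the end. The function names (`count_shared_atomic`, `count_thread_local`) and parameters are made up for illustration; this is not code from the thread being discussed.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Variant 1: every increment is an atomic read-modify-write on one shared
// counter. On x86 this typically compiles to a LOCK-prefixed instruction,
// and all threads contend for the same cache line.
std::uint64_t count_shared_atomic(unsigned num_threads,
                                  std::uint64_t iters_per_thread) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&total, iters_per_thread] {
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                total.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}

// Variant 2: each thread accumulates into a private local variable and
// touches the shared counter exactly once, so there is one atomic
// operation per thread instead of one per increment.
std::uint64_t count_thread_local(unsigned num_threads,
                                 std::uint64_t iters_per_thread) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&total, iters_per_thread] {
            std::uint64_t local = 0;
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                local += 1;  // plain ADD, no cross-thread communication
            }
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

Both functions return the same total, e.g. `count_shared_atomic(4, 100000)` and `count_thread_local(4, 100000)` each give 400000. If you time them yourself, keep in mind that an optimizing compiler will likely collapse the trivial local loop in the second variant entirely, so swap in real per-item work before reading anything into the numbers.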