51 points | klaussilveira | 1 comment
sesuximo No.45082719
Why is the atomic version slower? Is it slower on modern x86?
replies(2): >>45082778 >>45083133
eptcyka No.45082778
Atomic write operations force a cache line flush and can wait until the memory is updated. Atomic reads have to be read from memory or a shared cache. Atomics are slow because memory is slow.
replies(3): >>45082805 >>45083625 >>45085604
Krssst No.45082805
I don't think an atomic operation necessarily demands a cache flush. As I understand it, L1 cache lines can move between cores as needed (maybe not on multi-socket machines?). Barriers are required if further memory ordering guarantees are needed.
replies(1): >>45082876
ot No.45082876
Not a L1/L2/... cache flush, but a store buffer flush, at least on x86. This is true for LOCK instructions. Loads/stores (again on x86) are always acquire/release, so they don't need additional fences if you don't need seq-cst. However, seq-cst atomics in C++ lower stores to LOCK XCHG, so you get a fence.
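
A minimal sketch of the difference (assuming GCC or Clang targeting x86-64; the instruction comments are the typical lowering, easy to check on Compiler Explorer):

    #include <atomic>

    std::atomic<int> x{0};

    void store_release(int v) {
        // x86 stores are already release: typically a single MOV
        x.store(v, std::memory_order_release);
    }

    void store_seq_cst(int v) {
        // seq-cst store: typically an XCHG (implicitly LOCKed),
        // which drains the store buffer
        x.store(v, std::memory_order_seq_cst);
    }
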
replies(1): >>45083031
tialaramex No.45083031
There is no way the shared_ptr<T> is using the expensive sequentially consistent atomic operations.

Even if you're one of the crazy people who thinks that's a sane default, the value of analysing and choosing a better ordering for this key type is enormous, and when you do that analysis your answer is going to be acquire-release in only a few places; in many places the relaxed atomic ordering is fine.
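
Concretely, a sketch of the usual refcount pattern (the ControlBlock type here is made up, but the orderings match the analysis above):

    #include <atomic>

    struct ControlBlock {
        std::atomic<long> refs{1};

        void inc() {
            // relaxed is enough: the caller already holds a
            // reference, so the count cannot concurrently hit zero
            refs.fetch_add(1, std::memory_order_relaxed);
        }

        void dec() {
            // acquire-release: release publishes this thread's
            // writes, acquire makes every prior owner's writes
            // visible to the thread that performs the deletion
            if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                delete this;
            }
        }
    };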

replies(3): >>45083183 >>45084954 >>45097016
ot No.45097016
I'm not sure which comment you're responding to, because I'm not talking about shared_ptr, but about how atomic operations in general are implemented on x86.

I don't believe shared_ptr uses seq-cst, and I don't have to guess: I can just look at the source code, which shows that inc ref is relaxed and dec ref is acq-rel, as they should be.

However, none of this makes a difference on x86, where atomic RMW operations all lower to the same instructions (like LOCK ADD) regardless of the requested ordering. Loads also do not care about the memory order; stores sometimes do, and that is what my comment was about.
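
For example (a sketch; the comments describe what GCC and Clang typically emit for x86-64 when the result is unused):

    #include <atomic>

    std::atomic<int> counter{0};

    // Both typically compile to the same "lock add" on x86-64: the
    // LOCK prefix already implies a full fence, so the ordering
    // argument only changes what the compiler may reorder around it.
    void add_relaxed() { counter.fetch_add(1, std::memory_order_relaxed); }
    void add_seq_cst() { counter.fetch_add(1, std::memory_order_seq_cst); }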

replies(1): >>45100273
tialaramex No.45100273
This thread is wondering why the MP (multi-threaded) shared_ptr is slower than the SP (single-threaded) shared_ptr, or, in Rust, where this distinction isn't compiler magic, why Arc is slower than Rc.

Hence the sequentially consistent ordering doesn't come into the picture.

And yeah, no, you don't get the sequentially consistent ordering for free on x86. x86 has total store order, but firstly that on its own is not quite enough to deliver sequentially consistent semantics in the machine, and secondly the compiler's barriers during optimisation are affected too. So if you insist on this ordering (which, to be clear again, you almost never should; the fact that it's the default in C++ is IMO a mistake), it does make a difference on x86.
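
To illustrate the compiler side, a minimal sketch (made-up names; the exact codegen depends on compiler and version):

    #include <atomic>

    std::atomic<bool> ready{false};
    int payload = 0;

    void publish_relaxed() {
        payload = 7;
        // relaxed: the compiler may reorder or sink the payload store
        // around this atomic store, even though x86's TSO would never
        // reorder the two stores in hardware
        ready.store(true, std::memory_order_relaxed);
    }

    void publish_seq_cst() {
        payload = 7;
        // seq-cst: a full compiler barrier, and the store itself
        // typically becomes an XCHG on x86 rather than a plain MOV
        ready.store(true, std::memory_order_seq_cst);
    }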