Shared_ptr<T>: the (not always) atomic reference counted smart pointer (2019)

(snf.github.io)

51 points klaussilveira | 4 comments | 27 Aug 25 14:34 UTC

sesuximo ◴[31 Aug 25 12:36 UTC] No.45082719[source]▶

>>45040290 (OP) #

Why is the atomic version slower? Is it slower on modern x86?

replies(2): >>45082778 #>>45083133 #

eptcyka ◴[31 Aug 25 12:50 UTC] No.45082778[source]▶

>>45082719 #

Atomic write operations force a cache line flush and can wait until the memory is updated. Atomic reads have to be read from memory or a shared cache. Atomics are slow because memory is slow.

replies(3): >>45082805 #>>45083625 #>>45085604 #

Krssst ◴[31 Aug 25 12:57 UTC] No.45082805[source]▶

>>45082778 #

I don't think an atomic operation necessarily demands a cache flush. L1 cache lines can move across cores as needed in my understanding (maybe not on multi-socket machines?). Barriers are required if further memory ordering guarantees are needed.
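
A minimal C++ sketch of the last point, assuming a hypothetical `payload`/`ready` pair that is not taken from the article: the atomic flag itself uses relaxed ordering, and explicit fences add the ordering guarantee needed to publish the non-atomic data. The functions are meant to run on two different threads; `demo()` only checks the single-threaded behaviour.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical names, for illustration only.
int payload = 0;                      // ordinary, non-atomic data
std::atomic<bool> ready{false};       // flag that publishes the data

// Writer side: the release fence orders the payload write before the flag store.
void publish(int v) {
    payload = v;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

// Reader side: the acquire fence orders the flag load before the payload read.
// Returns false if nothing has been published yet.
bool try_consume(int& out) {
    if (!ready.load(std::memory_order_relaxed)) return false;
    std::atomic_thread_fence(std::memory_order_acquire);
    out = payload;
    return true;
}

// Single-threaded sanity check of the two functions.
void demo() {
    int out = 0;
    assert(!try_consume(out));
    publish(42);
    assert(try_consume(out));
    assert(out == 42);
}
```

Without the two fences the relaxed flag would still be atomic, but the C++ memory model would not guarantee that a reader who sees `ready == true` also sees the new `payload`.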

replies(1): >>45082876 #

ot ◴[31 Aug 25 13:07 UTC] No.45082876[source]▶

>>45082805 #

Not an L1/L2/... cache flush, but a store buffer flush, at least on x86. This is true for LOCK instructions. Loads/stores (again on x86) are always acquire/release, so they don't need additional fences if you don't need seq-cst. However, seq-cst atomics in C++ lower stores to LOCK XCHG, so you get a fence.
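
A small sketch of that difference in C++, with a hypothetical variable `value`. The instruction names in the comments describe what mainstream x86-64 compilers typically emit, not something the standard guarantees; check the generated assembly for your own toolchain.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared variable, named for illustration only.
std::atomic<int> value{0};

// Release store: on x86-64 this is typically an ordinary MOV.
void store_release(int v) { value.store(v, std::memory_order_release); }

// Seq-cst store: typically XCHG (implicitly locked) or MOV followed by MFENCE,
// so the store buffer is drained before later loads execute.
void store_seq_cst(int v) { value.store(v, std::memory_order_seq_cst); }

// Acquire load: typically an ordinary MOV on x86-64.
int load_acquire() { return value.load(std::memory_order_acquire); }

// Single-threaded check that both kinds of store are visible to a later load.
void demo() {
    store_release(1);
    assert(load_acquire() == 1);
    store_seq_cst(2);
    assert(load_acquire() == 2);
}
```

Both stores have the same observable result in a single thread; the difference only shows up in cost and in what other threads are allowed to observe.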

replies(1): >>45083031 #

tialaramex ◴[31 Aug 25 13:30 UTC] No.45083031[source]▶

>>45082876 #

There is no way the shared_ptr<T> is using the expensive sequentially consistent atomic operations.

Even if you're one of the crazy people who thinks that's the sane default, the value of analysing and choosing a better ordering rule for this key type is enormous. When you do that analysis, your answer is going to be acquire-release, and only for some edge cases; in many places the relaxed atomic ordering is fine.
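
For illustration, a minimal reference count along those lines. `RefCount` is a hypothetical type, not the control block of libstdc++ or libc++; it only sketches the common pattern of a relaxed increment and an acquire-release decrement.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical minimal reference count; real shared_ptr control blocks
// also track weak references and the deleter.
struct RefCount {
    std::atomic<std::size_t> count{1};

    // Taking another reference needs no ordering: the caller already
    // holds a reference, so the object cannot disappear underneath it.
    void acquire() noexcept {
        count.fetch_add(1, std::memory_order_relaxed);
    }

    // Dropping a reference. Release makes this thread's writes to the
    // object visible; acquire lets the thread that drops the last
    // reference see everyone else's writes before destroying the object.
    // Returns true when the caller must destroy the object.
    bool release() noexcept {
        std::size_t prev = count.fetch_sub(1, std::memory_order_acq_rel);
        assert(prev > 0 && "release() called more often than acquire()");
        return prev == 1;
    }
};
```

A caller would pair every copy with `acquire()` and every destruction with `release()`, deleting the managed object only when `release()` returns true.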

replies(3): >>45083183 #>>45084954 #>>45097016 #

ibraheemdev ◴[31 Aug 25 17:20 UTC] No.45084954[source]▶

>>45083031 #

> There is no way the shared_ptr<T> is using the expensive sequentially consistent atomic operations.

All RMW operations have sequentially consistent semantics on x86.

It's not exactly a store buffer flush, but any subsequent loads in the pipeline will stall until the store has completed.

replies(1): >>45087519 #

Kranar ◴[31 Aug 25 22:03 UTC] No.45087519[source]▶

>>45084954 #

It's a common misconception to reason about memory models strictly in terms of hardware.

Sequential consistency is a property of a programming language's semantics and cannot simply be inferred from hardware. It is possible for hardware operations to all be SC but for the compiler to still provide weaker memory orderings through compiler-specific optimizations.

replies(1): >>45087574 #

ibraheemdev ◴[31 Aug 25 22:11 UTC] No.45087574{3}[source]▶

>>45087519 #

I'm referring to the performance implications of the hardware instruction, not the programming language semantics. Incrementing or decrementing the reference count is going to require an RMW instruction, which is expensive on x86 regardless of the ordering.

replies(1): >>45087849 #

1. Kranar ◴[31 Aug 25 22:59 UTC] No.45087849{4}[source]▶

>>45087574 #

The concept of sequential consistency only exists within the context of a programming language's memory model. It makes no sense to speak about the performance of sequentially consistent operations without respect to the semantics of a programming language.

replies(1): >>45088148 #

2. ibraheemdev ◴[31 Aug 25 23:50 UTC] No.45088148[source]▶

>>45087849 (TP) #

Yes, what I meant was that the same instruction is generated by the compiler regardless of whether the RMW operation is performed with relaxed or sequentially consistent ordering, because that instruction is strong enough in terms of hardware semantics to enforce C++'s definition of sequential consistency.

There is a pretty clear mapping from C++ atomic operations to hardware instructions, and while the C++ memory model is not defined in terms of instruction reordering, that mapping is still useful when talking about performance. Sequential consistency is also a pretty broadly accepted concept outside of the C++ memory model; I think you're being a little too nitpicky on terminology.
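
A sketch of that mapping, using hypothetical counters `refs` and `plain_refs`. The instruction names in the comments are what GCC and Clang typically emit for x86-64 and are an observation about those compilers, not a guarantee of the language.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical counters, named for illustration only.
std::atomic<long> refs{0};
long plain_refs = 0;

// Relaxed RMW: on x86-64 typically `lock xadd` (or `lock add` if the
// result is unused).
long inc_relaxed() { return refs.fetch_add(1, std::memory_order_relaxed); }

// Seq-cst RMW: typically the very same locked instruction on x86-64.
long inc_seq_cst() { return refs.fetch_add(1, std::memory_order_seq_cst); }

// Non-atomic increment: an ordinary add with no LOCK prefix. Only valid
// when no other thread touches the counter concurrently.
long inc_plain() { return plain_refs++; }

// Single-threaded check: each function returns the previous value.
void demo() {
    assert(inc_relaxed() == 0);
    assert(inc_seq_cst() == 1);
    assert(inc_plain() == 0);
}
```

On other architectures, such as ARM or POWER, the relaxed and seq-cst versions usually compile to different instruction sequences, so the choice of ordering matters more there.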

replies(1): >>45088566 #

3. Kranar ◴[01 Sep 25 01:17 UTC] No.45088566[source]▶

>>45088148 #

The presentation you are making is both incorrect and highly misleading.

There are algorithms whose correctness depends on sequential consistency which cannot be implemented on x86 without explicit barriers, for example Dekker's algorithm.

What x86 does provide is TSO semantics, not sequential consistency.
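
The core of the Dekker problem can be sketched as a store followed by a load of a different variable. The flags `want_a` and `want_b` are hypothetical names; each function is meant to run on its own thread, and `demo()` only exercises one sequential interleaving.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical flags, one per thread, named for illustration only.
std::atomic<bool> want_a{false};
std::atomic<bool> want_b{false};

// Thread A: raise own flag, then check the other side's flag.
// Returns true if A may enter the critical section.
bool try_enter_a() {
    want_a.store(true, std::memory_order_seq_cst);
    return !want_b.load(std::memory_order_seq_cst);
}

// Thread B: the mirror image of thread A.
bool try_enter_b() {
    want_b.store(true, std::memory_order_seq_cst);
    return !want_a.load(std::memory_order_seq_cst);
}

// One sequential interleaving: A goes first and wins, B must back off.
void demo() {
    bool a = try_enter_a();
    bool b = try_enter_b();
    assert(a && !b);
}
```

With seq-cst on all four operations, the two functions can never both return true. If the stores were release and the loads acquire, which are plain MOVs on x86, TSO would allow each store to wait in its core's store buffer while the following load executes, so both sides could see the other flag as false.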

replies(1): >>45088632 #

4. ibraheemdev ◴[01 Sep 25 01:31 UTC] No.45088632{3}[source]▶

>>45088566 #

I did not claim that x86 provides sequential consistency in general; I made that claim only for RMW operations. Sequentially consistent stores are typically lowered to an XCHG instruction on x86 without an explicit barrier.

From the Intel SDM:

> Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).
