Sure, atomic reads need to at least check against a shared cache; but atomic writes only have to ensure the write is visible to any other atomic read by the time the atomic write completes. And some less-strict memory orderings/consistency models have somewhat weaker requirements, e.g. needing an explicit write fence (or a flush of the core's write-back buffering) before those writes are guaranteed to hit globally visible (shared) cache.
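To make the "explicit fence" part concrete, here's a minimal C++ sketch (the variable names and the spin-wait are mine, purely for illustration): the relaxed stores are each individually visible as atomic writes, and it's the explicit fences that order the payload write before the flag write.

    #include <atomic>
    #include <thread>

    std::atomic<int>  payload{0};
    std::atomic<bool> ready{false};

    void producer() {
        payload.store(42, std::memory_order_relaxed);   // plain atomic write
        // Under a relaxed ordering, the explicit fence is what guarantees the
        // payload write becomes globally visible before the flag write does:
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        while (!ready.load(std::memory_order_relaxed)) { }  // spin until the write shows up
        std::atomic_thread_fence(std::memory_order_acquire);
        int v = payload.load(std::memory_order_relaxed);    // now guaranteed to read 42
        (void)v;
    }

    int main() {
        std::thread a(producer), b(consumer);
        a.join();
        b.join();
    }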
Essentially nothing except DMA/PCIe accesses will skip checking the shared/global cache for a read hit before going to the underlying memory, at least on any system (more specifically, CPU) you'd (want to) run modern Linux on.
There are non-temporal memory accesses, where reads don't leave a trace in the cache and writes only use a limited amount of buffering for some modest write-back "early (reported) completion"/throughput-smoothing effect, as well as some special-purpose memory access types.
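As a rough sketch of what the non-temporal write path looks like on x86 (SSE2 intrinsics; the function, its alignment/size requirements, and the names are mine, for illustration only):

    #include <immintrin.h>
    #include <cstddef>

    // Copy n bytes (n a multiple of 16, dst 16-byte aligned) without
    // allocating cache lines for the destination.
    void stream_copy(void* dst, const void* src, std::size_t n) {
        auto*       d = static_cast<__m128i*>(dst);
        const auto* s = static_cast<const __m128i*>(src);
        for (std::size_t i = 0; i < n / 16; ++i) {
            // _mm_stream_si128 is a non-temporal store: it goes through the
            // core's write-combining buffers instead of the normal cache path.
            _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
        }
        // Make the streamed stores globally visible before anyone reads dst.
        _mm_sfence();
    }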
For example, on x86, "write combining":
it's a special mapping mode, set as such in the page table entry responsible for that virtual address, where writes go through a small write-combining buffer (typically some single-digit number of cache lines) local to the core, used as a write-back cache. That way, small writes from a loop (like, for example, translating a CPU-side pixel buffer to a GPU pixel encoding while writing through a PCIe mapping into VRAM) can accumulate into full cache lines, which eliminates any need for read-before-write transfers of those cache lines and generally makes the write-back transfers more efficient for cases where you go through PCIe/InfiniBand/RoCE (and can benefit from typically up to 64 cache lines being bundled together to reduce packet/header overhead).
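A hedged sketch of that pixel-translation pattern (the RGBA->BGRA swizzle, the names, and the assumption that vram already points at a write-combining mapping of a PCIe BAR, set up elsewhere by a driver, are all mine):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    // Hypothetical swizzle: CPU-side RGBA8 -> GPU-side BGRA8 (swap R and B).
    static inline std::uint32_t rgba_to_bgra(std::uint32_t p) {
        return (p & 0xFF00FF00u) | ((p & 0x00FF0000u) >> 16) | ((p & 0x000000FFu) << 16);
    }

    // vram is assumed to be a write-combining mapping into VRAM; the pixels are
    // written strictly sequentially so the WC buffer can merge the small stores
    // into full 64-byte lines before they leave the core.
    void upload_pixels(volatile std::uint32_t* vram,
                       const std::uint32_t* cpu, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i)
            vram[i] = rgba_to_bgra(cpu[i]);  // small plain stores, combined by WC
        _mm_sfence();                        // flush/order the combined writes
    }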
What is slow, though, at least on some contemporary relevant architectures like Zen3 (naming that one only because I had checked it in some detail), are single-thread-originated random reads that defeat the L2 cache's prefetcher, especially if they don't hit any DRAM page twice. The L1D cache only has a fairly limited number of asynchronous cache-miss handlers (for Zen1 [0] and Zen2 [1] I could find mention of 22), and with random DRAM read latency around 50~100 ns (assuming you use 1G ("gigantic") pages and stay within the 32G of DRAM that the 32 entries of the L1 TLB can therefore cover, and especially once some concurrency causes minor congestion at the DDR4 interface), that drops request inverse throughput to around 5 ns/cacheline, i.e. 12.8 GB/s. That's only a fraction of the 51.2 GB/s per CCD (core complex die; a 5950X has 2 of those plus a northbridge/IO die, and it's the link to the northbridge that's limiting here) that caps streaming reads on, e.g., spec-conforming DDR4-3200 with a mainstream Zen3 desktop processor like a "Ryzen 9 5900". (Technically it'd be around 2% lower, because you'd have to either 100% fill the DDR4 data interface (not quite possible in practice) or add some reads through PCIe, which attaches to the northbridge's central data hub, and that hub doesn't seem to have any throughput limits beyond those of its access ports themselves.)
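Back-of-the-envelope arithmetic behind those two numbers (the 22 miss handlers and the ~100 ns latency are the figures cited above; the rest just follows from them):

    #include <cstdio>

    int main() {
        // Figures cited above (miss handlers per 7-cpu.com for Zen1/Zen2;
        // latency is the rough random-access DRAM estimate).
        constexpr double miss_handlers   = 22.0;   // outstanding L1D misses
        constexpr double dram_latency_ns = 100.0;  // random read latency
        constexpr double line_bytes      = 64.0;

        // With 22 misses in flight, one line completes every 100/22 = ~4.5 ns;
        // the text above rounds that to ~5 ns/line, i.e. 64 B / 5 ns = 12.8 GB/s.
        double ns_per_line = dram_latency_ns / miss_handlers;
        double random_gbps = line_bytes / ns_per_line;

        // Streaming limit: dual-channel DDR4-3200 = 3.2 GT/s * 8 B * 2 channels,
        // matching the ~51.2 GB/s per-CCD read limit mentioned above.
        double stream_gbps = 3.2 * 8.0 * 2.0;

        std::printf("random-read ceiling: ~%.1f GB/s\n", random_gbps);
        std::printf("streaming ceiling:   ~%.1f GB/s\n", stream_gbps);
    }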
[0]: https://www.7-cpu.com/cpu/Zen.html
[1]: https://www.7-cpu.com/cpu/Zen2.html