Even if you're one of the crazy people who thinks that's the sane default, the value from analysing and choosing a better ordering rule for this key type is enormous and when you do that analysis your answer is going to be acquire-release and only for some edge cases, in many places the relaxed atomic ordering is fine.
Why would shared_ptr refcounting need anything other than relaxed? Acq/rel are for implementing multi-variable atomic protocols, and shared_ptr refcounting simply doesn't have other variables.
All RMW operations have sequentially consistent semantics on x86.
It's not exactly a store buffer flush, but any subsequent loads in the pipeline will stall until the store has completed.
Sequential consistency is a property of a programming language's semantics and can not simply be inferred from hardware. It is possible for hardware operations to all be SC but for the compiler to still provide weaker memory orderings through compiler specific optimizations.
There is a pretty clear mapping in terms of C++ atomic operations to hardware instructions, and while the C++ memory model is not defined in terms of instruction reordering, that mapping is still useful to talk about performance. Sequential consistency is also a pretty broadly accepted concept outside of the C++ memory model, I think you're being a little too nitpicky on terminology.
There are algorithms whose correctness depends on sequential consistency which can not be implemented in x86 without explicit barriers, for example Dekker's algorithm.
What x86 does provide is TSO semantics, not sequential consistency.
From the Intel SDM:
> Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).
I don't believe that shared_ptr uses seq-cst because I can just look at the source code, and I know that inc ref is relaxed and dec ref is acq-rel, as they should be.
However, none of this makes a difference on x86, where RMW atomic operations all lower to the same instructions (like LOCK ADD). Loads also do not care about memory order, and stores sometimes do, and that was what my comment was about.
So hence the sequentially consistent ordering doesn't come into the picture.
And yeah, no, you don't get the sequentially consistent ordering for free on x86. x86 has the total store order, but firstly that's not quite enough to deliver sequentially consistent semantics in the machine on its own and then also the compiler has barriers during optimisation and those are impacted too. So if you insist on this ordering (which to be clear again you almost never should, the fact it's the default in C++ is IMO a mistake) it does make a difference on x86.