Sure, atomic reads need to at least check against a shared cache; but atomic writes only have to ensure the write is visible to any other atomic read by the time the atomic write completes. And some less-strict memory orderings/consistency models have somewhat weaker requirements, e.g. needing an explicit write fence (or a flush of the core's write-back buffering) before those writes are guaranteed to hit globally visible (shared) cache.
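To make the "explicit fence" part concrete, here's a minimal C++ sketch (the variable names and the spin-wait are mine, purely for illustration): the relaxed stores are each individually visible as atomic writes, and it's the explicit fences that order the payload write before the flag write.

    #include <atomic>
    #include <thread>

    std::atomic<int>  payload{0};
    std::atomic<bool> ready{false};

    void producer() {
        payload.store(42, std::memory_order_relaxed);   // plain atomic write
        // Under a relaxed ordering, the explicit fence is what guarantees the
        // payload write becomes globally visible before the flag write does:
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        while (!ready.load(std::memory_order_relaxed)) { }  // spin until the write shows up
        std::atomic_thread_fence(std::memory_order_acquire);
        int v = payload.load(std::memory_order_relaxed);    // now guaranteed to read 42
        (void)v;
    }

    int main() {
        std::thread a(producer), b(consumer);
        a.join();
        b.join();
    }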
Essentially nothing except DMA/PCIe accesses will skip checking the shared/global cache for a read hit before going to the underlying memory, at least on any system (more specifically, CPU) you'd (want to) run modern Linux on.
There are non-temporal memory accesses, where reads don't leave a trace in the cache and writes only use a limited amount of buffering for some modest write-back "early (reported) completion"/throughput-smoothing effect, as well as some special-purpose memory access types.
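As a rough sketch of what the non-temporal write path looks like on x86 (SSE2 intrinsics; the function, its alignment/size requirements, and the names are mine, for illustration only):

    #include <immintrin.h>
    #include <cstddef>

    // Copy n bytes (n a multiple of 16, dst 16-byte aligned) without
    // allocating cache lines for the destination.
    void stream_copy(void* dst, const void* src, std::size_t n) {
        auto*       d = static_cast<__m128i*>(dst);
        const auto* s = static_cast<const __m128i*>(src);
        for (std::size_t i = 0; i < n / 16; ++i) {
            // _mm_stream_si128 is a non-temporal store: it goes through the
            // core's write-combining buffers instead of the normal cache path.
            _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
        }
        // Make the streamed stores globally visible before anyone reads dst.
        _mm_sfence();
    }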
For example, on x86, "write combining":
it's a special mapping mode, set as such in the page table entry responsible for that virtual address, where writes go through a small write-combining buffer (typically some single-digit number of cache lines) local to the core, used as a write-back cache. That way, small writes from a loop (like, for example, translating a CPU-side pixel buffer to a GPU pixel encoding while writing through a PCIe mapping into VRAM) can accumulate into full cache lines, which eliminates any need for read-before-write transfers of those cache lines and generally makes the write-back transfers more efficient for cases where you go through PCIe/InfiniBand/RoCE (and can benefit from typically up to 64 cache lines being bundled together to reduce packet/header overhead).
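A hedged sketch of that pixel-translation pattern (the RGBA->BGRA swizzle, the names, and the assumption that vram already points at a write-combining mapping of a PCIe BAR, set up elsewhere by a driver, are all mine):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    // Hypothetical swizzle: CPU-side RGBA8 -> GPU-side BGRA8 (swap R and B).
    static inline std::uint32_t rgba_to_bgra(std::uint32_t p) {
        return (p & 0xFF00FF00u) | ((p & 0x00FF0000u) >> 16) | ((p & 0x000000FFu) << 16);
    }

    // vram is assumed to be a write-combining mapping into VRAM; the pixels are
    // written strictly sequentially so the WC buffer can merge the small stores
    // into full 64-byte lines before they leave the core.
    void upload_pixels(volatile std::uint32_t* vram,
                       const std::uint32_t* cpu, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i)
            vram[i] = rgba_to_bgra(cpu[i]);  // small plain stores, combined by WC
        _mm_sfence();                        // flush/order the combined writes
    }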
What is slow, though, at least on some contemporary relevant architectures like Zen3 (naming that one only because I had checked it in some detail), are single-thread-originated random reads that defeat the L2 cache's prefetcher, especially if they don't hit any DRAM page twice. The L1D cache only has a fairly limited number of asynchronous cache-miss handlers (for Zen1 [0] and Zen2 [1] I could find mention of 22), and with random DRAM read latency around 50~100 ns (assuming you use 1G ("gigantic") pages and stay within the 32G of DRAM that the 32 entries of the L1 TLB can therefore cover, and especially once some concurrency causes minor congestion at the DDR4 interface), that drops request inverse throughput to around 5 ns/cacheline, i.e. 12.8 GB/s. That's only a fraction of the 51.2 GB/s per CCD (core complex die; a 5950X has 2 of those plus a northbridge/IO die, and it's the link to the northbridge that's limiting here) that caps streaming reads on, e.g., spec-conforming DDR4-3200 with a mainstream Zen3 desktop processor like a "Ryzen 9 5900". (Technically it'd be around 2% lower, because you'd have to either 100% fill the DDR4 data interface (not quite possible in practice) or add some reads through PCIe, which attaches to the northbridge's central data hub, and that hub doesn't seem to have any throughput limits beyond those of its access ports themselves.)
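Back-of-the-envelope arithmetic behind those two numbers (the 22 miss handlers and the ~100 ns latency are the figures cited above; the rest just follows from them):

    #include <cstdio>

    int main() {
        // Figures cited above (miss handlers per 7-cpu.com for Zen1/Zen2;
        // latency is the rough random-access DRAM estimate).
        constexpr double miss_handlers   = 22.0;   // outstanding L1D misses
        constexpr double dram_latency_ns = 100.0;  // random read latency
        constexpr double line_bytes      = 64.0;

        // With 22 misses in flight, one line completes every 100/22 = ~4.5 ns;
        // the text above rounds that to ~5 ns/line, i.e. 64 B / 5 ns = 12.8 GB/s.
        double ns_per_line = dram_latency_ns / miss_handlers;
        double random_gbps = line_bytes / ns_per_line;

        // Streaming limit: dual-channel DDR4-3200 = 3.2 GT/s * 8 B * 2 channels,
        // matching the ~51.2 GB/s per-CCD read limit mentioned above.
        double stream_gbps = 3.2 * 8.0 * 2.0;

        std::printf("random-read ceiling: ~%.1f GB/s\n", random_gbps);
        std::printf("streaming ceiling:   ~%.1f GB/s\n", stream_gbps);
    }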
[0]: https://www.7-cpu.com/cpu/Zen.html
[1]: https://www.7-cpu.com/cpu/Zen2.html