

SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points by greghn | 20 comments
pclmulqdq ◴[] No.39443994[source]
This was a huge technical problem I worked on at Google, and it's sort of fundamental to the cloud. I believe this is actually a big deal that drives people's technology directions.

SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which was the backing technology when a lot of these network attached storage systems were invented, because they are fundamentally slow compared to networks, but it is a problem for SSD.

replies(30): >>39444009 #>>39444024 #>>39444028 #>>39444046 #>>39444062 #>>39444085 #>>39444096 #>>39444099 #>>39444120 #>>39444138 #>>39444328 #>>39444374 #>>39444396 #>>39444429 #>>39444655 #>>39444952 #>>39445035 #>>39445917 #>>39446161 #>>39446248 #>>39447169 #>>39447467 #>>39449080 #>>39449287 #>>39449377 #>>39449994 #>>39450169 #>>39450172 #>>39451330 #>>39466088 #
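
A back-of-the-envelope sketch of the point above, using illustrative figures rather than anything measured in this thread (the latency constants below are assumptions, not vendor numbers):

    # Rough latency budget for a single 4 KiB read; all numbers are illustrative.
    LOCAL_NVME_READ_US = 80           # a flash media read: tens of microseconds
    NETWORK_RTT_US = 500              # the classic intra-datacenter round trip
    STORAGE_SERVER_OVERHEAD_US = 50   # queueing + software on the remote storage node

    local_us = LOCAL_NVME_READ_US
    remote_us = NETWORK_RTT_US + STORAGE_SERVER_OVERHEAD_US + LOCAL_NVME_READ_US

    print(f"local SSD read:        ~{local_us} µs")
    print(f"network-attached read: ~{remote_us} µs ({remote_us / local_us:.1f}x slower)")
    # Against a ~10,000 µs HDD seek the same network hop is ~5% overhead, which is
    # why this didn't hurt when these storage systems were designed around disks.
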
1. scottlamb ◴[] No.39444952[source]
> The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which was the backing technology when a lot of these network attached storage systems were invented, because they are fundamentally slow compared to networks, but it is a problem for SSD.

Certainly true that SSD bandwidth and latency improvements are hard to match, but I don't understand why intra-datacenter network latency in particular is so bad. This circa-2020 version of "Latency Numbers Everyone Should Know" says 0.5 ms round trip (and mentions a "10 Gbps network" on another line). [1] A 2012 version said the same thing (while only mentioning a "1 Gbps network"). [2] Why no improvement? That 2020 version might have been a bit conservative on this line, and nice datacenters may even have multiple 100 Gbit/sec NICs per machine in 2024, but the round trip still seems strangely bad.

I've seen experimental networking stuff (e.g. RDMA) that claims significantly better latency, so I don't think it's a physical limitation of the networking gear but rather something in the machine/OS interaction area. I would design large distributed systems significantly differently (and be much more excited about extra tiers in my stack) if the standard RPC system offered, say, 10 µs typical round-trip latency.

[1] https://static.googleusercontent.com/media/sre.google/en//st...

[2] https://gist.github.com/jboner/2841832
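
For reference, a minimal sketch of the kind of measurement behind round-trip numbers like these: a one-byte TCP ping-pong against an echo service, with Nagle disabled. The host, port, and sample count are placeholders, and this exercises plain kernel TCP plus one user-space hop rather than any full RPC stack.

    # One-byte TCP ping-pong; run any simple echo server on the remote side.
    import socket, time

    HOST, PORT, SAMPLES = "10.0.0.2", 9000, 10_000   # placeholders

    s = socket.create_connection((HOST, PORT))
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # don't batch tiny writes

    rtts_us = []
    for _ in range(SAMPLES):
        t0 = time.perf_counter()
        s.sendall(b"x")
        s.recv(1)                                  # wait for the echoed byte
        rtts_us.append((time.perf_counter() - t0) * 1e6)

    rtts_us.sort()
    print(f"p50 {rtts_us[len(rtts_us) // 2]:.1f} µs   "
          f"p99 {rtts_us[int(SAMPLES * 0.99)]:.1f} µs")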

replies(3): >>39445409 #>>39445433 #>>39446206 #
2. dekhn ◴[] No.39445409[source]
Modern data center networks don't have full cross connectivity. Instead they are built as graphs and hierarchies that provide less than the total bandwidth required for all pairs of hosts to communicate at once. This means that as workloads grow and large numbers of compute hosts demand data I/O to and from storage hosts, the network eventually gets congested, which typically shows up as higher latencies and more dropped packets. Batch jobs are often relegated to "spare" bandwidth, while serving jobs often get dedicated bandwidth.

At the same time, Ethernet networks with layered protocols on top typically have a fair amount of latency overhead, which makes them much slower than bus-based, direct-host-attached storage. I was definitely impressed at how quickly SSDs reached and then exceeded SATA bandwidth. NVMe has made a HUGE difference here.
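
To make the oversubscription point concrete, a small sketch with made-up numbers for a single leaf/ToR switch (none of these figures describe any real fabric):

    # Oversubscription at one leaf switch; all numbers are illustrative.
    hosts_per_rack = 40
    host_nic_gbps = 100        # bandwidth each host can push toward the switch
    uplinks = 8
    uplink_gbps = 400          # bandwidth per uplink toward the spine layer

    downlink_capacity = hosts_per_rack * host_nic_gbps   # 4000 Gbps
    uplink_capacity = uplinks * uplink_gbps              # 3200 Gbps

    print(f"oversubscription {downlink_capacity / uplink_capacity:.2f}:1")
    # Anything above 1:1 means that when every host talks off-rack at line rate,
    # queues build at the uplinks -- which shows up as latency and dropped packets.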

3. kccqzy ◴[] No.39445433[source]
That document is probably deliberately on the pessimistic side to encourage your code to be portable across all kinds of "data centers" (however that is defined). When I previously worked at Google, the standard RPC system definitely offered 50 microseconds of round trip latency at the median (I measured it myself in a real application), and their advanced user-space implementation called Snap could offer about 10 microseconds of round trip latency. The latter figure comes from page 9 of https://storage.googleapis.com/gweb-research2023-media/pubto...

> nice datacenters may even have multiple 100 Gbit/sec NICs per machine in 2024,

Google exceeded 100Gbps per machine long before 2024. IIRC it had been 400Gbps for a while.

replies(3): >>39445578 #>>39445672 #>>39452641 #
4. scottlamb ◴[] No.39445578[source]
Interesting. I worked at Google until January 2021. I see 2019 dates on that PDF, but I wasn't aware of Snap when I left. There was some alternate RPC approach (Pony Express, maybe? I get the names mixed up) that claimed 10 µs or so, but it was advertised as experimental (IIRC it had some bad failure modes in practice at the time) and was simply unavailable in many of the datacenters I needed to deploy in. Maybe they're two names for the same thing. [edit: oh, yes, starting to actually read the paper now, and: "Through Snap, we created a new communication stack called Pony Express that implements a custom reliable transport and communications API."]

Actual latency with standard Stubby-over-TCP and warmed channels...it's been a while, so I don't remember the number I observed, but I remember it wasn't that much better than 0.5 ms. It was still bad enough that I didn't want to add a tier that would have helped with isolation in a particularly high-reliability system.

replies(1): >>39446740 #
5. Szpadel ◴[] No.39445672[source]
With such speeds, and with CXL gaining traction (think RAM and GPUs over the network), why is network-attached SSD still an issue? You could have, say, one storage server per rack that serves storage only for that particular rack.

You could easily get something like 40 GB/s with some overprovisioning / bucketing.
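
A rough sanity check on that per-rack figure, with assumed hardware (the NIC counts, speeds, and host count below are placeholders, not a reference design):

    # Aggregate bandwidth one in-rack storage server could serve; numbers assumed.
    storage_nics = 4
    nic_gbps = 100
    hosts_in_rack = 40

    aggregate_gb_per_s = storage_nics * nic_gbps / 8         # ~50 GB/s total
    per_host_gb_per_s = aggregate_gb_per_s / hosts_in_rack   # everyone reading at once

    print(f"aggregate ~{aggregate_gb_per_s:.0f} GB/s, "
          f"~{per_host_gb_per_s:.2f} GB/s per host under full contention")
    # A single local NVMe drive can manage on the order of 7 GB/s, so the shared
    # rack-local model relies on most hosts not hammering storage simultaneously.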

6. KaiserPro ◴[] No.39446206[source]
Networks are not reliable, despite what you hear, so latency is used to mask retries and delays.

The other thing to note is that big inter-DC links are heavily QoS'd and contended, because they are both expensive and a bollock to maintain.

Also, from what I recall, 40 gig links are just parallel 10 gig links, so they have no lower latency. I'm not sure if 100/400 gig links are ten/forty lanes of 10 gig in parallel or actually able to issue packets at 10/40 times the rate of a 10 gig link. I've been away from networking too long.

replies(2): >>39446231 #>>39446971 #
7. scottlamb ◴[] No.39446231[source]
> Networks are not reliable, despite what you hear, so latency is used to mask re-tries and delays.

Of course, but even the 50%ile case is strangely slow, and if that involves retries something is deeply wrong.

replies(1): >>39446331 #
8. KaiserPro ◴[] No.39446331{3}[source]
You're right, but TCP doesn't like packets being dropped halfway through a stream. If you have a highly QoS'd link then you'll see latency spikes.
replies(1): >>39446582 #
9. scottlamb ◴[] No.39446582{4}[source]
Again, I'm not talking about spikes (though better tail latency is always desirable) but poor latency in the 50%ile case. And for high-QoS applications, not batch stuff. The Snap paper linked elsewhere in the thread shows 10 µs latencies; they've put in some optimization to achieve that, but I don't really understand why we don't expect something close to that with standard kernel networking and TCP.
replies(1): >>39448620 #
10. kccqzy ◴[] No.39446740{3}[source]
Snap was the external name for the internal project known as User Space Packet Service (abbreviated USPS) so naturally they renamed it prior to publication. I deployed an app using Pony Express in 2023 and it was available in the majority of cells worldwide. Pony Express supported more than just RPC though. The alternate RPC approach that you spoke of was called Void. It had been experimental for a long time and indeed it wasn't well known even inside Google.

> but I remember it wasn't that much better than 0.5 ms.

If you and I still worked at Google I'd just give you an automon dashboard link showing latency an order of magnitude better than that to prove myself…

replies(1): >>39447211 #
11. wmf ◴[] No.39446971[source]
> 40 gig links are just parallel 10 gig links, so they have no lower latency

That's not correct. Higher link speeds do have lower serialization latency, although that's a small fraction of overall network latency.
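
The serialization-delay piece is easy to put numbers on. A quick sketch for a full-size Ethernet frame (frame size and link speeds are the only inputs; the ~500 µs round-trip figure is the one quoted earlier in the thread):

    # Time to clock one 1500-byte frame onto the wire at various link speeds.
    FRAME_BITS = 1500 * 8

    for gbps in (10, 40, 100, 400):
        serialization_us = FRAME_BITS / (gbps * 1e9) * 1e6
        print(f"{gbps:>3} Gbps: {serialization_us:.2f} µs per frame")

    # 10G -> 1.20 µs, 100G -> 0.12 µs: a real saving, but tiny next to a ~500 µs
    # end-to-end round trip, so faster links alone don't close the gap.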

12. scottlamb ◴[] No.39447211{4}[source]
Interesting, thanks!

> If you and I still worked at Google I'd just give you an automon dashboard link showing latency an order of magnitude better than that to prove myself…

I believe you, and I think in principle we should all be getting the 50 µs latency you're describing within a datacenter with no special effort.

...but it doesn't match what I observed, and I'm not sure why. Maybe it's the difference of a couple of years. Maybe I was checking somewhere with older equipment, or there was some important config difference in our tests. And obviously my memory's a bit fuzzy by now, but I know I didn't like the result I got.

13. vitus ◴[] No.39448620{5}[source]
> The snap paper linked elsewhere in the thread shows 10 µs latencies; they've put in some optimization to achieve that, but I don't really understand why we don't expect close to that with standard kernel networking and TCP.

You can get similar results by looking at comparisons between DPDK and kernel networking. Most of the usual gap comes from not needing to context-switch for kernel interrupt handling, zero-copy abstractions, and busy polling (wherein you trade CPU for lower latency instead of sleeping between iterations if there's no work to be done).

https://talawah.io/blog/linux-kernel-vs-dpdk-http-performanc... goes into some detail comparing request throughput across an unoptimized kernel networking stack, an optimized kernel networking stack, and DPDK. I'm not aware of any benchmarks (public or private) comparing Snap vs DPDK vs Linux, so that's probably as close as you'll get.
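
As a toy illustration of the busy-polling trade-off only (this is not how DPDK or Snap is implemented, just the idea of spinning instead of sleeping while waiting for data):

    # Toy contrast between a blocking receive and a busy-poll receive loop.
    import socket

    def recv_blocking(sock: socket.socket) -> bytes:
        # Thread sleeps in the kernel until data arrives; the wakeup path
        # (interrupt, scheduler) adds latency.
        return sock.recv(4096)

    def recv_busy_poll(sock: socket.socket) -> bytes:
        # Spin on a non-blocking socket: burns a core, avoids being descheduled.
        sock.setblocking(False)
        while True:
            try:
                return sock.recv(4096)
            except BlockingIOError:
                continue   # nothing yet; poll again immediately instead of sleeping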

replies(2): >>39450493 #>>39457390 #
14. CoolCold ◴[] No.39450493{6}[source]
Great reading, thanks for the link on vanilla vs. DPDK.
15. Nemo_bis ◴[] No.39452641[source]
50 microseconds is a lot. I'm looking at disk read latency on a bunch of bare-metal servers (nothing fancy, just node_disk_read.* metrics from node-exporter), and one of the slowest fleets has a median disk read latency barely above 1 microsecond. (And that's with rather slow HDDs.)
replies(1): >>39453008 #
16. namibj ◴[] No.39453008{3}[source]
Samsung 980 Pro SSDs, the early-generation ones (they seem to have later replaced them with a different, likely worse architecture), average roughly 30 / 70 / 120 microseconds for 4k reads at, respectively, a single queue, 70% of the IOPS that the 120 µs point gets you, and maximum parallelism before latency goes through the roof.

The metrics you mention have to be pagecache hits. Basically all MLC NAND is in the double digit microseconds for uncontended random reads.
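
One way to take the page cache out of the picture is to read with O_DIRECT; a Linux-only sketch along those lines (the device path is a placeholder and needs read permission, and the mmap allocation is just an easy way to get the alignment O_DIRECT requires):

    # Latency of uncached 4 KiB random reads, bypassing the page cache (Linux).
    import mmap, os, random, time

    PATH, BLOCK, SAMPLES = "/dev/nvme0n1", 4096, 1000   # PATH is a placeholder

    fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
    size = os.lseek(fd, 0, os.SEEK_END)
    buf = mmap.mmap(-1, BLOCK)          # page-aligned buffer, as O_DIRECT requires

    lat_us = []
    for _ in range(SAMPLES):
        offset = random.randrange(size // BLOCK) * BLOCK
        t0 = time.perf_counter()
        os.preadv(fd, [buf], offset)    # hits the device, not the kernel cache
        lat_us.append((time.perf_counter() - t0) * 1e6)
    os.close(fd)

    lat_us.sort()
    print(f"p50 {lat_us[len(lat_us) // 2]:.1f} µs   "
          f"p99 {lat_us[int(SAMPLES * 0.99)]:.1f} µs")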

replies(1): >>39453391 #
17. Nemo_bis ◴[] No.39453391{4}[source]
They likely are cache hits, indeed (any suggestion what other metrics would be more comparable?). Still, at the end of the day I don't care whether a disk operation was made fast by kernel caching or by some other optimization at a lower level; I only care about the final result. With public cloud virtualization there are more layers where something may go wrong, and good luck getting answers from Amazon or Microsoft if your performance turns out to be abysmal.
replies(1): >>39458714 #
18. scottlamb ◴[] No.39457390{6}[source]
Thanks for the link. How does this compare to the analogous situation for SSD access? I know there are also userspace IO stacks for similar reasons, but it seems like SSD-via-kernel is way ahead of network-via-kernel in the sense that it adds less latency per operation over the best userspace stack.
replies(1): >>39461329 #
19. ◴[] No.39458714{5}[source]
20. vitus ◴[] No.39461329{7}[source]
I can't say offhand, since I don't have much experience with storage.

I'd expect that most of the work for an SSD read is offloaded to the disk controller, which presumably uses DMA, and you don't have nearly as many round trips (a sequential read can be done with a single SCSI command).

I'm inclined to agree with the explanation given by other commenters that the limiting factor for SSD r/w speeds in the cloud is due to throttling in the hypervisor to provide users with predictable performance as well as isolation in a multitenant environment.