
SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points by greghn | 4 comments
pclmulqdq No.39443994
This was a huge technical problem I worked on at Google, and it is sort of fundamental to a cloud. I believe this is actually a big deal that drives people's technology directions.

SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because hard drives are fundamentally slow compared to networks. But it is a problem for SSDs.

scottlamb No.39444952
> The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because hard drives are fundamentally slow compared to networks. But it is a problem for SSDs.

It's certainly true that SSD bandwidth and latency improvements are hard to match, but I don't understand why intra-datacenter network latency in particular is so bad. This ~2020 (I think) version of "Latency Numbers Everyone Should Know" says 0.5 ms for an intra-datacenter round trip (and mentions a "10 Gbps network" on another line). [1] A 2012 version said the same thing (while only mentioning a "1 Gbps network"). [2] Why no improvement? That 2020 version might be a bit conservative on this line, and nice datacenters may even have multiple 100 Gbit/sec NICs per machine in 2024, but the round trip still seems strangely bad.

I've seen experimental networking stuff (e.g. RDMA) that claims significantly better latency, so I don't think it's a physical limitation of the networking gear, but rather something in the machine/OS interaction area. I would design large distributed systems significantly differently (and be much more excited about extra tiers in my stack) if the standard RPC system offered, say, 10 µs typical round-trip latency.

[1] https://static.googleusercontent.com/media/sre.google/en//st...

[2] https://gist.github.com/jboner/2841832
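
For a sense of what a "round trip" means at the application level, here is a minimal ping-pong sketch, assuming a plain TCP echo on Linux; the host, port, and message size are placeholders, and over loopback it mostly measures kernel and interpreter overhead, so a real test would point the client at another machine in the same datacenter:

    import socket
    import threading
    import time

    HOST, PORT = "127.0.0.1", 9000   # placeholders: use a remote host for a real network RTT
    MSG = b"x" * 64                  # tiny payload: latency-bound, not bandwidth-bound

    def echo_server():
        # Trivial echo server; in a real test this runs on the other machine.
        with socket.create_server((HOST, PORT)) as srv:
            conn, _ = srv.accept()
            with conn:
                while data := conn.recv(len(MSG)):
                    conn.sendall(data)

    threading.Thread(target=echo_server, daemon=True).start()
    time.sleep(0.2)                  # give the server a moment to start listening

    with socket.create_connection((HOST, PORT)) as c:
        c.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # don't batch tiny writes
        rtts = []
        for _ in range(10_000):
            t0 = time.perf_counter_ns()
            c.sendall(MSG)
            c.recv(len(MSG))         # sketch assumes the small message arrives in one piece
            rtts.append(time.perf_counter_ns() - t0)

    rtts.sort()
    print(f"p50 {rtts[len(rtts) // 2] / 1000:.1f} µs, "
          f"p99 {rtts[int(len(rtts) * 0.99)] / 1000:.1f} µs")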

kccqzy No.39445433
That document is probably deliberately on the pessimistic side to encourage your code to be portable across all kinds of "data centers" (however that is defined). When I previously worked at Google, the standard RPC system definitely offered 50 microseconds of round trip latency at the median (I measured it myself in a real application), and their advanced user-space implementation called Snap could offer about 10 microseconds of round trip latency. The latter figure comes from page 9 of https://storage.googleapis.com/gweb-research2023-media/pubto...

> nice datacenters may even have multiple 100 Gbit/sec NICs per machine in 2024,

Google exceeded 100 Gbps per machine long before 2024. IIRC it had been 400 Gbps for a while.
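
As a rough sanity check on why raw NIC bandwidth alone doesn't close the gap (this arithmetic is illustrative, not from the thread):

    # Serialization time of a 4 KiB block at several NIC speeds, versus a ~50 µs RPC round trip.
    BLOCK_BITS = 4096 * 8
    for gbps in (10, 100, 400):
        wire_ns = BLOCK_BITS / (gbps * 1e9) * 1e9
        print(f"{gbps:>3} Gbps: {wire_ns:7.0f} ns on the wire")
    # Even at 10 Gbps the wire time (~3.3 µs) is small next to a ~50 µs round trip,
    # so the latency mostly lives in software, switching, and queueing rather than link speed.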

Nemo_bis No.39452641
50 microseconds is a lot. I'm looking at disk read latency on a bunch of bare-metal servers (nothing fancy, just node_disk_read.* metrics from node-exporter) and one of the slowest fleets has a median disk read latency barely above 1 microsecond. (And that's with rather slow HDDs.)
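
For context on where the node_disk_read.* numbers come from: node-exporter derives them from the kernel's block-layer counters in /proc/diskstats. A rough sketch of the same calculation, assuming Linux and an arbitrary 10-second window (this gives a per-device average over the window, not a median; presumably the fleet-wide median is then taken across devices or hosts):

    import time

    def read_diskstats():
        # /proc/diskstats holds the "reads completed" and "milliseconds spent reading"
        # counters that node-exporter exposes as node_disk_reads_completed_total and
        # node_disk_read_time_seconds_total.
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                dev = fields[2]
                reads_completed = int(fields[3])
                read_time_ms = int(fields[6])
                stats[dev] = (reads_completed, read_time_ms)
        return stats

    before = read_diskstats()
    time.sleep(10)                      # observation window (arbitrary)
    after = read_diskstats()

    for dev, (reads_now, ms_now) in after.items():
        reads_then, ms_then = before.get(dev, (reads_now, ms_now))
        d_reads = reads_now - reads_then
        if d_reads > 0:
            avg_us = (ms_now - ms_then) / d_reads * 1000
            print(f"{dev}: {avg_us:.1f} µs average read latency")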
namibj No.39453008
Samsung 980 Pro SSDs, the early-generation ones (they seem to have later replaced them with a different, likely worse architecture), average roughly 30/70/120 microseconds for 4k reads: ~30 µs at a single queue depth, ~70 µs at about 70% of the IOPS that the 120 µs point delivers, and ~120 µs at the maximum parallelism before latency goes through the roof.

The metrics you mention have to be pagecache hits. Basically all MLC NAND is in the double-digit microseconds for uncontended random reads.
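
One way to take the page cache out of the picture is to time 4k random reads with O_DIRECT, which forces each read to be served by the device itself. A rough Linux-only sketch, where the device path is a placeholder and reading it raw requires root:

    import mmap
    import os
    import random
    import time

    DEV = "/dev/nvme0n1"        # placeholder block device
    BLOCK = 4096
    SAMPLES = 1000

    # O_DIRECT bypasses the page cache, so every read goes to the device.
    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
    size = os.lseek(fd, 0, os.SEEK_END)
    buf = mmap.mmap(-1, BLOCK)  # page-aligned buffer, required for O_DIRECT

    lat_us = []
    for _ in range(SAMPLES):
        offset = random.randrange(size // BLOCK) * BLOCK   # aligned random 4k offset
        t0 = time.perf_counter_ns()
        os.preadv(fd, [buf], offset)
        lat_us.append((time.perf_counter_ns() - t0) / 1000)
    os.close(fd)

    lat_us.sort()
    print(f"median {lat_us[len(lat_us) // 2]:.1f} µs, "
          f"p99 {lat_us[int(len(lat_us) * 0.99)]:.1f} µs")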

Nemo_bis No.39453391
They likely are cache hits, indeed (any suggestions for what other metrics would be more comparable?). Still, at the end of the day I don't care whether a disk operation was made fast by kernel caching or by some other optimization at a lower level; I only care about the final result. With public cloud virtualization there are more layers where something may go wrong, and good luck finding answers from Amazon or Microsoft if your performance turns out to be abysmal.