
SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points by greghn | 9 comments
pclmulqdq ◴[] No.39443994[source]
This was a huge technical problem I worked on at Google, and it is sort of fundamental to a cloud. I believe this is actually a big deal that drives people's technology directions.

SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because hard drives are fundamentally slow compared to networks, but it is a problem for SSDs.
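To put rough numbers on that flip: a spinning disk seek is on the order of 5-10 ms, so a ~0.5 ms network round trip was noise; a local NVMe read is on the order of 100 µs or less, so that same ~0.5 ms round trip now dominates the device latency several times over. (Ballpark figures, not measurements from any particular cloud.)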

replies(30): >>39444009 #>>39444024 #>>39444028 #>>39444046 #>>39444062 #>>39444085 #>>39444096 #>>39444099 #>>39444120 #>>39444138 #>>39444328 #>>39444374 #>>39444396 #>>39444429 #>>39444655 #>>39444952 #>>39445035 #>>39445917 #>>39446161 #>>39446248 #>>39447169 #>>39447467 #>>39449080 #>>39449287 #>>39449377 #>>39449994 #>>39450169 #>>39450172 #>>39451330 #>>39466088 #
scottlamb ◴[] No.39444952[source]
> The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because hard drives are fundamentally slow compared to networks, but it is a problem for SSDs.

Certainly true that SSD bandwidth and latency improvements are hard to match, but I don't understand why intra-datacenter network latency in particular is so bad. This ~2020-I-think version of "Latency Numbers Everyone Should Know" says 0.5 ms round trip (and mentions "10 Gbps network" on another line). [1] It was the same in a 2012 version (which only mentions "1 Gbps network"). [2] Why no improvement? I think the 2020 version might have been a bit conservative on this line, and nice datacenters may even have multiple 100 Gbit/sec NICs per machine in 2024, but I still think the round trip is strangely bad.

I've seen experimental networking stuff (e.g. RDMA) that claims significantly better latency, so I don't think it's a physical limitation of the networking gear but rather something at the machine/OS interaction area. I would design large distributed systems significantly differently (be much more excited about extra tiers in my stack) if the standard RPC system offered say 10 µs typical round trip latency.

[1] https://static.googleusercontent.com/media/sre.google/en//st...

[2] https://gist.github.com/jboner/2841832
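For anyone who wants to sanity-check that 50%ile number themselves, a minimal ping-pong probe is enough. The sketch below (my own illustration, with arbitrary defaults of 127.0.0.1:7777) sends a 64-byte message over TCP, waits for the echo, and prints the p50/p99 round trip; point it at any TCP echo service on another machine in the same datacenter and compare against the ~0.5 ms figure. TCP_NODELAY is set so Nagle's algorithm doesn't delay the small writes.

    /*
     * Minimal TCP ping-pong latency probe (illustrative sketch, not from the
     * thread): sends a 64-byte message, waits for the echo, repeats, and
     * prints the p50/p99 round trip.
     */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    static int cmp_i64(const void *a, const void *b) {
        int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
        return (x > y) - (x < y);
    }

    int main(int argc, char **argv) {
        const char *host = argc > 1 ? argv[1] : "127.0.0.1"; /* arbitrary defaults */
        int port = argc > 2 ? atoi(argv[2]) : 7777;
        enum { ITERS = 10000, MSG = 64 };
        static int64_t samples[ITERS];
        char buf[MSG] = {0};

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }
        int one = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)); /* defeat Nagle */

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons((uint16_t)port);
        inet_pton(AF_INET, host, &addr.sin_addr);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            perror("connect");
            return 1;
        }

        for (int i = 0; i < ITERS; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (write(fd, buf, MSG) != MSG) { perror("write"); return 1; }
            ssize_t got = 0;
            while (got < MSG) {                      /* echo may arrive in pieces */
                ssize_t n = read(fd, buf + got, MSG - got);
                if (n <= 0) { perror("read"); return 1; }
                got += n;
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            samples[i] = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                       + (t1.tv_nsec - t0.tv_nsec);
        }
        qsort(samples, ITERS, sizeof(samples[0]), cmp_i64);
        printf("p50 %.1f us, p99 %.1f us\n",
               samples[ITERS / 2] / 1e3, samples[(ITERS * 99) / 100] / 1e3);
        close(fd);
        return 0;
    }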

replies(3): >>39445409 #>>39445433 #>>39446206 #
1. KaiserPro ◴[] No.39446206[source]
Networks are not reliable, despite what you hear, so latency headroom is used to mask retries and delays.

The other thing to note is that big inter-DC links are heavily QoS'd and contended, because they are both expensive and a bollock to maintain.

Also, from what I recall, 40gig links are just parallel 10 gig links, so they have no lower latency. I'm not sure if 100/400 gig is ten/forty lanes of ten gig in parallel or actually able to issue packets at 10/40 times the rate of a ten gig link. I've been away from networking too long.

replies(2): >>39446231 #>>39446971 #
2. scottlamb ◴[] No.39446231[source]
> Networks are not reliable, despite what you hear, so latency headroom is used to mask retries and delays.

Of course, but even the 50%ile case is strangely slow, and if that involves retries something is deeply wrong.

replies(1): >>39446331 #
3. KaiserPro ◴[] No.39446331[source]
You're right, but TCP doesn't like packets being dropped halfway through a stream. If you have a highly QoS'd link then you'll see latency spikes.
replies(1): >>39446582 #
4. scottlamb ◴[] No.39446582{3}[source]
Again, I'm not talking about spikes (though better tail latency is always desirable) but poor latency in the 50%ile case. And for high-QoS applications, not batch stuff. The Snap paper linked elsewhere in the thread shows 10 µs latencies; they've put in some optimization to achieve that, but I don't really understand why we don't expect close to that with standard kernel networking and TCP.
replies(1): >>39448620 #
5. wmf ◴[] No.39446971[source]
> 40gig links are just parallel 10 gig links, so they have no lower latency

That's not correct. Higher link speeds do have lower serialization latency, although that's a small fraction of overall network latency.
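For scale: serializing a 1500-byte (12,000-bit) frame onto the wire takes 12,000 bits / 10 Gbps = 1.2 µs, 0.3 µs at 40 Gbps, and 0.12 µs at 100 Gbps, so even several hops' worth of serialization is small next to round trips measured in hundreds of microseconds. (Back-of-envelope numbers, ignoring preamble and framing overhead.)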

6. vitus ◴[] No.39448620{4}[source]
> The Snap paper linked elsewhere in the thread shows 10 µs latencies; they've put in some optimization to achieve that, but I don't really understand why we don't expect close to that with standard kernel networking and TCP.

You can get similar results by looking at comparisons between DPDK and kernel networking. Most of the usual gap comes from avoiding context switches for kernel interrupt handling, from zero-copy abstractions, and from busy polling (where you trade CPU for lower latency by spinning instead of sleeping when there's no work to be done).

https://talawah.io/blog/linux-kernel-vs-dpdk-http-performanc... goes into some detail comparing the request throughput of an unoptimized kernel networking stack, an optimized kernel networking stack, and DPDK. I'm not aware of any benchmarks (public or private) comparing Snap vs DPDK vs Linux, so that's probably as close as you'll get.
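To get a feel for the busy-polling part of that trade-off without going all the way to DPDK, Linux also exposes SO_BUSY_POLL, which lets a blocking receive spin in the NIC driver for a bounded number of microseconds instead of sleeping until the next interrupt. A rough sketch (illustrative only, not from the linked benchmark; the 50 µs value is arbitrary):

    /*
     * Illustrative only: kernel-level busy polling on a UDP socket, so a
     * blocking receive spins in the NIC driver for up to N microseconds
     * instead of sleeping until the next interrupt. Linux-specific; may
     * require CAP_NET_ADMIN.
     */
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        int busy_poll_us = 50;  /* spin up to 50 us per blocking receive */
        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                       &busy_poll_us, sizeof(busy_poll_us)) != 0) {
            perror("setsockopt(SO_BUSY_POLL)");
            return 1;
        }

        /* bind() and recv() as usual from here; latency-sensitive reads now
         * trade CPU for lower wakeup latency, the same trade DPDK makes in
         * userspace, just without bypassing the kernel entirely. */
        printf("busy polling enabled on fd %d\n", fd);
        return 0;
    }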

replies(2): >>39450493 #>>39457390 #
7. CoolCold ◴[] No.39450493{5}[source]
Great reading, thanks for the link on vanilla vs DPDK.
8. scottlamb ◴[] No.39457390{5}[source]
Thanks for the link. How does this compare to the analogous situation for SSD access? I know there are also userspace IO stacks for similar reasons, but it seems like SSD-via-kernel is way ahead of network-via-kernel in the sense that it adds less latency per operation over the best userspace stack.
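For reference, the low-overhead kernel path on the storage side looks roughly like the io_uring sketch below (my illustration, not from this thread; SPDK would be the full userspace-bypass analogue). It times a single 4 KiB O_DIRECT read of whatever file or device you pass in; link against liburing with -luring.

    /*
     * Illustrative sketch: time one 4 KiB read submitted through io_uring.
     * Build: gcc -O2 io_uring_read.c -luring
     * The argument is any file or block device at least 4 KiB long that can
     * be opened with O_DIRECT.
     */
    #define _GNU_SOURCE           /* for O_DIRECT */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s <file-or-device>\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY | O_DIRECT);  /* O_DIRECT: skip the page cache */
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        int ret = io_uring_queue_init(8, &ring, 0);
        if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1; /* O_DIRECT needs alignment */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, 4096, 0);   /* one 4 KiB read at offset 0 */
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("res=%d, latency=%.1f us\n", cqe->res,
               (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3);

        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        free(buf);
        close(fd);
        return 0;
    }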
replies(1): >>39461329 #
9. vitus ◴[] No.39461329{6}[source]
I can't say offhand, since I don't have much experience with storage.

I'd expect that most of the work of an SSD read is offloaded to the disk controller, which presumably uses DMA, and you don't have nearly as many round trips (a sequential read can be done with a single SCSI command).

I'm inclined to agree with the explanation given by other commenters that the limiting factor for SSD r/w speeds in the cloud is throttling in the hypervisor, done to give users predictable performance and isolation in a multitenant environment.