
SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points by greghn
pclmulqdq ◴[] No.39443994[source]
This was a huge technical problem I worked on at Google, and it is sort of fundamental to the cloud. I believe this is actually a big deal that drives people's technology directions.

SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because hard drives are fundamentally slow compared to networks, but it is a problem for SSDs.
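
As a rough illustration of the gap being described, here is a back-of-envelope sketch in Go. The latency figures (a ~100 µs local NVMe read, a ~0.5 ms in-datacenter round trip) are assumed round numbers for illustration, not measurements from this thread:

```go
// Back-of-envelope comparison of queue-depth-1 read latency for a local
// NVMe SSD vs. the same class of SSD reached over one network hop.
// All numbers are illustrative assumptions, not measurements.
package main

import (
	"fmt"
	"time"
)

func main() {
	localRead := 100 * time.Microsecond  // assumed local NVMe 4 KiB read
	networkRTT := 500 * time.Microsecond // assumed in-datacenter RPC round trip
	remoteRead := localRead + networkRTT // one storage-server hop, ignoring queueing

	for _, c := range []struct {
		name string
		lat  time.Duration
	}{
		{"local NVMe", localRead},
		{"network-attached", remoteRead},
	} {
		// At queue depth 1, a single thread can complete at most 1/latency reads per second.
		iops := float64(time.Second) / float64(c.lat)
		fmt.Printf("%-17s latency=%v  QD1 IOPS≈%.0f\n", c.name, c.lat, iops)
	}
}
```

With those assumed numbers, a single outstanding read is roughly 6x slower once a network hop is involved. Real systems recover throughput by issuing many reads in parallel, but not the per-request latency.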

replies(30): >>39444009 #>>39444024 #>>39444028 #>>39444046 #>>39444062 #>>39444085 #>>39444096 #>>39444099 #>>39444120 #>>39444138 #>>39444328 #>>39444374 #>>39444396 #>>39444429 #>>39444655 #>>39444952 #>>39445035 #>>39445917 #>>39446161 #>>39446248 #>>39447169 #>>39447467 #>>39449080 #>>39449287 #>>39449377 #>>39449994 #>>39450169 #>>39450172 #>>39451330 #>>39466088 #
scottlamb ◴[] No.39444952[source]
> The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which was the backing technology when a lot of these network attached storage systems were invented, because they are fundamentally slow compared to networks, but it is a problem for SSD.

Certainly true that SSD bandwidth and latency improvements are hard to match, but I don't understand why intra-datacenter network latency in particular is so bad. This (I think ~2020) version of "Latency Numbers Everyone Should Know" says 0.5 ms for a same-datacenter round trip (and mentions a "10 Gbps network" on another line). [1] A 2012 version said the same thing (and only mentions a "1 Gbps network"). [2] Why no improvement? The 2020 version might be a bit conservative on this line, and nice datacenters may even have multiple 100 Gbit/sec NICs per machine in 2024, but the round trip still seems strangely bad.

I've seen experimental networking stuff (e.g. RDMA) that claims significantly better latency, so I don't think it's a physical limitation of the networking gear but rather something in the machine/OS interaction layer. I would design large distributed systems significantly differently (be much more excited about extra tiers in my stack) if the standard RPC system offered, say, 10 µs typical round-trip latency.

[1] https://static.googleusercontent.com/media/sre.google/en//st...

[2] https://gist.github.com/jboner/2841832
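
One way to see where a 0.5 ms figure might come from is a minimal ping-pong over ordinary kernel TCP sockets: run against loopback it shows software-stack overhead alone, and pointed at another host it approximates the RPC-level round trip being discussed. A sketch in Go (the one-byte payload and iteration count are arbitrary choices):

```go
// Minimal TCP ping-pong round-trip measurement through the normal kernel
// socket path. Illustrative sketch only.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Echo server on an ephemeral loopback port.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		buf := make([]byte, 1)
		for {
			if _, err := conn.Read(buf); err != nil {
				return
			}
			conn.Write(buf) // echo the byte back
		}
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	conn.(*net.TCPConn).SetNoDelay(true) // Go's default; set explicitly for clarity

	const iters = 10000
	buf := make([]byte, 1)
	start := time.Now()
	for i := 0; i < iters; i++ {
		conn.Write(buf) // 1-byte "ping"
		if _, err := conn.Read(buf); err != nil { // wait for the "pong"
			panic(err)
		}
	}
	fmt.Printf("mean round trip over %d iterations: %v\n", iters, time.Since(start)/iters)
}
```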

replies(3): >>39445409 #>>39445433 #>>39446206 #
kccqzy ◴[] No.39445433[source]
That document is probably deliberately on the pessimistic side to encourage your code to be portable across all kinds of "data centers" (however that is defined). When I previously worked at Google, the standard RPC system definitely offered 50 microseconds of round trip latency at the median (I measured it myself in a real application), and their advanced user-space implementation called Snap could offer about 10 microseconds of round trip latency. The latter figure comes from page 9 of https://storage.googleapis.com/gweb-research2023-media/pubto...

> nice datacenters may even have multiple 100 Gbit/sec NICs per machine in 2024,

Google exceeded 100Gbps per machine long before 2024. IIRC it had been 400Gbps for a while.
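
For what a median like that means in practice, the measurement itself is simple to reproduce in shape: time each call at the client and report percentiles over many samples. A sketch in Go, where rpcCall is a hypothetical stand-in for whatever client stub the application actually used:

```go
// Sketch of collecting a "measured it myself" latency number: time each
// RPC at the call site and report p50/p99. rpcCall is a placeholder, not
// a real RPC system.
package main

import (
	"fmt"
	"sort"
	"time"
)

func rpcCall() { time.Sleep(50 * time.Microsecond) } // placeholder for a real round trip

func main() {
	const n = 1000
	samples := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		rpcCall()
		samples = append(samples, time.Since(start))
	}
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
	fmt.Println("p50:", samples[n*50/100])
	fmt.Println("p99:", samples[n*99/100])
}
```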

replies(3): >>39445578 #>>39445672 #>>39452641 #
scottlamb ◴[] No.39445578[source]
Interesting. I worked at Google until January 2021. I see 2019 dates on that PDF, but I wasn't aware of Snap when I left. There was some alternate RPC approach (Pony Express, maybe? I get the names mixed up) that claimed 10 µs or so, but it was advertised as experimental (IIRC it had some bad failure modes in practice at the time) and was simply unavailable in many of the datacenters I needed to deploy in. Maybe they're two names for the same thing. [edit: oh, yes, starting to actually read the paper now, and: "Through Snap, we created a new communication stack called Pony Express that implements a custom reliable transport and communications API."]

Actual latency with standard Stubby-over-TCP and warmed channels... it's been a while, so I don't remember the exact number I observed, but I remember it wasn't that much better than 0.5 ms. It was still bad enough that I didn't want to add a tier that would have helped with isolation in a particularly high-reliability system.
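
A rough open-source analog of that setup, with Go's net/rpc standing in for Stubby and loopback standing in for the datacenter network, would measure the warmed-channel round trip like this. It is only a sketch of the shape of the measurement, not the original one:

```go
// Warmed-channel RPC round-trip measurement: dial once, make one call to
// warm the connection, then time subsequent calls. net/rpc is used here
// only as a stand-in for a production RPC system.
package main

import (
	"fmt"
	"net"
	"net/rpc"
	"time"
)

type Echo struct{}

func (Echo) Ping(req int, resp *int) error { *resp = req; return nil }

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	srv := rpc.NewServer()
	if err := srv.Register(Echo{}); err != nil {
		panic(err)
	}
	go srv.Accept(ln)

	client, err := rpc.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	var out int
	client.Call("Echo.Ping", 1, &out) // warm the connection first

	const iters = 10000
	start := time.Now()
	for i := 0; i < iters; i++ {
		if err := client.Call("Echo.Ping", i, &out); err != nil {
			panic(err)
		}
	}
	fmt.Printf("mean RPC round trip: %v\n", time.Since(start)/iters)
}
```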

replies(1): >>39446740 #
kccqzy ◴[] No.39446740[source]
Snap was the external name for the internal project known as User Space Packet Service (abbreviated USPS), so naturally they renamed it prior to publication. I deployed an app using Pony Express in 2023, and it was available in the majority of cells worldwide. Pony Express supported more than just RPC, though. The alternate RPC approach that you spoke of was called Void. It had been experimental for a long time, and indeed it wasn't well known even inside Google.

> but I remember it wasn't that much better than 0.5 ms.

If you and I still worked at Google I'd just give you an automon dashboard link showing latency an order of magnitude better than that to prove myself…

replies(1): >>39447211 #
scottlamb ◴[] No.39447211[source]
Interesting, thanks!

> If you and I still worked at Google I'd just give you an automon dashboard link showing latency an order of magnitude better than that to prove myself…

I believe you, and I think in principle we should all be getting the 50 µs latency you're describing within a datacenter with no special effort.

...but it doesn't match what I observed, and I'm not sure why. Maybe it's the difference of a couple of years. Maybe I was checking somewhere with older equipment, or there was some important config difference in our tests. And obviously my memory is a bit fuzzy by now, but I know I didn't like the result I got.