
SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points greghn | 22 comments | | HN request time: 0.472s | source | bottom
pclmulqdq ◴[] No.39443994[source]
This was a huge technical problem I worked on at Google, and it's sort of fundamental to the cloud. I believe this is actually a big deal that drives people's technology directions.

SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because hard drives are fundamentally slow compared to networks, but it is a problem for SSDs.
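The claim above can be sketched with back-of-envelope arithmetic (all latency numbers are assumptions for illustration, not from the comment): a datacenter round trip is negligible next to an HDD seek, but dominates an NVMe read.

```python
# Illustrative numbers (assumed): compare device access latency to a typical
# intra-datacenter round trip to see why network attachment was fine for
# hard drives but is painful for SSDs.

NETWORK_RTT_S = 0.5e-3   # ~0.5 ms round trip within a datacenter (assumed)
HDD_SEEK_S    = 10e-3    # ~10 ms average seek + rotation for a hard drive
NVME_READ_S   = 100e-6   # ~100 us for a local NVMe 4 KiB random read

def network_overhead(device_latency_s: float, rtt_s: float = NETWORK_RTT_S) -> float:
    """Factor by which remote attachment inflates a single random read."""
    return (device_latency_s + rtt_s) / device_latency_s

print(f"HDD over network:  {network_overhead(HDD_SEEK_S):.2f}x slower")   # ~1.05x
print(f"NVMe over network: {network_overhead(NVME_READ_S):.2f}x slower")  # ~6x
```

With these numbers the network adds about 5% to an HDD access but makes an NVMe read roughly six times slower, which is the asymmetry the comment describes.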

replies(30): >>39444009 #>>39444024 #>>39444028 #>>39444046 #>>39444062 #>>39444085 #>>39444096 #>>39444099 #>>39444120 #>>39444138 #>>39444328 #>>39444374 #>>39444396 #>>39444429 #>>39444655 #>>39444952 #>>39445035 #>>39445917 #>>39446161 #>>39446248 #>>39447169 #>>39447467 #>>39449080 #>>39449287 #>>39449377 #>>39449994 #>>39450169 #>>39450172 #>>39451330 #>>39466088 #
1. _Rabs_ ◴[] No.39444028[source]
So much of this. The number of times I've seen someone complain about slow DB performance when they're connecting to it from a different VPC, bottlenecking themselves to 100 Mbit/s, is stupidly high.

It literally depends on where things are in the data center. If you're tightly coupled on a 10G line to the same switch, going to the same server rack, I bet your performance will be much more consistent.

replies(3): >>39444090 #>>39444438 #>>39505345 #
2. bugbuddy ◴[] No.39444090[source]
Aren’t 10G and 100G connections standard nowadays in data centers? Heck, I thought they were standard 10 years ago.
replies(4): >>39444293 #>>39444309 #>>39444315 #>>39446155 #
3. geerlingguy ◴[] No.39444293[source]
Datacenters are up to 400 Gbps and beyond (many places are adopting 1+ Tbps on core switching).

However, individual servers may still operate at 10, 25, or 40 Gbps to save cost on the thousands of NICs in a row of racks. Alternatively, servers with multiple 100G connections split that bandwidth allocation up among dozens of VMs so each one gets 1 or 10G.

4. nixass ◴[] No.39444309[source]
400G is a fairly normal thing in DCs nowadays
5. pixl97 ◴[] No.39444315[source]
Bandwidth-delay product doesn't help serialized transactions. If you're reaching out to disk for results, or you have locking transactions on a table, the achievable operation rate drops dramatically as latency between the host and the disk increases.
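The point above can be made concrete (latency figures are assumptions, not from the comment): when each operation must wait for the previous one, throughput is bounded by 1/latency regardless of link bandwidth.

```python
# Illustrative sketch (numbers assumed): a dependent chain of operations,
# e.g. transactions holding a lock across a disk read, runs one at a time,
# so adding bandwidth cannot raise its throughput.

def serialized_ops_per_sec(per_op_latency_s: float) -> float:
    """Upper bound on a strictly serialized chain: one op per latency period."""
    return 1.0 / per_op_latency_s

local_ssd  = serialized_ops_per_sec(100e-6)  # ~100 us local read
remote_ssd = serialized_ops_per_sec(1e-3)    # ~1 ms networked read
print(f"local:  {local_ssd:,.0f} ops/s")     # ~10,000 ops/s
print(f"remote: {remote_ssd:,.0f} ops/s")    # ~1,000 ops/s
```

A 10x latency increase means a 10x throughput drop for the serialized path, even on a 400G link.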
replies(1): >>39444997 #
6. silverquiet ◴[] No.39444438[source]
> Literally depending on where things are in a data center

I thought cloud was supposed to abstract this away? That's a bit of a sarcastic question from a long-time cloud skeptic, but... wasn't it?

replies(3): >>39444488 #>>39445334 #>>39446736 #
7. doubled112 ◴[] No.39444488[source]
Reality always beats the abstraction. After all, it's just somebody else's computer in somebody else's data center.
replies(1): >>39444553 #
8. bombcar ◴[] No.39444553{3}[source]
Which can cause considerable "amusement" depending on the provider. One I won't name directly, but which is much more centered on renting actual racks than on their (now) cloud offering: if you had a virtual machine older than a year or so, deleting and restoring it would land you on a newer host, and you'd be faster for the same cost.

Otherwise it would stay on the same physical piece of hardware it was allocated to when new.

replies(1): >>39444620 #
9. doubled112 ◴[] No.39444620{4}[source]
Amusing is a good description.

"Hardware degradation detected, please turn it off and back on again"

I could do a migration with zero downtime in VMware for a decade, but they can't seamlessly move my VM to a machine that works in 2024? Great, thanks. Amusing.

replies(2): >>39445263 #>>39445751 #
10. bee_rider ◴[] No.39444997{3}[source]
The typical way to trade bandwidth away for latency would, I guess, be speculative requests. In the CPU world at least. I wonder if any cloud providers have some sort of framework built around speculative disk reads (or maybe it is a totally crazy trade to make in this context)?
replies(3): >>39446077 #>>39446671 #>>39448491 #
11. bombcar ◴[] No.39445263{5}[source]
I have always been incredibly saddened that apparently the cloud providers usually have nothing as advanced as old VMware was.
12. kccqzy ◴[] No.39445334[source]
It's more a matter of adding additional abstraction layers. For example, in most public clouds the best you can hope for is to place two things in the same availability zone to get the best performance. But when I worked at Google, internally they had more sophisticated colocation constraints than that: for example, you could require two things to be on the same rack.
13. wmf ◴[] No.39445751{5}[source]
Cloud providers have live migration now but I guess they don't want to guarantee anything.
replies(1): >>39447261 #
14. pixl97 ◴[] No.39446077{4}[source]
I mean, we already have readahead in the kernel.

That said, the problem can get more complex than this really fast: write barriers, for example, and dirty caches. Any application that forces writes, with the writes enforced by the kernel, is going to suffer.

The same is true for SSD settings. There are a number of tweakable values on SSDs around write commit and cache usage that can affect performance. Desktop OSes tend to play fast and loose with these settings, while server defaults tend to be more conservative.
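The two kernel mechanisms mentioned, readahead hints and forced writes, are both exposed to applications. A minimal sketch (Linux-specific; the file path is hypothetical):

```python
# Sketch of the knobs discussed above: fsync forces a write barrier (the call
# does not return until data is durable), and posix_fadvise hints the kernel's
# readahead. Linux-specific; /tmp path is just for illustration.
import os

path = "/tmp/readahead_demo.dat"

# Forced write: flush user-space buffers, then fsync to push past the page cache.
with open(path, "wb") as f:
    f.write(b"x" * (1024 * 1024))
    f.flush()
    os.fsync(f.fileno())  # this is the expensive barrier databases pay for

# Readahead hint: tell the kernel we will read sequentially so it can prefetch.
fd = os.open(path, os.O_RDONLY)
try:
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    data = os.pread(fd, 4096, 0)
finally:
    os.close(fd)
print(len(data))
```

The fsync call is exactly the "application that forces writes" case: every transaction commit that fsyncs pays the full storage latency, which is where networked SSDs hurt most.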

15. KaiserPro ◴[] No.39446155[source]
Yes, but you have to think about contention. While the top-of-rack switch might have 2x400 gig links to the core, that's shared with the entire rack and all the other machines trying to shout at the core switching infra.

Then links go away, or routes get congested, etc., etc.
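The contention point is easy to quantify (rack size and link speeds here are assumptions, not from the comment): compare the aggregate NIC bandwidth in a rack to the uplink capacity it shares.

```python
# Back-of-envelope oversubscription math (all figures assumed): a rack of
# 25G servers behind a top-of-rack switch with 2x400G uplinks to the core.

servers_per_rack = 40
nic_gbps = 25
uplink_gbps = 2 * 400

aggregate_demand_gbps = servers_per_rack * nic_gbps          # 1000 Gbps
oversubscription = aggregate_demand_gbps / uplink_gbps       # 1.25
print(f"{oversubscription:.2f}:1 oversubscribed")
```

Even a modest 1.25:1 ratio means that if every server bursts at once, nobody gets their full NIC speed through the core; real deployments often run much higher ratios.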

16. treflop ◴[] No.39446671{4}[source]
Oftentimes it's the app (or something high-level) that would need to make speculative requests, which may not be possible in the given domain.

I don’t think it’s possible in most domains.

17. treflop ◴[] No.39446736[source]
Cloud makes provisioning more servers quicker because you're paying someone to have a bunch of servers ready to go with an API call instead of a phone call, maintained by a team that isn't yours, with economies of scale working for the provider.

Cloud does not do anything else.

None of these latency/speed problems are cloud-specific. If you have on-premise servers and you are storing your data on network-attached storage, you have the exact same problems (and also the same advantages).

Unfortunately the gap between local and network storage is wide. You win some, you lose some.

replies(1): >>39447952 #
18. bombcar ◴[] No.39447261{6}[source]
It's better (and better still with other providers), but I naively thought that "add more RAM" or "add more disk" was something they would be able to do with a reboot at most.

Nope, some require a full backup and restore.

replies(1): >>39447719 #
19. wmf ◴[] No.39447719{7}[source]
Resizing VMs doesn't really fit the "cattle" thinking of public cloud, although IMO that was kind of a premature optimization. This would be a perfect use case for live migration.
20. silverquiet ◴[] No.39447952{3}[source]
Oh, I'm not a complete neophyte (in what seems like a different life now, I worked for a big hosting provider actually), I was just surprised that there was a big penalty for cross-VPC traffic implied by the parent poster.
21. Nextgrid ◴[] No.39448491{4}[source]
You'd need the whole stack to understand your data format in order to make speculative requests useful. It wouldn't surprise me if cloud providers do speculative reads, but there isn't much they can do to understand your data format, so chances are they're just reading a few extra blocks beyond where your OS read and hoping the next OS-initiated read falls there, so it can be serviced from the prefetched data. Because of full-disk encryption, the storage stack may not be privy to the actual data, so it couldn't make smarter, data-aware decisions even if it wanted to. That limits it to primitive readahead, or maybe statistics based on previously seen patterns (if a request for block X is often followed by block Y, it may prefetch Y the next time it sees block X accessed).
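The statistics-based prefetch described here can be sketched in a few lines (a purely illustrative toy; real block caches are far more involved):

```python
# Minimal sketch of pattern-based prefetch: remember which block tends to
# follow which, and predict the most frequently observed successor. This is
# an illustration of the idea in the comment, not any provider's actual code.
from collections import Counter, defaultdict

class NextBlockPredictor:
    def __init__(self):
        self.followers = defaultdict(Counter)  # block -> Counter of successors
        self.last = None

    def record(self, block: int) -> None:
        """Observe an access; count it as a successor of the previous block."""
        if self.last is not None:
            self.followers[self.last][block] += 1
        self.last = block

    def predict(self, block: int):
        """Most frequently seen successor of `block`, or None if unknown."""
        successors = self.followers.get(block)
        if not successors:
            return None
        return successors.most_common(1)[0][0]

predictor = NextBlockPredictor()
for b in [7, 9, 7, 9, 7, 4]:
    predictor.record(b)
print(predictor.predict(7))  # 9 followed 7 twice, 4 once -> predict 9
```

Note this works entirely on block numbers, which is exactly why it survives full-disk encryption: no knowledge of the data's contents is needed, only the access pattern.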

A problem in applications such as databases is when the outcome of an IO operation is required to initiate the next one - for example, you must first read an index to know the on-disk location of the actual row data. This is where the higher latency absolutely tanks performance.

A solution could be to make the storage drives smarter: have an NVMe command that says something like "search this range for this byte pattern" and another that says "use the outcome of the previous command as the start address and read N bytes from there". This could help speed up the aforementioned scenario (effectively the drive does the index scan and row retrieval for you), but it would require cooperation between the application, the filesystem, and the encryption system (typical current FDE would break this).
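The latency saving from such offload can be estimated with simple arithmetic (round-trip and device times are assumptions for illustration):

```python
# Sketch of the dependency chain described above (all numbers assumed):
# an index read must finish before the row read can start, so remote latency
# is paid once per hop; an offloaded "search then read" command pays it once.

RTT_S = 1e-3        # assumed network round trip to remote storage
DEVICE_S = 100e-6   # assumed on-device access time per operation

def dependent_reads(n_hops: int, rtt_s: float = RTT_S, device_s: float = DEVICE_S) -> float:
    """Total latency when each read depends on the previous result."""
    return n_hops * (rtt_s + device_s)

def offloaded(n_hops: int, rtt_s: float = RTT_S, device_s: float = DEVICE_S) -> float:
    """One round trip if the drive chases the pointer chain itself."""
    return rtt_s + n_hops * device_s

print(f"host-driven (index then row): {dependent_reads(2) * 1000:.1f} ms")  # 2.2 ms
print(f"drive-offloaded:              {offloaded(2) * 1000:.1f} ms")        # 1.2 ms
```

With these numbers, offloading a two-hop index-then-row lookup nearly halves the latency, and the gap widens with every additional dependent hop (think B-tree levels).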

22. redwood ◴[] No.39505345[source]
Your comment implies that a network hop between two VPCs is inherently slow. My understanding is that VPCs are akin to a network encryption/isolation boundary, but they shouldn't meaningfully slow down transfers between each other.

I think you may be conflating this with the fact that across two VPCs you're slightly more likely to be doing a cross-availability-zone, or potentially even cross-region, network hop. I just think it's important to be clear on what's really going on here.