    SSDs have become fast, except in the cloud

    (databasearchitects.blogspot.com)
    589 points by greghn | 26 comments
    pclmulqdq ◴[] No.39443994[source]
    This was a huge technical problem I worked on at Google, and it is sort of fundamental to the cloud. I believe this is actually a big deal that drives people's technology directions.

    SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because hard drives are fundamentally slow compared to networks; but it is a problem for SSDs.
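
    To put rough numbers on why the network hurts SSDs so much more than it hurt disks, here is a back-of-envelope sketch (the latency figures are assumptions for illustration, not measurements):

```python
# Back-of-envelope latency budget (illustrative numbers, not measurements).
# A local NVMe read is on the order of 100 microseconds; a spinning-disk
# seek is on the order of 10 milliseconds; a datacenter network round trip
# plus the remote storage stack is assumed here to add ~300 microseconds.

LOCAL_HDD_US = 10_000      # ~10 ms seek + rotation
LOCAL_NVME_US = 100        # ~100 us 4K random read
NETWORK_OVERHEAD_US = 300  # assumed RTT + remote storage-stack cost

def slowdown(local_us: int, overhead_us: int) -> float:
    """How much slower a remote read is than the same read done locally."""
    return (local_us + overhead_us) / local_us

print(f"HDD over the network:  {slowdown(LOCAL_HDD_US, NETWORK_OVERHEAD_US):.2f}x slower")
print(f"NVMe over the network: {slowdown(LOCAL_NVME_US, NETWORK_OVERHEAD_US):.2f}x slower")
# HDD: ~1.03x (lost in the noise). NVMe: ~4x. The same network that was
# effectively free for disks dominates once the medium itself is fast.
```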

    replies(30): >>39444009 #>>39444024 #>>39444028 #>>39444046 #>>39444062 #>>39444085 #>>39444096 #>>39444099 #>>39444120 #>>39444138 #>>39444328 #>>39444374 #>>39444396 #>>39444429 #>>39444655 #>>39444952 #>>39445035 #>>39445917 #>>39446161 #>>39446248 #>>39447169 #>>39447467 #>>39449080 #>>39449287 #>>39449377 #>>39449994 #>>39450169 #>>39450172 #>>39451330 #>>39466088 #
    1. vlovich123 ◴[] No.39444024[source]
    Why do they fundamentally need to be network attached storage instead of local to the VM?
    replies(5): >>39444042 #>>39444055 #>>39444065 #>>39444132 #>>39444197 #
    2. Filligree ◴[] No.39444042[source]
    They don't. Some cloud providers (e.g. Hetzner) let you rent VMs with locally attached NVMe, which is dramatically faster than network-attached storage even after factoring in the VM tax.

    Of course, you then have a single point of failure: the PCIe fabric of the machine you're running on, if not the NVMe drive itself. But if you have good backups, which you should, the juice really isn't worth the squeeze for network-attached storage.

    replies(1): >>39444103 #
    3. Retric ◴[] No.39444055[source]
    Redundancy: local storage is a single point of failure.

    You can use local SSDs as slow RAM, but anything on them can go away at any moment.
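
    A minimal sketch of that "slow RAM" pattern, treating local SSD as a lossable cache in front of a durable store (the DurableStore object and the cache path are hypothetical, not anything a specific provider gives you):

```python
# Minimal sketch: instance-local SSD as a lossable cache in front of a
# durable store. Everything here (paths, the durable_store object) is
# hypothetical and only illustrates the pattern.
import os

class LocalSSDCache:
    def __init__(self, durable_store, cache_dir="/mnt/local-nvme/cache"):
        self.durable = durable_store          # e.g. object storage or a network volume
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def get(self, key: str) -> bytes:
        path = os.path.join(self.cache_dir, key)
        try:
            with open(path, "rb") as f:       # fast path: local NVMe
                return f.read()
        except FileNotFoundError:
            data = self.durable.get(key)      # slow path: durable, network-attached
            with open(path, "wb") as f:       # repopulate; losing this copy is harmless
                f.write(data)
            return data

    def put(self, key: str, data: bytes) -> None:
        self.durable.put(key, data)           # durability comes from the backing store
        with open(os.path.join(self.cache_dir, key), "wb") as f:
            f.write(data)                     # local copy is only an accelerator
```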

    replies(1): >>39444944 #
    4. pclmulqdq ◴[] No.39444065[source]
    Reliability. SSDs break and screw up a lot more often, and a lot sooner, than CPUs do. Amazon has published a lot on the architecture of EBS, and they go through a good analysis of this. If you have a broken disk and it's locally attached, you have a broken machine.

    RAID helps you locally, but fundamentally relies on locality and low latency (and maybe custom hardware) to minimize the time window where you get true data corruption on a bad disk. That is insufficient for cloud storage.

    replies(1): >>39450096 #
    5. ssl-3 ◴[] No.39444103[source]
    A network adds more points of failure. It does not reduce them.
    replies(2): >>39444177 #>>39444445 #
    6. SteveNuts ◴[] No.39444132[source]
    Because even if you can squeeze 100TB or more of SSD/NVMe in a server, and there are 10 tenants using the machine, you're limited to 10TB as a hard ceiling.

    What happens when one tenant needs 200TB attached to a server?

    Cloud providers are starting to offer local SSD/NVMe, but you're renting the entire machine, and you're still limited to exactly what's installed in that server.

    replies(3): >>39444256 #>>39444774 #>>39446160 #
    7. supriyo-biswas ◴[] No.39444177{3}[source]
    Network-attached, replicated storage hedges against data loss but increases latency; however, most customers prefer higher latency to data loss. As an example, see the highly upvoted fly.io thread[1] with customers complaining about exactly this.

    [1] https://news.ycombinator.com/item?id=36808296
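
    A rough sketch of where that extra latency comes from: a replicated write can only be acknowledged once a quorum of replicas has it, so the slower replicas set the pace (all numbers below are assumed for illustration):

```python
# Rough sketch (assumed numbers): a replicated write is only as fast as the
# slowest replica in the quorum it must reach before acknowledging.
local_write_us = 50                      # assumed local NVMe write
replica_rtts_us = [250, 300, 900]        # assumed round trips to three replicas

quorum = 2                               # ack after 2 of 3 replicas persist the write
quorum_latency_us = sorted(replica_rtts_us)[quorum - 1] + local_write_us

print(f"unreplicated local write:  ~{local_write_us} us")
print(f"quorum-acknowledged write: ~{quorum_latency_us} us")
# ~50 us vs ~350 us: the replication that protects against data loss is
# exactly what pushes write latency up.
```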

    replies(1): >>39444649 #
    8. drewda ◴[] No.39444197[source]
    The major clouds do offer VMs with fast local storage, such as SSDs attached via NVMe directly to the VM host machine:

    - https://cloud.google.com/compute/docs/disks/local-ssd

    - https://learn.microsoft.com/en-us/azure/virtual-machines/ena...

    - https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-inst...

    They sell these VMs at a higher cost because they require more expensive components and are limited to host machines with certain configurations. In our experience, it's also harder to get quota increases for these VMs -- some of the public clouds have a limited supply of these specific configurations in some regions/zones.

    As others have noted, instance storage isn't as dependable. But it can be the most performant way to do IO-intense processing or to power one node of a distributed database.

    9. jalk ◴[] No.39444256[source]
    How is that different from how cores, memory, and network bandwidth are allotted to tenants?
    replies(2): >>39444337 #>>39444566 #
    10. pixl97 ◴[] No.39444337{3}[source]
    Because a fair number of customers spin up another image when cores/mem/bandwidth run low. Dedicated storage breaks that paradigm.

    Also: if I am on an 8-core machine and need 16 cores, network storage can be detached from host A and attached to host B (as sketched below). With dedicated local storage, the data must be fully copied over first.
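
    A sketch of that detach/reattach flow for an EBS-style volume using boto3 (the volume and instance IDs are placeholders, and error handling is omitted):

```python
# Sketch of the "detach from host A, attach to host B" flow for an EBS
# volume via boto3. IDs and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.detach_volume(VolumeId="vol-0123456789abcdef0")             # detach from host A
ec2.get_waiter("volume_available").wait(VolumeIds=["vol-0123456789abcdef0"])

ec2.attach_volume(                                              # attach to host B
    VolumeId="vol-0123456789abcdef0",
    InstanceId="i-0fedcba9876543210",
    Device="/dev/sdf",
)
# The data never moves; only the attachment does. With instance-local NVMe
# the equivalent step is a full copy of the dataset to the new host.
```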

    11. crazygringo ◴[] No.39444445{3}[source]
    A network adds more points of failure, but it also reduces user-facing failures overall when properly architected.

    If one CPU attached to storage dies, another can take over and reattach -- or vice-versa. If one network link dies, it can be rerouted around.

    replies(1): >>39444568 #
    12. baq ◴[] No.39444566{3}[source]
    It isn't. You could ask for network-attached CPUs or RAM. You'd be the only one, though, so in practice only network-attached storage makes sense business-wise. It also makes sense if you need to provision larger-than-usual amounts like tens of TB - these are usually hard to come by in a single server, but quite mundane for storage appliances.
    13. bombcar ◴[] No.39444568{4}[source]
    Using a SAN (which is what networked storage is, after all) also lets you get various "tricks" such as snapshots, instant migration, etc for "free".
    14. ssl-3 ◴[] No.39444649{4}[source]
    Locally-attached, replicated storage also hedges against data loss.
    replies(1): >>39444814 #
    15. vel0city ◴[] No.39444774[source]
    Given AWS and GCP offer multiple sizes for the same processor version with local SSDs, I don't think you have to rent the entire machine.

    Search for i3en API names and you'll see:

    i3en.large, 2x CPU, 1250GB SSD

    i3en.xlarge, 4x CPU, 2500GB SSD

    i3en.2xlarge, 8x CPU, 2x2500GB SSD

    i3en.3xlarge, 12x CPU, 7500GB SSD

    i3en.6xlarge, 24x CPU, 2x7500GB SSD

    i3en.12xlarge, 48x CPU, 4x7500GB SSD

    i3en.24xlarge, 96x CPU, 8x7500GB SSD

    i3en.metal, 96x CPU, 8x7500GB SSD

    So they've got servers with 96 CPUs and 8x7500GB SSDs. You can get a slice of one, or you can get the whole thing. All of these sizes work out to the same ratio: 625GB of local SSD per CPU core.

    https://instances.vantage.sh/

    On GCP you can get a 2-core N2 instance type and attach multiple local SSDs. I doubt they have many physical 2-core Xeons in their datacenters.
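
    A quick check of that 625GB-per-vCPU claim over the sizes listed above (figures taken directly from the comment):

```python
# Quick check of the SSD-per-vCPU ratio for the i3en sizes listed above.
i3en = {
    "i3en.large":    (2,  1 * 1250),
    "i3en.xlarge":   (4,  1 * 2500),
    "i3en.2xlarge":  (8,  2 * 2500),
    "i3en.3xlarge":  (12, 1 * 7500),
    "i3en.6xlarge":  (24, 2 * 7500),
    "i3en.12xlarge": (48, 4 * 7500),
    "i3en.24xlarge": (96, 8 * 7500),
    "i3en.metal":    (96, 8 * 7500),
}

for name, (vcpus, ssd_gb) in i3en.items():
    print(f"{name:>14}: {ssd_gb / vcpus:.0f} GB of local SSD per vCPU")
# Every size comes out to 625 GB/vCPU, i.e. each slice gets a proportional
# share of the host's local drives rather than requiring the whole machine.
```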

    16. supriyo-biswas ◴[] No.39444814{5}[source]
    RAID rebuild times make it an unviable option, and customers typically expect problematic VMs to be live-migrated to other hosts with the disks still holding their intended data.

    The self-hosted versions of this are GlusterFS and Ceph, which have the same dynamics as EBS and its equivalents at other cloud providers.
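
    A back-of-envelope sense of why rebuild times are the problem (the throughput figure is an assumption):

```python
# Back-of-envelope RAID rebuild time (assumed throughput figure).
# A rebuild has to rewrite the whole replacement drive, so the floor is
# capacity / sustained write rate, and that's with no competing traffic.
drive_tb = 7.5
rebuild_mb_per_s = 500        # assumed sustained rate while still serving load

hours = (drive_tb * 1_000_000) / rebuild_mb_per_s / 3600
print(f"~{hours:.1f} hours to rebuild a {drive_tb} TB drive at {rebuild_mb_per_s} MB/s")
# ~4.2 hours of degraded redundancy and degraded IOPS per failed drive,
# which is why live migration to a healthy host is the preferred fix.
```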

    replies(1): >>39445429 #
    17. cduzz ◴[] No.39444944[source]
    I've seen SANs get nuked by operator error or by environmental issues (overheated DC == SAN shuts itself down).

    Distributed clusters of things can work just fine on ephemeral, instance-local storage. A Kafka cluster or an OpenSearch cluster, for instance, will be fine using instance-local storage (see the sketch below).

    As with everything else.... "it depends"
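
    A minimal sketch of what "fine on instance-local storage" looks like for the Kafka case mentioned above: durability comes from replication across brokers rather than from any one broker's disks (broker addresses are placeholders):

```python
# Sketch: a Kafka topic configured so that losing any single broker's
# local disks doesn't lose data. Broker addresses are placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092"})

topic = NewTopic(
    "events",
    num_partitions=12,
    replication_factor=3,                   # each partition lives on 3 brokers
    config={"min.insync.replicas": "2"},    # acks=all writes need 2 live copies
)
admin.create_topics([topic])
# Each broker can sit on fast ephemeral NVMe; if one dies, it is replaced
# and re-replicates from the surviving copies.
```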

    replies(1): >>39445679 #
    18. mike_hearn ◴[] No.39445429{6}[source]
    With NVMe SSDs? What makes RAID unviable in that environment?
    replies(1): >>39446043 #
    19. Retric ◴[] No.39445679{3}[source]
    Sure, but distributed clusters bring you right back to network/workload limitations.
    replies(1): >>39454543 #
    20. dijit ◴[] No.39446043{7}[source]
    This depends, like all things.

    When you say RAID, what level? Software RAID or hardware RAID? What controller?

    Let's take best-case:

    RAID 10, with drives that are small enough (but numerous) and a volume-manager/software RAID like ZFS, which is data-aware and so only rebuilds actual data: even then, rebuilds can degrade performance enough that your application becomes unavailable if your IOPS are already at 70%+ of maximum.

    That's the ideal scenario. If you use hardware RAID, which is not data-aware, then your rebuild time depends entirely on the size of the drive being rebuilt, and the rebuild can punish IOPS even more, though it will affect your CPU less.

    There's no panacea. Most people opt for higher-latency distributed storage where the redundancy is spread across an enormous number of drives, which makes rebuilds much less painful.
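
    A rough illustration of that last point, comparing how much each surviving drive has to contribute to a rebuild locally versus in a wide distributed store (all numbers assumed):

```python
# Rough sketch of why wide, distributed replication makes rebuilds cheap
# per drive (all numbers assumed for illustration).
failed_drive_tb = 7.5

# Local RAID: a handful of surviving drives must source the whole rebuild.
local_source_drives = 3
per_drive_local_tb = failed_drive_tb / local_source_drives

# Distributed store (Ceph/EBS-style): the lost replicas are scattered, so
# hundreds of drives each re-replicate a small slice in parallel.
cluster_drives = 400
per_drive_cluster_gb = failed_drive_tb * 1000 / cluster_drives

print(f"local RAID:  ~{per_drive_local_tb:.1f} TB read from each surviving drive")
print(f"distributed: ~{per_drive_cluster_gb:.0f} GB read from each drive in the cluster")
# ~2.5 TB per drive vs ~19 GB per drive: the rebuild barely dents per-drive IOPS.
```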

    replies(1): >>39450519 #
    21. taneq ◴[] No.39446160[source]
    > What happens when one tenant needs 200TB attached to a server?

    Link to this mythical hosting service that expects far less than 200TB of data per client but just pulls a sad face and takes the extra cost on board when a client demands it. :D

    22. vlovich123 ◴[] No.39450096[source]
    Sure, but there's plenty of software written to use distributed, unreliable storage, similar to how cloud providers write their own software (e.g. Kafka). I can understand that many applications just need something like EBS that's durable but looks like a normal disk, but I'm not so sure it's a fundamentally required abstraction.
    23. timc3 ◴[] No.39450519{8}[source]
    What I used to do was swap machines over from the one with failing disks to a live spare ("slave" in the old, now frowned-upon terminology), do the maintenance, and then replicate from the now-live spare back once I had confidence it was all good.

    Yes, it's costly having the hardware to do that, as it mostly meant multiple machines: I always wanted to be able to rebuild one whilst having at least two machines online.

    replies(1): >>39450939 #
    24. dijit ◴[] No.39450939{9}[source]
    If you are doing this with your own hardware it is still less costly than cloud even if it mostly sits idle.

    Cloud is approximately 5x the sticker cost for compute if it's sustained.

    Your discounts may vary; rue the day those discounts are taken away, because by then we are all sufficiently locked in.

    25. cduzz ◴[] No.39454543{4}[source]
    These days it's likely that your SAN is actually just a cluster of commodity hardware where the disks/SSDs have custom firmware and some fancy block shoveling software.
    replies(1): >>39455183 #
    26. ◴[] No.39455183{5}[source]