
SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points greghn | 5 comments
pclmulqdq No.39443994
This was a huge technical problem I worked on at Google, and it's sort of fundamental to the cloud. I believe this is actually a big deal that drives people's technology directions.

SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because hard drives are fundamentally slow compared to networks; it is a problem for SSDs.
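The size of that gap is easy to see with back-of-envelope arithmetic. A minimal sketch, where the latency figures (~100µs for a local NVMe read, ~500µs for a round trip to a network storage service) are illustrative assumptions, not measurements:

```python
# Back-of-envelope comparison of local vs. network-attached SSD reads.
# All latency figures below are rough, illustrative assumptions.

LOCAL_NVME_READ_US = 100   # assumed local NVMe 4 KiB read latency
NETWORK_RTT_US = 500       # assumed datacenter round trip to the storage service

def remote_read_us(rtts: int = 1) -> float:
    """Latency of one remote read: media time plus network round trips."""
    return LOCAL_NVME_READ_US + rtts * NETWORK_RTT_US

# Serial (queue depth 1) IOPS implied by each latency.
local_iops_per_queue = 1_000_000 / LOCAL_NVME_READ_US
remote_iops_per_queue = 1_000_000 / remote_read_us()

print(f"local:  {local_iops_per_queue:.0f} serial IOPS per queue")
print(f"remote: {remote_iops_per_queue:.0f} serial IOPS per queue")
```

Under these assumptions a single round trip already costs several times the media latency, which is why the network dominates for SSDs but never did for ~10ms hard-drive seeks.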

vlovich123 No.39444024
Why do they fundamentally need to be network attached storage instead of local to the VM?
Filligree No.39444042
They don't. Some cloud providers (e.g. Hetzner) let you rent VMs with locally attached NVMe, which is dramatically faster than network-attached storage even after factoring in the VM tax.

Of course, then you have a single point of failure: the PCIe fabric of the machine you're running on, if not the NVMe itself. But if you have good backups, which you should, then the juice really isn't worth the squeeze for network-attached storage.

ssl-3 No.39444103
A network adds more points of failure. It does not reduce them.
supriyo-biswas No.39444177
Network-attached, replicated storage hedges against data loss but increases latency; most customers prefer higher latency over data loss. As an example, see the highly upvoted fly.io thread[1] with customers complaining about exactly this.

[1] https://news.ycombinator.com/item?id=36808296

ssl-3 No.39444649
Locally-attached, replicated storage also hedges against data loss.
supriyo-biswas No.39444814
RAID rebuild times make it an unviable option, and customers typically expect problematic VMs to be live-migrated to other hosts with the disks still holding their intended data.

The self-hosted version of this is GlusterFS or Ceph, which have the same dynamics as EBS and its equivalents at other cloud providers.

mike_hearn No.39445429
With NVMe SSDs? What makes RAID unviable in that environment?
dijit No.39446043
This depends, like all things.

When you say RAID: what level? Software RAID or hardware RAID? Which controller?

Let's take best-case:

RAID 10 across many small NVMe drives, with LVM or a data-aware software RAID like ZFS that only rebuilds actual data rather than the whole device. Even then, a rebuild can degrade performance enough that your application becomes unavailable if your IOPS are already at 70%+ of maximum.

That's the ideal scenario. If you use hardware RAID, which is not data-aware, then your rebuild time depends entirely on the size of the drive being rebuilt, and it can punish IOPS even more during the rebuild, though it will tax your CPU less.

There's no panacea. Most people opt for higher-latency distributed storage where the RAID is spread across an enormous number of drives, which makes rebuilds much less painful.
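The difference between a full-device rebuild and a data-aware one can be put into rough numbers. A back-of-envelope sketch, where the drive size, used capacity, and sustained rebuild rate are all illustrative assumptions:

```python
# Rough rebuild-time estimate. A non-data-aware (hardware) rebuild must copy
# the whole drive; a data-aware resilver (e.g. ZFS) copies only live data.
# Sizes and throughput are illustrative assumptions, not measurements.

def rebuild_hours(data_tb: float, rebuild_mb_s: float) -> float:
    """Hours to copy `data_tb` terabytes at a sustained rebuild rate."""
    return data_tb * 1_000_000 / rebuild_mb_s / 3600

# Hardware RAID: full copy of an assumed 8 TB NVMe, throttled to 500 MB/s
# so production I/O can keep running on the remaining drive bandwidth.
full_copy = rebuild_hours(8, 500)

# Data-aware rebuild: only the assumed 2 TB of live data is copied.
data_aware = rebuild_hours(2, 500)

print(f"full-device rebuild: {full_copy:.1f} h")   # ~4.4 h
print(f"data-aware rebuild:  {data_aware:.1f} h")  # ~1.1 h
```

Either way the rebuild is stealing bandwidth from production I/O for hours, which is the window the comment above is worried about.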

timc3 No.39450519
What I used to do was swap over from the machine with failing disks to a live spare ("slave" in the old, now frowned-upon terminology), do the maintenance, and then replicate from the now-live spare back once I had confidence it was all good.

Yes, it's costly having the hardware to do that, as it mostly meant multiple machines: I always wanted to be able to rebuild one whilst having at least two machines online.

dijit No.39450939
If you are doing this with your own hardware, it is still less costly than cloud, even if it mostly sits idle.

Cloud is approximately 5x the sticker cost for compute if it's sustained.

Your discounts may vary; rue the day those discounts are taken away, because we are all sufficiently locked in.
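For a sense of where a multiple like that comes from, here is a toy comparison at sustained 24/7 utilization; the prices are made-up round numbers, not quotes from any provider:

```python
# Toy cost comparison at sustained (24/7) utilization.
# Both prices are hypothetical, illustrative numbers.

dedicated_monthly = 200.0   # assumed dedicated server, flat monthly price
cloud_hourly = 1.40         # assumed comparable on-demand cloud VM
hours_per_month = 730       # average hours in a month

cloud_monthly = cloud_hourly * hours_per_month
multiple = cloud_monthly / dedicated_monthly
print(f"on-demand cloud: ${cloud_monthly:.0f}/mo, ~{multiple:.1f}x the dedicated box")
```

On-demand hourly pricing only pays off when the machine is idle most of the time; run it flat out all month and the multiple over owned or rented hardware shows up immediately.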