
SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points greghn | 1 comments
c0l0 ◴[] No.39444187[source]
Seeing the really just puny "provisioned IOPS" numbers on hugely expensive cloud instances made me chuckle (first in disbelief, then in horror) when I joined a "cloud-first" enterprise shop in 2020 (having come from a company that hosted their own hardware at a colo).

It's no wonder that many people nowadays, esp. those who are so young that they've never experienced anything but cloud instances, seem to have little idea of how much performance you can actually pack in just one or two RUs today. Ultra-fast (I'm not parroting some marketing speak here - I just take a look at IOPS numbers, and compare them to those from highest-end storage some 10-12 years ago) NVMe storage is a big part of that astonishing magic.

replies(3): >>39448208 #>>39448367 #>>39449930 #
jauntywundrkind ◴[] No.39448367[source]
NVMe has been ridiculously great. I'm excited to see what happens to prices as the E1 form factor ramps up! Physically much bigger drives allow for consolidation of parts and a higher ratio of flash chips to everything else, which seems promising. It's more a value line, but Intel's P5315 is 15TB at a quite low $0.9/GB.

It might not help much with IOPS though. Amazing that we have PCIe 5.0 16GB/s and already are so near theoretical max (some lost to overhead), even on consumer cards.

Going enterprise for the drive-writes-per-day (DWPD) is 100% worth it for most folks, but I am morbidly curious how different the performance profile would be running enterprise vs non-enterprise these days. Conversely, the high-DWPD drives (the Kioxia CD8P-V, for example, has a DWPD of 3) often seem to come with somewhat milder sustained 4k write IOPS, making me think maybe there's a speed vs reliability tradeoff that could be taken advantage of with consumer drives in some cases; not sure who wants tons of IOPS but doesn't actually intend to hit their Total Drive Writes, but it could save you some IOPS/$ if so. That said, I'm shocked to see the enterprise premium is a lot less absurd than it used to be! (If you can find stock.)

replies(1): >>39449159 #
bcaxis ◴[] No.39449159[source]
The main problem with consumer drives is the missing power loss protection (PLP). M.2 drives just don't have space for the caps like an enterprise 2.5" U.2/U.3 drive will have.

This matters when the DB calls a sync and it's expecting the data to be written safely to disk before it returns.

A consumer drive basically stops everything until it can report success, and your IOPS fall to like 1/100th of what the drive is capable of if it's happening a lot.

An enterprise drive with PLP will just report success, knowing it has the power to finish the pending writes. Full speed ahead.
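A rough way to see the sync penalty described above on your own hardware: the sketch below writes small blocks to a file, once with plain buffered writes and once with an `os.fsync()` after every write, which is roughly what a database does on commit. The function name `time_writes`, the block size, and the write count are all illustrative choices, not anything from a real benchmark tool, and the numbers you get depend heavily on filesystem, drive, and cache settings.

```python
import os
import tempfile
import time


def time_writes(path, count=200, block_size=4096, sync=False):
    """Write `count` blocks to `path` and return the elapsed seconds.

    With sync=True, os.fsync() is called after every write, so each
    write has to be reported durable before the next one starts.
    """
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            if sync:
                os.fsync(fd)  # wait for the device to acknowledge the data
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "fsync_probe.bin")
        buffered = time_writes(target, sync=False)
        synced = time_writes(target, sync=True)
        print(f"buffered:   {200 / buffered:,.0f} writes/s")
        print(f"fsync each: {200 / synced:,.0f} writes/s")
```

Point `target` at a file on the drive you actually care about; the default temp directory is often a RAM-backed tmpfs, where fsync costs almost nothing and the two numbers look the same. For serious measurements a dedicated tool such as fio is a better fit than this sketch.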

You can "lie" to the process at the VPS level by enabling unsafe write back cache. You can do it at the OS level by launching the DB with "eatmydata". You will get the full performance of your SSD.

In the event of power loss you may well end up in an unrecoverable corrupted condition with these enabled.

I believe that if you buy all consumer parts, the drive is the one place where spending up for an enterprise part really pays off.

replies(2): >>39449317 #>>39454914 #
namibj ◴[] No.39454914[source]
Sadly the solution, a firmware variant with ZNS instead of the normal random-write block device, just isn't on offer (please tell me if I'm wrong; I'd love one!). Because with ZNS you can get away with tiny caps, large enough to complete the in-flight blocks (not the buffered ones, just those that are already at the flash chip itself), plus one metadata journal/ring buffer page to store write pointers and zone status for all zones touched since the last metadata write happened. Given that this should take about 100 μs, I don't see unannounced power loss as really that problematic to deal with.

In theory the ATX PSU reports imminent power loss with a mandatory notice of no less than 1ms; this would easily be enough to finish in-flight writes and record the zone state.
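To make the bookkeeping concrete, here is a toy model of the idea: track a write pointer per zone, remember which zones were touched since the last metadata write, and on power-loss notice emit a single journal record for just those zones. Everything here (`ZoneTracker`, `fits_in_notice`, the zone size) is a hypothetical name for illustration; it is not a real ZNS or NVMe interface, and the 100 μs and 1 ms figures are simply the estimates quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class ZoneTracker:
    """Toy model of per-zone write pointers plus a dirty-zone set."""

    zone_blocks: int  # capacity of each zone, in blocks (assumed uniform)
    write_pointers: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)

    def append(self, zone, blocks):
        """Append `blocks` to `zone`; return the offset where they landed."""
        wp = self.write_pointers.get(zone, 0)
        if wp + blocks > self.zone_blocks:
            raise ValueError("zone full")
        self.write_pointers[zone] = wp + blocks
        self.dirty.add(zone)
        return wp

    def flush_metadata(self):
        """Return the journal record for zones touched since the last flush."""
        record = {zone: self.write_pointers[zone] for zone in sorted(self.dirty)}
        self.dirty.clear()
        return record


def fits_in_notice(flush_us, notice_us=1000.0):
    """True if the metadata flush fits inside the power-loss notice window."""
    return flush_us <= notice_us


if __name__ == "__main__":
    tracker = ZoneTracker(zone_blocks=1024)
    tracker.append(0, 8)
    tracker.append(0, 4)
    tracker.append(3, 16)
    print(tracker.flush_metadata())  # {0: 12, 3: 16}
    print(fits_in_notice(100.0))     # True: 100 us against a 1 ms notice
```

The point of the sketch is only that the state to save is small: one pointer per recently touched zone, with about a 10x margin if the 100 μs estimate and the 1 ms notice both hold.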

replies(1): >>39475092 #
1. jauntywundrkind ◴[] No.39475092[source]
It's fucking wild to me the trajectory here:

Open-Channel SSDs: stop making such absurd totalizing systems that hide the actual flash! We don't need your fancy controllers! Just give us a bunch of flash that we can individually talk to & control ourselves from the host! You leave so much performance on the table! There are so many access patterns that wouldn't suck & wouldn't involve random rewrites of blocks!

Then after years of squabbling, eventually: ok, we have negotiated for years. We will build new standards to allow these better access patterns that require nearly no drive controller intermediation, that have exponentially higher degrees of mechanical sympathy with what flash actually does. Then we will never release drives to the mainstream & only have some hard-to-get, unbelievably expensive enterprise drives that support it.

You are so f-ing spot on. This ZNS would be perfect for lower reliability consumer drives. But: market segmentation.

The situation is so fucked. We are so impeded from excelling, as a civilization, case number nine billion three hundred forty two.