
SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points by greghn | 27 comments
1. c0l0 ◴[] No.39444187[source]
Seeing the frankly puny "provisioned IOPS" numbers on hugely expensive cloud instances made me chuckle (first in disbelief, then in horror) when I joined a "cloud-first" enterprise shop in 2020 (having come from a company that hosted its own hardware at a colo).

It's no wonder that many people nowadays, esp. those who are so young that they've never experienced anything but cloud instances, seem to have little idea of how much performance you can actually pack in just one or two RUs today. Ultra-fast (I'm not parroting some marketing speak here - I just take a look at IOPS numbers, and compare them to those from highest-end storage some 10-12 years ago) NVMe storage is a big part of that astonishing magic.

replies(3): >>39448208 #>>39448367 #>>39449930 #
2. Aurornis ◴[] No.39448208[source]
> It's no wonder that many people nowadays, esp. those who are so young that they've never experienced anything but cloud instances, seem to have little idea of how much performance you can actually pack in just one or two RUs today.

On the contrary, young people often show up having learned on their super fast Apple SSD or a top of the line gaming machine with NVMe SSD.

Many know what hardware can do. There’s no need to dunk on young people.

Anyway, the cloud performance realities are well known to anyone who works in cloud performance. It’s part of the game and it’s learned by anyone scaling a system. It doesn’t really matter what you could do if you built a couple of RUs yourself and hauled them down to the data center, because beyond simple single-purpose applications with flexible uptime requirements, that’s not a realistic option.

replies(2): >>39448623 #>>39449212 #
3. jauntywundrkind ◴[] No.39448367[source]
NVMe has been ridiculously great. I'm excited to see what happens to prices as the E1 form factor ramps up! Much physically bigger drives allow for consolidation of parts, a higher ratio of flash chips to everything else, which seems promising. It's more of a value line, but Intel's P5315 is 15TB at a quite low $0.9/GB.

It might not help much with IOPS though. Amazing that we have PCIe 5.0 at 16GB/s and are already so near the theoretical max (some lost to overhead), even on consumer cards.

Going enterprise for the drive-writes-per-day (DWPD) is 100% worth it for most folks, but I am morbidly curious how different the performance profile would be running enterprise vs non these days. But reciprocally, the high-DWPD drives (Kioxia CD8P-V for example is a DWPD of 3) seem to often come with somewhat milder sustained 4K write IOPS, making me think maybe there's a speed vs reliability tradeoff that could be taken advantage of from consumer drives in some cases; not sure who wants tons of IOPS but doesn't actually intend to hit their total drive writes, but it saves you some IOPS/$ if so. That said, I'm shocked to see the enterprise premium is a lot less absurd than it used to be! (If you can find stock.)

replies(1): >>39449159 #
4. zten ◴[] No.39448623[source]
> On the contrary, young people often show up having learned on their super fast Apple SSD or a top of the line gaming machine with NVMe SSD.

Yes, this is often a big surprise. You can test out some disk-heavy app locally on your laptop and observe decent performance, and then have your day completely ruined when you provision a slice of an NVMe SSD instance type (like, i4i.2xlarge) and discover you're only paying for SATA SSD performance.
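
A quick way to make that gap concrete is to measure 4K random-read latency directly, first on the laptop and then on the instance. A minimal sketch, assuming a Linux host and a test file pre-filled with real data (the file name and sample count are placeholders; a sparse file won't work because reads from it never touch the device):

    import os, random, time

    PATH = "testfile.bin"   # hypothetical test file, pre-filled with data (ideally several GB)
    BLOCK = 4096            # 4K, the block size most database page workloads care about
    SAMPLES = 2000

    fd = os.open(PATH, os.O_RDONLY)
    size = os.fstat(fd).st_size
    lat = []
    for _ in range(SAMPLES):
        off = random.randrange(0, size - BLOCK) // BLOCK * BLOCK
        # Drop this range from the page cache so the read actually hits the device.
        os.posix_fadvise(fd, off, BLOCK, os.POSIX_FADV_DONTNEED)
        t0 = time.perf_counter()
        os.pread(fd, BLOCK, off)
        lat.append(time.perf_counter() - t0)
    os.close(fd)

    lat.sort()
    print(f"p50 {lat[len(lat)//2]*1e6:.0f} us, p99 {lat[int(len(lat)*0.99)]*1e6:.0f} us, "
          f"~{1/(sum(lat)/len(lat)):.0f} IOPS at queue depth 1")

Running the same loop on the laptop and on the provisioned instance is usually enough to ruin the day in the way described above.
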

replies(1): >>39450654 #
5. bcaxis ◴[] No.39449159[source]
The main problem with consumer drives is the missing power loss protection (PLP). M.2 drives just don't have space for the caps like an enterprise 2.5" U.2/U.3 drive will have.

This matters when the DB calls a sync and it's expecting the data to be written safely to disk before it returns.

A consumer drive basically stops everything until it can report success, and if that's happening a lot your IOPS fall to something like 1/100th of what the drive is capable of.

An enterprise drive with PLP will just report success knowing it has the power to finish the pending writes. Full speed ahead.
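
The effect is easy to measure: time a loop of small writes where each one is followed by fsync(), which is roughly what a database commit path does. A minimal sketch, assuming a Linux box and a scratch file on the drive under test (the path is a placeholder):

    import os, time

    PATH = "/mnt/scratch/fsync_test.bin"   # hypothetical path on the drive under test
    BLOCK = b"\0" * 4096
    SAMPLES = 1000

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    t0 = time.perf_counter()
    for i in range(SAMPLES):
        os.pwrite(fd, BLOCK, i * len(BLOCK))
        os.fsync(fd)        # ask the drive to make this write durable before continuing
    elapsed = time.perf_counter() - t0
    os.close(fd)

    print(f"{SAMPLES / elapsed:.0f} fsync'd 4K writes/s")

A drive with PLP can acknowledge each fsync from its protected cache, so the number stays close to its rated write IOPS; a consumer drive has to actually flush, and the same loop can drop by an order of magnitude or two. (Tools like eatmydata, mentioned below, "win" this benchmark by turning the fsync into a no-op.)
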

You can "lie" to the process at the VPS level by enabling unsafe write back cache. You can do it at the OS level by launching the DB with "eatmydata". You will get the full performance of your SSD.

In the event of power loss you may well end up in an unrecoverable corrupted condition with these enabled.

I believe that if you're otherwise building with all consumer parts, the drive is the best place to spend your money on an enterprise bit.

replies(2): >>39449317 #>>39454914 #
6. EB66 ◴[] No.39449212[source]
> because beyond simple single-purpose applications with flexible uptime requirements, that’s not a realistic option.

I frequently hear this point expressed in cloud vs colo debates. The notion that you can't achieve high availability with simple colo deploys is just nonsense.

Two colo deploys in two geographically distinct datacenters, two active physical servers with identical builds (RAIDed drives, dual NICs, A+B power) in both datacenters, a third server racked up just sitting as a cold spare, pick your favorite container orchestration scheme, rig up your database replication, script the database failover activation process, add HAProxy (or use whatever built-in scheme your orchestration system offers), sprinkle in a cloud service for DNS load balancing/failover (Cloudflare or AWS Route 53), automate and store backups off-site and you're done.
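
For the "script the database failover activation process" step, the shape of that script is usually small. A hedged sketch, assuming PostgreSQL streaming replication (the comment doesn't name a database) with placeholder hosts and paths, meant to run on the standby:

    import socket, subprocess, time

    PRIMARY = ("db1.example.internal", 5432)             # hypothetical primary address
    REPLICA_DATA_DIR = "/var/lib/postgresql/16/main"     # hypothetical standby data directory
    CHECKS, TIMEOUT = 5, 3                               # require several consecutive failures

    def primary_is_down() -> bool:
        """Consider the primary down only after CHECKS consecutive failed TCP connects."""
        for _ in range(CHECKS):
            try:
                with socket.create_connection(PRIMARY, timeout=TIMEOUT):
                    return False                         # reachable: do nothing
            except OSError:
                time.sleep(1)
        return True

    if primary_is_down():
        # Promote the local standby to primary; HAProxy health checks (or a DNS update)
        # then steer traffic to it.
        subprocess.run(["pg_ctl", "promote", "-D", REPLICA_DATA_DIR], check=True)

In practice you'd also fence the old primary before promoting, which is the part that makes automated failover genuinely hard.
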

Yes it's a lot of work, but so is configuring a similar level of redundancy and high availability in AWS. I've done it both ways and I prefer the bare metal colo approach. With colo you get vastly more bang for your buck and when things go wrong, you have a greater ability to get hands on, understand exactly what's going on and fix it immediately.

replies(1): >>39452305 #
7. tumult ◴[] No.39449317{3}[source]
My experience lately is that consumer drives will also lie and use a cache, but then drop your data on the floor if the power is lost or there’s a kernel panic / BSOD. (Samsung and others.)
replies(1): >>39449891 #
8. bcaxis ◴[] No.39449891{4}[source]
There are rumors of that, but I've never actually seen it myself.
replies(3): >>39450500 #>>39451747 #>>39451836 #
9. dboreham ◴[] No.39449930[source]
Some of us are making a good living offboarding workloads from cloud onto bare metal with on-node NVMe storage.
replies(1): >>39451814 #
10. hypercube33 ◴[] No.39450500{5}[source]
The only thing I've ever seen is some cheap Samsung drives slowing to a crawl when their buffer fills, or those super old Intel SSDs that shrink to 8MB after a power loss due to some firmware bug.
11. seabrookmx ◴[] No.39450654{3}[source]
This doesn't stop at SSDs.

Spin up an E2 VM in Google Cloud and there's a good chance you'll get a nearly 9-year-old Broadwell-architecture chip running your workload!

replies(1): >>39477637 #
12. tumult ◴[] No.39451747{5}[source]
I can get it to happen easily. 970 EVO Plus. Write a text file and kill the power within 20 seconds or so, assuming not much other write activity. The file will be zeroes or garbage, or not present on the filesystem, after reboot.
replies(1): >>39452783 #
13. dijit ◴[] No.39451814[source]
Really? I'd like to do this as a job.

Are you hiring?

Cloud is great for prototyping or randomly elastic workloads, but it feels like people are pushing highly static workloads from on-prem to cloud. I'd love to be part of the change going the other way. Especially since the skills for doing so seem to have dried up completely.

14. dijit ◴[] No.39451836{5}[source]
Eh, I've definitely seen it.

I buy Samsung drives almost exclusively, if that makes any difference.

All that to say though: this is why things like journalling and write-ahead systems exist. OS design is mostly about working around physical (often physics related) limitations of hardware and one of those is what to do if you get caught in a situation where something is incomplete.

The prevailing methodology is to paper over it with some atomic actions. For example: Copy-on-Write or POSIX move semantics (rename(2)).
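
A minimal sketch of that rename(2) pattern, for concreteness (names are illustrative; it assumes the temp file lives on the same filesystem as the target, which rename requires):

    import os

    def atomic_replace(path: str, data: bytes) -> None:
        """Replace `path` so readers see either the old or the new contents, never a mix."""
        tmp = path + ".tmp"                        # must be on the same filesystem as `path`
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)                           # new data is durable before it becomes visible
        finally:
            os.close(fd)
        os.rename(tmp, path)                       # the atomic step (rename(2))
        dfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
        try:
            os.fsync(dfd)                          # make the rename itself durable
        finally:
            os.close(dfd)
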

Then some spiffy young dev comes along and turns off all of those guarantees and says they made something ultra fast (*cough*mongodb*cough*) then maybe claims those guarantees are somewhere up the stack instead. This is almost always a lie.

Also: Beware any database that only syncs to VFS.

15. joshstrange ◴[] No.39452305{3}[source]
I doubt you’ll find anyone who disagrees that colo is much cheaper and that it’s possible to have failover with little to no downtime. Same with higher performance on bare metal vs a public cloud. Or at least I’ve never thought differently.

The difference is setting up all of that and maintaining it/debugging when something goes wrong is not a small task IMHO.

For some companies with that experience in-house I can understand doing it all yourself. As a solo founder and an employee of a small company, we don’t have the bandwidth to do all of that without hiring 1+ more people, who are more expensive than the cloud costs.

If we were drive-speed-constrained and getting that speed just wasn’t possible, then maybe the math would shift further in favor of colo, but we aren’t. Also, upgrading the hardware our servers run on is fairly straightforward vs replacing a server on a rack or dealing with failing/older hardware.

16. c0l0 ◴[] No.39452783{6}[source]
This happens for you after you invoked an explicit sync() (et al.) before the power cut?
replies(1): >>39453372 #
17. tumult ◴[] No.39453372{7}[source]
Yep.
replies(1): >>39455489 #
18. namibj ◴[] No.39454914{3}[source]
Sadly the solution, a firmware variant with ZNS instead of the normal random-write block device, just isn't on offer (please tell me if I'm wrong; I'd love one!). Because with ZNS you can get away with tiny caps, large enough to complete the in-flight blocks (not the buffered ones, just those that are already at the flash chip itself), plus one metadata journal/ring-buffer page to store write pointers and zone status for all zones touched since the last metadata write happened. Given that this should take about 100 μs, I don't see unannounced power loss as all that problematic to deal with.

In theory the ATX PSU reports imminent power loss with a mandatory notice of no less than 1ms; this would easily be enough to finish in-flight writes and record the zone state.

replies(1): >>39475092 #
19. c0l0 ◴[] No.39455489{8}[source]
That is highly interesting and contrary to a number of reports I've read about the Samsung 970 EVO Plus Series (and experienced for myself) specifically! Can you share more details about your particular setup and methodology? (Specific model name/capacity, Firmware release, Kernel version, filesystem, mkfs and mount options, any relevant block layer funny business you are conciously setting would be of greatest interest.) Do you have more than one drive where this can happen?
replies(1): >>39456879 #
20. tumult ◴[] No.39456879{9}[source]
Yeah, it happens on two of the 970 EVO Plus models. One on the older revision, and one on the newer. (I think there are only two?) It happens on both Linux and Windows. Uhh, I'm not sure about the kernel versions. I don't remember what I had booted at the time. On Windows I've seen it happen as far back as 1607 and as recently as 21H2. I've also seen it happen on someone else's computer (laptop.)

It's really easy to reproduce (at least for me?) and I'm pretty sure anyone can do it if they try to on purpose.
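
For anyone who wants to try this, the test being described boils down to something like the following, run against a filesystem on the drive under test while the power is cut externally (the path is a placeholder, and this is a sketch of the kind of test described, not the poster's exact method):

    import os, time

    DIR = "/mnt/testdrive/canary"        # hypothetical directory on the drive under test
    os.makedirs(DIR, exist_ok=True)
    dirfd = os.open(DIR, os.O_DIRECTORY)

    i = 0
    while True:
        path = os.path.join(DIR, f"gen-{i:06d}.txt")
        with open(path, "wb") as f:
            f.write(f"generation {i}\n".encode())
            f.flush()
            os.fsync(f.fileno())         # the drive claims the data is durable once this returns
        os.fsync(dirfd)                  # and that the new directory entry is durable too
        print("durable:", path, flush=True)   # note the last path printed, then cut power
        i += 1
        time.sleep(0.5)

After reboot, every file reported as durable should exist with the right contents; files that are missing, zero-filled, or garbage are the failure mode being described.
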

21. jauntywundrkind ◴[] No.39475092{4}[source]
It's fucking wild to me the trajectory here:

OpenChannelIO: stop making such absurd totalizing systems that hide the actual flash! We don't need your fancy controllers! Just give us a bunch of flash that we can individually talk to & control ourselves from the host! You leave so much performance on the table! There's so many access patterns that wouldn't suck & would involve random rewrites of blocks!

Then after years of squabbling, eventually: ok we have negotiated for years. We will build new standards to allow these better access patterns that require nearly no drive controller intermediation, that have exponentially higher degrees of mechanistic sympathy of what flash actually does. Then we will never release drives mainstream & only have some hard to get unbelievably expensive enterprise drives that support it.

You are so f-ing spot on. This ZNS would be perfect for lower reliability consumer drives. But: market segmentation.

The situation is so fucked. We are so impeded from excelling, as a civilization, case number nine billion three hundred forty two.

22. bmicraft ◴[] No.39477637{4}[source]
What this tells me is that the price of running inefficient CPUs seemingly isn't nearly as high as I thought it would or should be (in terms of USD/kWh).
replies(1): >>39488811 #
23. seabrookmx ◴[] No.39488811{5}[source]
Well, they bill you for the instance, not for some unit of computation. I'd imagine many users of E2 instances don't realize that they could be getting much, much worse performance per vcore than if they picked a different instance type.

From Google's perspective, if the hardware is paid for, still reliable, and they can still make money on it, they can put new hardware in new racks rather than replacing the old hardware. This suggests Google's DCs aren't space-constrained, but I'm not surprised after looking at a few via satellite images!

replies(1): >>39503542 #
24. bmicraft ◴[] No.39503542{6}[source]
Well, not exactly. In my mind the price of running such old CPUs for the last (say, 4?) years would have been higher than buying new hardware plus its running costs. Those would definitely be considered opportunity costs that ought to be avoided.
replies(1): >>39516066 #
25. seabrookmx ◴[] No.39516066{7}[source]
> the price of running such old CPUs for the last (say, 4?) years would have been higher than buying new hardware plus its running costs

I don't think this is true, because the old chips don't use more power outright[1][2][3]. In fact in many cases new chips use more power due to the higher core density. The new chips are way more efficient because they do more work per watt, but like I said in my previous comment you aren't paying for a unit of work. The billing model for the cloud providers is that of a rental: you pay per minute for the instance.

There's complexity here, like being able to pack more "instances" (VMs) onto a physical host with the higher-core-count machines, but I don't think simply saying the new hardware is cheaper to run is clear cut.

[1]: https://cloud.google.com/compute/docs/cpu-platforms#intel_pr...

[2]: https://www.intel.com/content/www/us/en/products/sku/93792/i...

[3]: https://www.intel.com/content/www/us/en/products/sku/231746/...

replies(1): >>39539036 #
26. bmicraft ◴[] No.39539036{8}[source]
True, although they could very well do upgrades on the kinds of VPSes where they were already oversubscribed. If you're not paying for physical cores, I don't think that argument works.
replies(1): >>39567739 #
27. seabrookmx ◴[] No.39567739{9}[source]
Sure but my comment was about Google's E2 instances specifically, which are billed this way. For Cloud Run or the Google Services they host, I agree it would be odd for them to use old chips given the inefficiency.