
SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points by greghn | 51 comments
pclmulqdq ◴[] No.39443994[source]
This was a huge technical problem I worked on at Google, and it is sort of fundamental to the cloud. I believe this is actually a big deal that drives people's technology directions.

SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because they are fundamentally slow compared to networks, but it is a problem for SSDs.
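
A rough back-of-the-envelope sketch of that latency argument; all numbers below are illustrative assumptions, not measurements from any particular cloud:

    # Illustrative latency budget: local NVMe vs. network-attached flash.
    # All numbers are rough order-of-magnitude assumptions, not measurements.
    local_nvme_read_us = 100          # 4 KiB random read on a datacenter SSD
    network_rtt_us = 150              # round trip to a storage server in the same DC
    storage_service_overhead_us = 50  # protocol handling, queueing, replication

    remote_read_us = network_rtt_us + storage_service_overhead_us + local_nvme_read_us
    print(f"local SSD read   ~{local_nvme_read_us} us")
    print(f"network SSD read ~{remote_read_us} us "
          f"({remote_read_us / local_nvme_read_us:.1f}x slower)")

    # For hard drives (~10,000 us per seek) the same network hop was noise.
    hdd_seek_us = 10_000
    print(f"HDD seek         ~{hdd_seek_us} us -> the network was never the bottleneck")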

replies(30): >>39444009 #>>39444024 #>>39444028 #>>39444046 #>>39444062 #>>39444085 #>>39444096 #>>39444099 #>>39444120 #>>39444138 #>>39444328 #>>39444374 #>>39444396 #>>39444429 #>>39444655 #>>39444952 #>>39445035 #>>39445917 #>>39446161 #>>39446248 #>>39447169 #>>39447467 #>>39449080 #>>39449287 #>>39449377 #>>39449994 #>>39450169 #>>39450172 #>>39451330 #>>39466088 #
jsnell ◴[] No.39444096[source]
According to the submitted article, the numbers are from AWS instance types where the SSD is "physically attached" to the host, not about SSD-backed NAS solutions.

Also, the article isn't just about SSDs being no faster than a network. It's about SSDs being two orders of magnitude slower than datacenter networks.

replies(3): >>39444161 #>>39444353 #>>39448728 #
pclmulqdq ◴[] No.39444161[source]
It's because the "local" SSDs are not actually physically attached and there's a network protocol in the way.
replies(14): >>39444222 #>>39444248 #>>39444253 #>>39444261 #>>39444341 #>>39444352 #>>39444373 #>>39445175 #>>39446024 #>>39446163 #>>39446271 #>>39446742 #>>39446840 #>>39446893 #
1. jsnell ◴[] No.39444373[source]
I think you're wrong about that. AWS calls this class of storage "instance storage" [0], and defines it as:

> Many Amazon EC2 instances can also include storage from devices that are located inside the host computer, referred to as instance storage.

There might be some wiggle room in "physically attached", but there's none in "storage devices located inside the host computer". It's not some kind of AWS-only thing either. GCP has "local SSD disks"[1], which I'm going to claim are likewise local, not network-attached block storage. (Though the language isn't as explicit as for AWS.)

[0] https://aws.amazon.com/ec2/instance-types/

[1] https://cloud.google.com/compute/docs/disks#localssds

replies(5): >>39444464 #>>39445545 #>>39447509 #>>39449306 #>>39450882 #
2. 20after4 ◴[] No.39444464[source]
If the SSD is installed in the host server, doesn't that still allow for it to be shared among many instances running on said host? I can imagine that a compute node has just a handful of SSDs and many hundreds of instances sharing the I/O bandwidth.
replies(5): >>39444584 #>>39444763 #>>39444820 #>>39445938 #>>39446130 #
3. discodave ◴[] No.39444584[source]
If you have one of the metal instance types, then you get the whole host, e.g. i4i.metal:

https://aws.amazon.com/ec2/instance-types/i4i/

4. aeyes ◴[] No.39444763[source]
On AWS, yes. The older instances I am familiar with had 900GB drives, and they sliced that up into volumes of 600, 450, 300, 150, or 75GB depending on instance size.

But they also tell you how much IOPS you get: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/stora...

5. ownagefool ◴[] No.39444820[source]
PCI bus, etc too
6. pclmulqdq ◴[] No.39445545[source]
That's the abstraction they want you to work with, yes. That doesn't mean it's what is actually happening - at least not in the same way that you're thinking.

As a hint for you, I said "a network", not "the network." You can also look at public presentations about how Nitro works.

replies(4): >>39445944 #>>39446809 #>>39447308 #>>39447443 #
7. throwawaaarrgh ◴[] No.39445938[source]
Instance storage is not networked. That's why it's there.
8. jng ◴[] No.39445944[source]
Nitro "virtual NVMe" devices are mostly (only?) for EBS -- remote network storage, transparently managed, using a separate network backbone, and presented to the host as a regular local NVMe device. SSD drives in instances such as i4i, etc. are physically attached in a different way -- but unlike EBS, they are ephemeral: the content becomes unavailable when you stop the instance, and when you restart, you get a new "blank slate". Their performance is one order of magnitude faster than standard-level EBS, and the cost structure is completely different (many orders of magnitude more affordable than EBS volumes configured for comparable I/O performance).
replies(1): >>39454138 #
9. queuebert ◴[] No.39446130[source]
How do these machines manage the sharing of one local SSD across multiple VMs? Is there some wrapper around the I/O stack? Does it appear as a network share? Genuinely curious...
replies(4): >>39446222 #>>39446276 #>>39446886 #>>39447488 #
10. felixg3 ◴[] No.39446222{3}[source]
Probably NVMe namespaces [0]?

[0]: https://nvmexpress.org/resource/nvme-namespaces/
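
As a minimal sketch of the idea (not necessarily what any cloud actually does), carving one drive into per-guest namespaces with nvme-cli might look like this; the device path, sizes, and controller ID are hypothetical, and the script only prints the commands:

    # Sketch: split one NVMe SSD into equal namespaces, one per guest VM.
    # Assumes Linux with nvme-cli installed; /dev/nvme0 and sizes are hypothetical.
    GUESTS = 4
    TOTAL_BLOCKS = 1_757_812_500          # ~900 GB in 512-byte logical blocks
    BLOCKS_PER_NS = TOTAL_BLOCKS // GUESTS

    for ns_id in range(1, GUESTS + 1):
        # nsze/ncap are in logical blocks; flbas=0 picks the first LBA format
        print(f"nvme create-ns /dev/nvme0 --nsze={BLOCKS_PER_NS} "
              f"--ncap={BLOCKS_PER_NS} --flbas=0")
        # the controller ID (0 here) should come from `nvme id-ctrl /dev/nvme0`
        print(f"nvme attach-ns /dev/nvme0 --namespace-id={ns_id} --controllers=0")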

replies(1): >>39448056 #
11. dan-robertson ◴[] No.39446276{3}[source]
AWS have custom firmware for at least some of their SSDs, so could be that
12. jsnell ◴[] No.39446809[source]
I've linked to public documentation that is pretty clearly in conflict with what you said. There's no wiggle room in how AWS describes their service without it being false advertising. There's no "ah, but what if we define the entire building to be the host computer, then the networked SSDs really are inside the host computer" sleight of hand to pull off here.

You've provided cryptic hints and a suggestion to watch some unnamed presentation.

At this point I really think the burden of proof is on you.

replies(2): >>39449527 #>>39451140 #
13. magicalhippo ◴[] No.39446886{3}[source]
In, say, VirtualBox you can create a file-backed disk image on the physical disk and attach it to the VM so the VM sees it as an NVMe drive.

In my experience this is also orders of magnitude slower than true direct access, i.e. PCIe pass-through, as all access has to pass through the VM storage driver, and so could explain what is happening.

replies(1): >>39448073 #
14. dekhn ◴[] No.39447308[source]
it sounds like you're trying to say "PCI switch" without saying "PCI switch" (I worked at Google for over a decade, including hardware division).
replies(1): >>39449574 #
15. jasonwatkinspdx ◴[] No.39447443[source]
Both the documentation and Amazon employees are in here telling you that you're wrong. Can you resolve that contradiction or do you just want to act coy like you know some secret? The latter behavior is not productive.
replies(1): >>39450406 #
16. icedchai ◴[] No.39447488{3}[source]
With Linux and KVM/QEMU, you can map an entire physical disk, disk partition, or file to a block device in the VM. For my own VM hosts, I use LVM and map a logical volume to the VM. I assumed cloud providers did something conceptually similar, only much more sophisticated.
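
For illustration, a minimal sketch of that setup with plain QEMU/KVM, assuming a hypothetical logical volume /dev/vg0/guest0 and made-up memory/CPU sizes; the LV appears in the guest as a virtio block device:

    # Sketch: boot a KVM guest with an LVM logical volume as its disk.
    # /dev/vg0/guest0 is a hypothetical LV; the guest sees it as a virtio device.
    import subprocess

    cmd = [
        "qemu-system-x86_64",
        "-enable-kvm",
        "-m", "4096",
        "-smp", "4",
        # raw block device, host page cache bypassed, virtio for low overhead
        "-drive", "file=/dev/vg0/guest0,if=virtio,format=raw,cache=none,aio=native",
    ]
    subprocess.run(cmd, check=True)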
replies(2): >>39448087 #>>39455138 #
17. wstuartcl ◴[] No.39447509[source]
The tests were for these local (metal, direct-connect) SSDs. The issue is not network overhead -- it's that, just like everything else in the cloud, the performance of 10 years ago was used as the baseline that carries over today, with upcharges to buy back the gains.

There is a reason why vCPU performance is still locked to the typical core from 10 years ago, when every core on a machine in those data centers today is 3-5x faster or more: they can charge you for 5x the cores to get that gain.

replies(2): >>39448553 #>>39450455 #
18. bravetraveler ◴[] No.39448056{4}[source]
Less fancy, quite often... at least on VPS providers [1]. They like to use reflinked files off the base images. This way they only store what differs.

1: Which is really just a cloud without a certain degree of software-defined networking/compute/storage/whatever.

19. bravetraveler ◴[] No.39448073{4}[source]
The storage driver may have more impact on VBox. You can get very impressive results with 'virtio' on KVM
replies(1): >>39448690 #
20. bravetraveler ◴[] No.39448087{4}[source]
Files with reflinks are a common choice, the main benefit being that only deltas are stored. The base OS image costs basically nothing.

LVM/block devices like you suggest are a good idea. You'd be surprised how much access time is trimmed by skipping another filesystem, like you'd have with a raw image file.
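
A small sketch of the reflink approach, assuming hypothetical paths and a reflink-capable filesystem such as XFS or Btrfs; the clone shares all extents with the base image, so only blocks the guest later rewrites consume new space:

    # Sketch: clone a base image for a new guest as a copy-on-write reflink.
    # Paths are made up; requires XFS/Btrfs (cp fails loudly otherwise).
    import subprocess

    base = "/var/lib/images/debian-base.raw"
    guest = "/var/lib/images/guest42.raw"

    subprocess.run(["cp", "--reflink=always", base, guest], check=True)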

21. wmf ◴[] No.39448553[source]
> vcpu performance is still locked to the typical core from 10 years ago

No. In some cases I think AWS actually buys special processors that are clocked higher than the ones you can buy.

replies(2): >>39449463 #>>39450390 #
22. magicalhippo ◴[] No.39448690{5}[source]
Yeah I've yet to try that. I know I get a similar lack of performance with Bhyve (FreeBSD) using VirtIO, so it's not a given it's fast.

I have no idea how AWS run their VMs, was just saying a slow storage driver could give such results.

replies(1): >>39449157 #
23. bravetraveler ◴[] No.39449157{6}[source]
> just saying a slow storage driver could give such results

Oh, absolutely - not to contest that! There's a whole lot of academia on 'para-virtualized' and so on in this light.

That's interesting to hear about FreeBSD; basically all of my experience has been with Linux/Windows.

24. Hewitt821 ◴[] No.39449306[source]
Local SSD is part of the machine, not network attached.
25. gowld ◴[] No.39449463{3}[source]
You are talking about real CPUs, not virtual CPUs.
replies(1): >>39449912 #
26. stingraycharles ◴[] No.39449527{3}[source]
You are correct, and the parent you’re replying to is confused. Nitro is for EBS, not the i3 local NVMe instances.

Those i3 instances lose your data whenever you stop and start them again (i.e. migrate to a different host machine); there's absolutely no reason they would use the network.

EBS itself uses a different network than the "normal" internet; if I were to guess, it's a converged Ethernet network optimized for iSCSI, which is what Nitro optimizes for as well. But it's not relevant for the local NVMe storage.

replies(1): >>39455152 #
27. pclmulqdq ◴[] No.39449574{3}[source]
That is what I am trying to say without actually giving it out. PCIe switches are very much not transparent devices. Apparently AWS has not published anything about this, though, and doesn't have Nitro moderating access to "local" SSD - that part I did get confused with EBS.
replies(3): >>39450226 #>>39450902 #>>39457669 #
28. wmf ◴[] No.39449912{4}[source]
Generally each vCPU is a dedicated hardware thread, which has gotten significantly faster in the last 10 years. Only lambdas, micros, and nanos have shared vCPUs and those have probably also gotten faster although it's not guaranteed.
replies(1): >>39450872 #
29. pzb ◴[] No.39450226{4}[source]
AWS has stated that there is a "Nitro Card for Instance Storage"[0][1] which is a NVMe PCIe controller that implements transparent encryption[2].

I don't have access to an EC2 instance to check, but you should be able to see the PCIe topology to determine how many physical cards are likely in i4i and im4gn and how they are connected. i4i claims to have 8 x 3,750 GB AWS Nitro SSDs, but it isn't clear how many PCIe lanes are used.

Also, AWS claims "Traditionally, SSDs maximize the peak read and write I/O performance. AWS Nitro SSDs are architected to minimize latency and latency variability of I/O intensive workloads [...] which continuously read and write from the SSDs in a sustained manner, for fast and more predictable performance. AWS Nitro SSDs deliver up to 60% lower storage I/O latency and up to 75% reduced storage I/O latency variability [...]"

This could explain the findings in the article - they only measured peak read/write, not predictability.

[0] https://perspectives.mvdirona.com/2019/02/aws-nitro-system/

[1] https://aws.amazon.com/ec2/nitro/

[2] https://d1.awsstatic.com/events/reinvent/2019/REPEAT_2_Power...
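
For anyone who wants to check the predictability angle rather than peak throughput, a minimal sketch of measuring 4 KiB random-read latency percentiles with O_DIRECT; Linux only, the device path is a placeholder, and a serious benchmark would use fio instead:

    # Sketch: p50/p99/p99.9 latency of 4 KiB random reads, bypassing the page cache.
    # Linux only; /dev/nvme1n1 is a placeholder for the instance-store device.
    import mmap, os, random, time

    PATH, BLOCK, READS = "/dev/nvme1n1", 4096, 10_000

    fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
    size = os.lseek(fd, 0, os.SEEK_END)
    buf = mmap.mmap(-1, BLOCK)        # page-aligned buffer, required by O_DIRECT

    lat_us = []
    for _ in range(READS):
        off = random.randrange(0, size // BLOCK) * BLOCK
        t0 = time.perf_counter_ns()
        os.preadv(fd, [buf], off)
        lat_us.append((time.perf_counter_ns() - t0) / 1000)
    os.close(fd)

    lat_us.sort()
    for pct in (50, 99, 99.9):
        print(f"p{pct}: {lat_us[int(len(lat_us) * pct / 100) - 1]:.1f} us")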

30. phanimahesh ◴[] No.39450390{3}[source]
The parent claims that, though AWS uses better hardware, they bill in vCPUs whose benchmarks are from a few years ago, so that they can sell more vCPU units per performant physical CPU. This does not contradict your claim that AWS buys better hardware.
replies(1): >>39450513 #
31. stingraycharles ◴[] No.39450406{3}[source]
The parent thinks that AWS' i3 NVMe local instance storage is using a PCIe switch, which is not the case. EBS (and the AWS Nitro card) use a PCIe switch, and as such all EBS storage is exposed as e.g. /dev/nvmeXnY . But that's not the same as the i3 instances are offering, so the parent is confused.
32. flaminHotSpeedo ◴[] No.39450455[source]
> there is a reason why vcpu performance is still locked to the typical core from 10 years ago

That is transparently nonsense.

You can disprove that claim in 5 minutes, and it makes literally zero sense for offerings that aren't oversubscribed

33. wmf ◴[] No.39450513{4}[source]
It's so obviously wrong that I can't really explain it. Maybe someone else can. To believe that requires a complete misunderstanding of IaaS.
replies(1): >>39462313 #
34. jandrewrogers ◴[] No.39450872{5}[source]
In fairness, there are a not insignificant number of workloads that do not benefit from hardware threads on CPUs [0], instead isolating processes along physical cores for optimal performance.

[0] Assertion not valid for barrel processors.

35. reactordev ◴[] No.39450882[source]
AWS is so large, every concept of hardware is virtualized over a software layer. “Instance storage” is no different. It’s just closer to the edge with your node. It’s not some box in a rack where some AWS tech slots in an SSD. AWS has a hardware layer, but you’ll never see it.
36. rowanG077 ◴[] No.39450902{4}[source]
Why are you acting as if PCIe switches are some secret technology? It was extremely grating for me to read your comments.
replies(2): >>39453547 #>>39453766 #
37. fcsp ◴[] No.39451140{3}[source]
I see wiggle room in the statement you posted, in that the SSD storage that is physically inside the machine hosting the instance might still be mounted into the virtualized instance via some kind of network protocol, adding overhead.
replies(1): >>39459703 #
38. stingraycharles ◴[] No.39453547{5}[source]
Because the parent works/worked for Google, so obviously it must be super secret sauce that nobody has heard of. /s

Next up they're going to explain to us that iSCSI wants us to think it's SCSI but it's actually not!

39. the-rc ◴[] No.39453766{5}[source]
Although it used them for years, the first mention by Google of PCIe switches was probably in the 2022 Aquila paper, which doesn't really talk about storage anyway...
replies(1): >>39457173 #
40. rcarmo ◴[] No.39454138{3}[source]
This is the way Azure temporary volumes work as well. They are scrubbed off the hardware once the VM that accesses them is dead. Everything else is over the network.
41. jethro_tell ◴[] No.39455138{4}[source]
Heh, you'd probably be surprised. There's some really cool cutting-edge stuff being done in those data centers, but a lot of what is done is just plain old standard server management without much in the way of tricks. It's just that someone else does it instead of you, and the billing department is counting milliseconds.
replies(1): >>39455931 #
42. MichaelZuo ◴[] No.39455152{4}[source]
The argument could also be resolved by just getting the latency numbers for both cases and comparing them; on bare metal it shouldn't be more than a few hundred nanoseconds.
43. icedchai ◴[] No.39455931{5}[source]
Do cloud providers document these internals anywhere? I'd love to read about that sort of thing.
replies(1): >>39456966 #
44. jethro_tell ◴[] No.39456966{6}[source]
Not generally, especially not the super generic stuff. Where they really excel is having the guy who wrote the kernel driver or hypervisor on staff. But a lot of it is just an automated version of what you'd do on a smaller scale.
45. rowanG077 ◴[] No.39457173{6}[source]
I don't understand why you would expect Google to state that. They have been standard technology for almost 2 decades. You don't see Google claiming they use JTAG or SPI flash or whatever. It's just not special.
replies(1): >>39462921 #
46. dekhn ◴[] No.39457669{4}[source]
Like many other people in this thread, I think we disagree that a PCI switch means that an SSD "is connected over a network" to the host bus.

Now if you can show me two or more hosts connected to a box of SSDs through a PCI switch (and some sort of cool tech for coordinating between the hosts), that's interesting.

47. eek2121 ◴[] No.39459703{4}[source]
At minimum, the entire setup will be virtualized, which does add overhead.
48. asalahli ◴[] No.39462313{5}[source]
GP is probably referring to blurbs like these:

> Amazon SimpleDB measures the machine utilization of each request and charges based on the amount of machine capacity used to complete the particular request (SELECT, GET, PUT, etc.), normalized to the hourly capacity of a circa 2007 1.7 GHz Xeon processor. See below for a more detailed description of how machine utilization charges are calculated.

https://aws.amazon.com/simpledb/pricing/

replies(1): >>39463018 #
49. the-rc ◴[] No.39462921{7}[source]
Google didn't invent the Clos network, either, but it took years before they started talking about its adoption and with what kind of proprietary twists. Same with power supplies. You're right, a PCIe switch is not special, unless maybe it's integrated in some unconventional way. It's in Google's DNA to be cagey by default on a lot of details, to avoid giving ideas to the competition. Or misleading others down rabbit holes, like with shipping container datacenters.
replies(1): >>39466137 #
50. wmf ◴[] No.39463018{6}[source]
SimpleDB is over 15 years old. I guess it's the only service still using "normalized" pricing. Newer services like RDS tell you exactly which processor you're getting and how many cores.
51. sitkack ◴[] No.39466137{8}[source]
No, it dismisses technology until it does a 180 and then pretends it innovated in ways everyone is too stupid to understand. Google exceptionalism 101.