
SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points by greghn | 128 comments
pclmulqdq ◴[] No.39443994[source]
This was a huge technical problem I worked on at Google, and it is sort of fundamental to a cloud. I believe this is actually a big deal that drives people's technology directions.

SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network-attached storage systems were invented, because hard drives are fundamentally slow compared to networks, but it is a problem for SSDs.

replies(30): >>39444009 #>>39444024 #>>39444028 #>>39444046 #>>39444062 #>>39444085 #>>39444096 #>>39444099 #>>39444120 #>>39444138 #>>39444328 #>>39444374 #>>39444396 #>>39444429 #>>39444655 #>>39444952 #>>39445035 #>>39445917 #>>39446161 #>>39446248 #>>39447169 #>>39447467 #>>39449080 #>>39449287 #>>39449377 #>>39449994 #>>39450169 #>>39450172 #>>39451330 #>>39466088 #
1. jsnell ◴[] No.39444096[source]
According to the submitted article, the numbers are from AWS instance types where the SSD is "physically attached" to the host, not about SSD-backed NAS solutions.

Also, the article isn't just about SSDs being no faster than a network. It's about SSDs being two orders of magnitude slower than datacenter networks.

replies(3): >>39444161 #>>39444353 #>>39448728 #
2. pclmulqdq ◴[] No.39444161[source]
It's because the "local" SSDs are not actually physically attached and there's a network protocol in the way.
replies(14): >>39444222 #>>39444248 #>>39444253 #>>39444261 #>>39444341 #>>39444352 #>>39444373 #>>39445175 #>>39446024 #>>39446163 #>>39446271 #>>39446742 #>>39446840 #>>39446893 #
3. zokier ◴[] No.39444222[source]
What makes you think that?
4. ddorian43 ◴[] No.39444248[source]
Do you have a link to explain this? I don't think it's true.
5. candiddevmike ◴[] No.39444253[source]
Depends on the cloud provider. Local SSDs are physically attached to the host on GCP, but that makes them only useful for temporary storage.
replies(3): >>39444326 #>>39444754 #>>39445986 #
6. mike_hearn ◴[] No.39444261[source]
They must do this because they want SSDs to be in a physically separate part of the building for operational reasons; otherwise, what's the point of giving you a "local" SSD that isn't actually plugged into the real machine?
replies(2): >>39444961 #>>39446759 #
7. pclmulqdq ◴[] No.39444326{3}[source]
If you're at G, you should read the internal docs on exactly how this happens and it will be interesting.
replies(3): >>39444529 #>>39450240 #>>39450805 #
8. colechristensen ◴[] No.39444341[source]
For AWS there are EBS volumes attached through a custom hardware NVMe interface and then there's Instance Store which is actually local SSD storage. These are different things.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Instance...

replies(1): >>39444710 #
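A minimal sketch of how to tell the two apart from inside a Nitro-based instance, assuming a Linux guest and the NVMe model strings AWS documents for EBS and instance-store volumes (treat the strings as an assumption and verify on your own instance):

    # Classify NVMe controllers as EBS or instance store by their model string.
    # Assumes Linux sysfs; model strings follow AWS documentation for
    # Nitro-based instances.
    from pathlib import Path

    def classify_nvme():
        for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
            model = (ctrl / "model").read_text().strip()
            if "Elastic Block Store" in model:
                kind = "EBS (network-attached block storage)"
            elif "Instance Storage" in model:
                kind = "instance store (host-local SSD)"
            else:
                kind = "unknown"
            print(f"{ctrl.name}: {model!r} -> {kind}")

    if __name__ == "__main__":
        classify_nvme()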
9. jrullman ◴[] No.39444352[source]
I can attest to the fact that on EC2, "instance store" volumes are actually physically attached.
10. crazygringo ◴[] No.39444353[source]
> It's about SSDs being two orders of magnitude slower than datacenter networks.

Could that have to do with every operation requiring a round trip, rather than being able to queue up operations in a buffer to saturate throughput?

It seems plausible if the interface protocol was built for a device it assumed was physically local and so waited for confirmation after each operation before performing the next.

In this case it's not so much the throughput rate that matters, but the latency -- which can also be heavily affected by buffering of other network traffic.

replies(1): >>39444467 #
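A back-of-the-envelope sketch of the round-trip effect described above, using Little's law (IOPS ≈ queue depth / per-operation latency); the latency figures are illustrative assumptions, not measurements of any provider:

    # With one operation in flight, round-trip latency caps IOPS directly;
    # deeper queues recover throughput only if the workload can keep
    # independent requests outstanding.
    def iops(queue_depth: int, latency_s: float) -> float:
        return queue_depth / latency_s

    for label, latency in [("local NVMe, ~100 us", 100e-6),
                           ("networked block store, ~1 ms", 1e-3)]:
        for qd in (1, 32):
            print(f"{label}, QD{qd}: ~{iops(qd, latency):,.0f} IOPS")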
11. jsnell ◴[] No.39444373[source]
I think you're wrong about that. AWS calls this class of storage "instance storage" [0], and defines it as:

> Many Amazon EC2 instances can also include storage from devices that are located inside the host computer, referred to as instance storage.

There might be some wiggle room in "physically attached", but there's none in "storage devices located inside the host computer". It's not some kind of AWS-only thing either. GCP has "local SSD disks"[1], which I'm going to claim are likewise local, not over the network block storage. (Though the language isn't as explicit as for AWS.)

[0] https://aws.amazon.com/ec2/instance-types/

[1] https://cloud.google.com/compute/docs/disks#localssds

replies(5): >>39444464 #>>39445545 #>>39447509 #>>39449306 #>>39450882 #
12. 20after4 ◴[] No.39444464{3}[source]
If the SSD is installed in the host server, doesn't that still allow for it to be shared among many instances running on said host? I can imagine that a compute node has just a handful of SSDs and many hundreds of instances sharing the I/O bandwidth.
replies(5): >>39444584 #>>39444763 #>>39444820 #>>39445938 #>>39446130 #
13. Nextgrid ◴[] No.39444467[source]
Underlying protocol limitations wouldn't be an issue - the cloud provider's implementation can work around that. They're unlikely to be sending sequential SCSI/NVMe commands over the wire - instead, the hypervisor pretends to be the NVMe device, but then converts to some internal protocol (one that's less chatty and can coalesce requests without waiting on individual ACKs) before sending that to the storage server.

The problem is that ultimately your application often requires the outcome of a given IO operation to decide which operation to perform next. Take a database: it must first read the index (and wait for that to complete) before it knows the on-disk location of the actual row data, and only then can it issue the next IO operation.

In this case, there's no other solution than to move the application closer to the data itself. Instead of the networked storage node being a dumb blob store returning bytes, the networked "storage" node is your database itself, returning query results. I believe that's what RDS Aurora does, for example; every storage node can itself understand query predicates.
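A toy illustration of that dependency, assuming a made-up 1 ms round trip per read: the second read cannot be issued until the first returns, so the latencies add up regardless of how much bandwidth the link has.

    import asyncio
    import time

    REMOTE_LATENCY = 0.001  # assumed 1 ms round trip to the storage server

    async def read_block(offset: int) -> bytes:
        await asyncio.sleep(REMOTE_LATENCY)      # stand-in for a networked read
        return b"block-%d" % offset

    async def point_lookup(key: int) -> bytes:
        index_page = await read_block(key)       # round trip 1: read the index
        row_offset = len(index_page)             # pretend we decoded a row pointer
        return await read_block(row_offset)      # round trip 2: read the row

    start = time.perf_counter()
    asyncio.run(point_lookup(42))
    print(f"dependent lookup took ~{(time.perf_counter() - start) * 1000:.1f} ms")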

14. rfoo ◴[] No.39444529{4}[source]
Why would I lose all data on these SSDs when I initiate a power off of the VM on console, then?

I believe local SSDs are definitely attached to the host. They are just not exposed via NVMe ZNS, hence the performance hit.

replies(2): >>39444859 #>>39445006 #
15. discodave ◴[] No.39444584{4}[source]
If you have one of the metal instance types, then you get the whole host, e.g. i4i.metal:

https://aws.amazon.com/ec2/instance-types/i4i/

16. kwillets ◴[] No.39444710{3}[source]
EBS is also slower than local NVMe mounts on i3's.

Also, both features use Nitro SSD cards, according to AWS docs. The Nitro architecture is all locally attached -- instance storage to the instance, EBS to the EBS server.

17. amluto ◴[] No.39444754{3}[source]
Which is a weird sort of limitation. For any sort of you-own-the-hardware arrangement, NVMe disks are fine for long term storage. (Obviously one should have backups, but that’s a separate issue. One should have a DR plan for data on EBS, too.)

You need to migrate that data if you replace an entire server, but this usually isn’t a very big deal.

replies(1): >>39444869 #
18. aeyes ◴[] No.39444763{4}[source]
On AWS, yes. The older instances I am familiar with had 900 GB drives, which they sliced up into volumes of 600, 450, 300, 150, or 75 GB depending on instance size.

But they also tell you how much IOPS you get: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/stora...

19. ownagefool ◴[] No.39444820{4}[source]
PCI bus, etc too
20. manquer ◴[] No.39444859{5}[source]
It is because on reboot you may not get the same physical server. They are not rebooting the physical server for you, just the VM.

The same physical server is not allocated for a variety of reasons: scheduled maintenance, proximity to other hosts in the VPC, balancing quiet and noisy neighbors, and so on.

It is not that the disk will always be wiped; sometimes the data is still there on reboot. There is just no guarantee, which allows them to freely move VMs between hosts.

replies(1): >>39448758 #
21. supriyo-biswas ◴[] No.39444869{4}[source]
This is Hyrum's law at play: AWS wants to make sure that the instance stores aren't seen as persistent, and therefore enforces the failure mode for normal operations as well.

You should also see how they enforce similar things for their other products and APIs, for example, most of their services have encrypted pagination tokens.

22. ianburrell ◴[] No.39444961{3}[source]
The reason for having most instances use network storage is that it makes it possible to migrate instances to other hosts. If the host fails, the network storage can be pointed at the new host with a reboot. AWS sends out notices regularly when they are going to reboot or migrate instances.

There probably should be more local instance storage types for use with instances that can be recreated without loss. But it is simpler for them to have a single way of doing things.

At work, someone used fast NVMe instance storage for ClickHouse, which is a database. It was a huge hassle to copy the data when instances were going to be restarted, because the data would be lost.

replies(3): >>39445385 #>>39446444 #>>39447003 #
23. res0nat0r ◴[] No.39445006{5}[source]
Your EC2 instance with instance-store storage, when stopped, can be launched on any other random host in the AZ when you power it back on. Your root disk is an EBS volume attached across the network, so when you start your instance back up you will likely be launched somewhere else with an empty slot and empty local storage. This is why there is always a disclaimer that this local storage is ephemeral; don't count on it being around long-term.
replies(1): >>39446333 #
24. dekhn ◴[] No.39445175[source]
I suspect you must be conflating several different storage products. Are you saying https://cloud.google.com/compute/docs/disks/local-ssd devices talk to the host through a network (say, ethernet with some layer on top)? Because the documentation very clearly says otherwise, "This is because Local SSD disks are physically attached to the server that hosts your VM. For this same reason, Local SSD disks can only provide temporary storage." (at least, I'm presuming that by physically attached, they mean it's connected to the PCI bus without a network in between).

I suspect you're thinking of SSD-PD. If "local" SSDs are not actually local and go through a network, I need to have a discussion with my GCS TAM about truth in advertising.

replies(3): >>39445299 #>>39446138 #>>39449322 #
25. op00to ◴[] No.39445299{3}[source]
> physically attached

Believe it or not, superglue and a wifi module! /s

26. mike_hearn ◴[] No.39445385{4}[source]
Sure, I understand that, but this user is claiming that on GCP even local SSDs aren't really local, which raises the question of why not.

I suspect the answer is something to do with their manufacturing processes/rack designs. When I worked there (pre GCP) machines had only a tiny disk used for booting and they wanted to get rid of that. Storage was handled by "diskful" machines that had dedicated trays of HDDs connected to their motherboards. If your datacenters and manufacturing processes are optimized for building machines that are either compute or storage but not both, perhaps the more normal cloud model is hard to support and that pushes you towards trying to aggregate storage even for "local" SSD or something.

replies(2): >>39445512 #>>39450483 #
27. deadmutex ◴[] No.39445512{5}[source]
The GCE claim is unverified. OP seems to be referring to PD-SSD and not LocalSSD
replies(1): >>39450257 #
28. pclmulqdq ◴[] No.39445545{3}[source]
That's the abstraction they want you to work with, yes. That doesn't mean it's what is actually happening - at least not in the same way that you're thinking.

As a hint for you, I said "a network", not "the network." You can also look at public presentations about how Nitro works.

replies(4): >>39445944 #>>39446809 #>>39447308 #>>39447443 #
29. throwawaaarrgh ◴[] No.39445938{4}[source]
Instance storage is not networked. That's why it's there.
30. jng ◴[] No.39445944{4}[source]
Nitro "virtual NVME" device are mostly (only?) for EBS -- remote network storage, transparently managed, using a separate network backbone, and presented to the host as a regular local NVME device. SSD drives in instances such as i4i, etc. are physically attached in a different way -- but physically, unlike EBS, they are ephemeral and the content becomes unavaiable as you stop the instance, and when you restart, you get a new "blank slate". Their performance is 1 order of magnitude faster than standard-level EBS, and the cost structure is completely different (and many orders of magnitude more affordable than EBS volumes configured to have comparable I/O performance).
replies(1): >>39454138 #
31. throwawaaarrgh ◴[] No.39445986{3}[source]
Yes, that's what their purpose is in cloud applications: temporary high performance storage only.

If you want long term local storage you'll have to reserve an instance host.

32. crotchfire ◴[] No.39446024[source]
This is incorrect.

Amazon offers both locally-attached storage devices as well as instance-attached storage devices. The article is about the latter kind.

33. queuebert ◴[] No.39446130{4}[source]
How do these machines manage the sharing of one local SSD across multiple VMs? Is there some wrapper around the I/O stack? Does it appear as a network share? Genuinely curious...
replies(4): >>39446222 #>>39446276 #>>39446886 #>>39447488 #
34. mint2 ◴[] No.39446138{3}[source]
I don’t really agree with assuming the form of physical attachment and interaction unless it is spelled out.

If that's what's meant, it will be stated in some fine print; if it's not stated anywhere, then there is no guarantee what the term means, except that I would guess they may want people to infer things that may not necessarily be true.

replies(1): >>39446940 #
35. bfncieezo ◴[] No.39446163[source]
Instances can have block storage, which is network-attached, or locally attached SSD/NVMe. They're two separate things.
36. felixg3 ◴[] No.39446222{5}[source]
Probably NVME namespaces [0]?

[0]: https://nvmexpress.org/resource/nvme-namespaces/

replies(1): >>39448056 #
37. choppaface ◴[] No.39446271[source]
Nope! Well, not as advertised. There are instances, usually more expensive ones, where there are supposed to be local NVMe disks dedicated to the instance. You're totally right that providing good I/O is a big problem! And I have done studies myself showing just how bad Google Cloud is here, and I have totally ditched Google Cloud for providing crappy compute service (and even worse customer service).
38. dan-robertson ◴[] No.39446276{5}[source]
AWS has custom firmware for at least some of their SSDs, so it could be that.
39. mrcarrot ◴[] No.39446333{6}[source]
I think the parent was agreeing with you. If the “local” SSDs _weren’t_ actually local, then presumably they wouldn’t need to be ephemeral since they could be connected over the network to whichever host your instance was launched on.
40. youngtaff ◴[] No.39446444{4}[source]
> At work, someone used fast NVMe instance storage for Clickhouse which is a database. It was a huge hassle to copy data when instances were going to be restarted because the data would be lost.

This post on how Discord RAIDed local NVMe volumes with slower remote volumes might be of interest: https://discord.com/blog/how-discord-supercharges-network-di...

replies(1): >>39447097 #
41. yolovoe ◴[] No.39446742[source]
You're wrong. Instance local means the SSD is physically attached to the droplet and is inside the server chassis, connected via PCIe.

Source: I work on Nitro cards.

replies(1): >>39447200 #
42. yolovoe ◴[] No.39446759{3}[source]
The comment you’re responding to is wrong. AWS offers many kinds of storage. Instance local storage is physically attached to the droplet. EBS isn’t but that’s a separate thing entirely.

I literally work in EC2 Nitro.

43. jsnell ◴[] No.39446809{4}[source]
I've linked to public documentation that is pretty clearly in conflict with what you said. There's no wiggle room in how AWS describes their service without it being false advertising. There's no "ah, but what if we define the entire building to be the host computer, then the networked SSDs really are inside the host computer" sleight of hand to pull off here.

You've provided cryptic hints and a suggestion to watch some unnamed presentation.

At this point I really think the burden of proof is on you.

replies(2): >>39449527 #>>39451140 #
44. hathawsh ◴[] No.39446840[source]
That seems like a big opportunity for other cloud providers. They could provide SSDs that are actually physically attached and boast (rightfully) that their SSDs are a lot faster, drawing away business from older cloud providers.
replies(3): >>39447445 #>>39447849 #>>39449642 #
45. magicalhippo ◴[] No.39446886{5}[source]
In, say, VirtualBox you can create a file backed by the physical disk and attach it to the VM so that the VM sees it as an NVMe drive.

In my experience this is also orders of magnitude slower than true direct access, i.e. PCIe pass-through, as all access has to pass through the VM storage driver, which could explain what is happening.

replies(1): >>39448073 #
46. Salgat ◴[] No.39446893[source]
At first you'd think maybe they could do a volume copy from a snapshot to a local drive on instance creation, but even at 100 Gbps you're looking at almost 3 minutes for a 2 TB drive.
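The arithmetic behind that figure, assuming the link can actually be saturated and ignoring protocol overhead:

    # 2 TB over a 100 Gbit/s link
    size_bits = 2e12 * 8        # 2 TB expressed in bits
    link_bps = 100e9            # 100 Gbit/s
    seconds = size_bits / link_bps
    print(f"{seconds:.0f} s, about {seconds / 60:.1f} minutes")   # ~160 s, ~2.7 min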
47. dekhn ◴[] No.39446940{4}[source]
"Physically attached" has had a fairly well defined meaning and i don't normally expect a cloud provider to play word salad to convince me a network drive is locally attached (like I said, if true, I would need to have a chat with my TAM about it).

Physically attached for servers, for the past 20+ years, has meant a direct electrical connection to a host bus (such as the PCI bus attached to the front-side bus). I'd like to see some alternative examples that violate that convention.

replies(1): >>39447819 #
48. wiredfool ◴[] No.39447003{4}[source]
Are you saying that a reboot wipes the ephemeral disks? Or stopping and starting the instance from the AWS console/API?
replies(1): >>39447073 #
49. ianburrell ◴[] No.39447073{5}[source]
Reboot keeps the instance storage volumes. Restarting wipes them. Starting frequently migrates to a new host. And the "restart" notices AWS sends are likely because the host has a problem and they need to migrate it.
50. ianburrell ◴[] No.39447097{5}[source]
We moved to running Clickhouse on EKS with EBS volumes for storage. It can better survive instances going down. I didn't work on it, so I don't know how much slower it is. Lowering the management burden was a big priority.
51. tptacek ◴[] No.39447200{3}[source]
"Attached to the droplet"?
replies(4): >>39447513 #>>39447821 #>>39453545 #>>39462446 #
52. dekhn ◴[] No.39447308{4}[source]
it sounds like you're trying to say "PCI switch" without saying "PCI switch" (I worked at Google for over a decade, including hardware division).
replies(1): >>39449574 #
53. jasonwatkinspdx ◴[] No.39447443{4}[source]
Both the documentation and Amazon employees are in here telling you that you're wrong. Can you resolve that contradiction or do you just want to act coy like you know some secret? The latter behavior is not productive.
replies(1): >>39450406 #
54. solardev ◴[] No.39447445{3}[source]
For what kind of workloads would a slower SSD be a significant bottleneck?
replies(3): >>39448439 #>>39449079 #>>39449776 #
55. icedchai ◴[] No.39447488{5}[source]
With Linux and KVM/QEMU, you can map an entire physical disk, disk partition, or file to a block device in the VM. For my own VM hosts, I use LVM and map a logical volume to the VM. I assumed cloud providers did something conceptually similar, only much more sophisticated.
replies(2): >>39448087 #>>39455138 #
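A minimal sketch of that kind of mapping, assuming a pre-created logical volume at a hypothetical path; a real host would normally drive this through libvirt rather than invoking QEMU directly:

    # Hand a host LVM logical volume to a KVM guest as a virtio block device.
    import subprocess

    LV_PATH = "/dev/vg0/guest-disk"   # hypothetical; e.g. lvcreate -L 100G -n guest-disk vg0

    subprocess.run([
        "qemu-system-x86_64",
        "-enable-kvm",
        "-m", "2048",
        # Raw block device passed through as a paravirtualized (virtio) disk;
        # cache=none avoids double-caching in host and guest page caches.
        "-drive", f"file={LV_PATH},if=virtio,format=raw,cache=none",
    ], check=True)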
56. wstuartcl ◴[] No.39447509{3}[source]
The tests were for these local (metal, direct-connect) SSDs. The issue is not network overhead; it's that, just like everything else in the cloud, the performance of 10 years ago was used as the baseline that carries over today, with upcharges to buy back the gains.

There is a reason why vCPU performance is still locked to the typical core of 10 years ago, when every core on a machine in those data centers today is 3-5x or more faster: it's because they can charge you for 5x the cores to get that gain.

replies(2): >>39448553 #>>39450455 #
57. hipadev23 ◴[] No.39447513{4}[source]
digitalocean squad
replies(1): >>39449627 #
58. adgjlsfhk1 ◴[] No.39447819{5}[source]
Ethernet cables are physical...
replies(2): >>39447896 #>>39450138 #
59. sargun ◴[] No.39447821{4}[source]
Droplets are what EC2 calls their hosts. Confusing? I know.
replies(2): >>39447828 #>>39462378 #
60. tptacek ◴[] No.39447828{5}[source]
Yes! That is confusing! Tell them to stop it!
replies(1): >>39448983 #
61. ddorian43 ◴[] No.39447849{3}[source]
Next thing the other clouds will offer is cheaper bandwidth pricing, right?
62. dekhn ◴[] No.39447896{6}[source]
The NIC is attached to the host bus through the north bridge. But other hosts on the same Ethernet network are not considered to be "local". We don't need to get crazy about the semantics to know that when a cloud provider says an SSD is locally attached, it's closer than an Ethernet network away.
63. bravetraveler ◴[] No.39448056{6}[source]
Less fancy, quite often... at least on VPS providers [1]. They like to use reflinked files off the base images. This way they only store what differs.

1: Which is really a cloud without a certain degree of software defined networking/compute/storage/whatever.

64. bravetraveler ◴[] No.39448073{6}[source]
The storage driver may have more impact on VBox. You can get very impressive results with 'virtio' on KVM
replies(1): >>39448690 #
65. bravetraveler ◴[] No.39448087{6}[source]
Files with reflinks are a common choice, the main benefit being: only storing deltas. The base OS costs basically nothing

LVM/block like you suggest is a good idea. You'd be surprised how much access time is trimmed by skipping another filesystem like you'd have with a raw image file

66. lolc ◴[] No.39448439{4}[source]
I tend some workloads that transform data grids of varying sizes. The grids are anonymous mmaps, so that when memory runs out they get paged out. This means processing stays mostly in memory, yet won't abort when memory runs tight. The processes that get hit by paging slow to a crawl, though. Getting a faster SSD means they're still crawling, but crawling faster. Doubling SSD throughput would pretty much halve the tail latency.
replies(1): >>39448632 #
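For reference, a minimal version of the anonymous-mmap pattern described above (the grid size is an arbitrary assumption):

    # An anonymous mapping is backed by swap rather than a file, so under
    # memory pressure the kernel pages it out instead of the process aborting.
    import mmap

    GRID_BYTES = 1 << 30                   # 1 GiB working grid (illustrative)
    grid = mmap.mmap(-1, GRID_BYTES)       # -1 => anonymous, not file-backed

    grid[0:8] = b"\x01" * 8                # touch pages to fault them in
    grid[GRID_BYTES - 8:GRID_BYTES] = b"\x02" * 8
    grid.close()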
67. wmf ◴[] No.39448553{4}[source]
> vcpu performance is still locked to the typical core from 10 years ago

No. In some cases I think AWS actually buys special processors that are clocked higher than the ones you can buy.

replies(2): >>39449463 #>>39450390 #
68. solardev ◴[] No.39448632{5}[source]
I see. Thanks for explaining!
69. magicalhippo ◴[] No.39448690{7}[source]
Yeah, I've yet to try that. I know I get a similar lack of performance with bhyve (FreeBSD) using VirtIO, so it's not a given that it's fast.

I have no idea how AWS runs their VMs; I was just saying a slow storage driver could give such results.

replies(1): >>39449157 #
70. karmakaze ◴[] No.39448728[source]
I've run CI/CD pipelines on EC2 machines with local storage, typically running RAID 0, btrfs, noatime. I didn't care if the filesystem got corrupted or whatever; I had a script that would rebuild it in under 30 minutes. In addition to the performance, you're not paying by IOPS.
71. mr_toad ◴[] No.39448758{6}[source]
Data persists between reboots, but not shutdowns:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...

72. kiwijamo ◴[] No.39448983{6}[source]
FYI it's not an AWS term, it's a DigitalOcean term.
replies(3): >>39449067 #>>39449367 #>>39462407 #
73. tptacek ◴[] No.39449067{7}[source]
I could not be more confused. Does EC2 quietly call their hosting machines "droplets"? I knew "droplets" to be a DigitalOcean term, but DigitalOcean doesn't have Nitro cards.
replies(2): >>39449214 #>>39449914 #
74. ReflectedImage ◴[] No.39449079{4}[source]
Pretty much all workloads; workloads that are not affected would be the exception.
75. bravetraveler ◴[] No.39449157{8}[source]
> just saying a slow storage driver could give such results

Oh, absolutely - not to contest that! There's a whole lot of academia on 'para-virtualized' and so on in this light.

That's interesting to hear about FreeBSD; basically all of my experience has been with Linux/Windows.

76. apitman ◴[] No.39449214{8}[source]
Now I'm wondering if that's where DO got the name in the first place
replies(1): >>39449620 #
77. Hewitt821 ◴[] No.39449306{3}[source]
Local SSD is part of the machine, not network attached.
78. Hewitt821 ◴[] No.39449322{3}[source]
Local SSD is part of the machine.
79. sargun ◴[] No.39449367{7}[source]
I believe AWS was calling them droplets prior to digital ocean.
80. gowld ◴[] No.39449463{5}[source]
You are talking about real CPUs, not virtual CPUs.
replies(1): >>39449912 #
81. stingraycharles ◴[] No.39449527{5}[source]
You are correct, and the parent you’re replying to is confused. Nitro is for EBS, not the i3 local NVMe instances.

Those i3 instances lose your data whenever you stop and start them again (i.e. migrate to a different host machine); there's absolutely no reason they would use the network.

EBS itself uses a different network than the "normal" internet; if I were to guess, it's a converged Ethernet network optimized for iSCSI, which is what Nitro optimizes for as well. But it's not relevant for the local NVMe storage.

replies(1): >>39455152 #
82. pclmulqdq ◴[] No.39449574{5}[source]
That is what I am trying to say without actually giving it out. PCIe switches are very much not transparent devices. Apparently AWS has not published anything about this, though, and doesn't have Nitro moderating access to "local" SSD; that part I did get confused with EBS.
replies(3): >>39450226 #>>39450902 #>>39457669 #
83. chatmasta ◴[] No.39449620{9}[source]
Surely "droplet" is a derivative of "ocean?"
replies(1): >>39449858 #
84. jbnorth ◴[] No.39449627{5}[source]
No, that’s AWS.
85. jbnorth ◴[] No.39449642{3}[source]
This is already a thing. AWS instance store volumes are directly attached to the host. I’m pretty sure GCP and Azure also have an equivalent local storage option.
86. jandrewrogers ◴[] No.39449776{4}[source]
I run very large database-y workloads. Storage bandwidth is by far the throughput rate limiting factor. Cloud environments are highly constrained in this regard and there is a mismatch between the amount of CPU you are required to buy to get a given amount of bandwidth. I could saturate a much faster storage system with a fraction of the CPU but that isn’t an option. Note that latency is not a major concern here.

This has an enormous economic impact. I once did a TCO study with AWS on moving a data-intensive workload, which ran on purpose-built infrastructure, onto their cloud. AWS would have been 3x more expensive per their own numbers; they didn't even argue it. The main difference is that we had highly optimized our storage configuration to provide exceptional throughput for our workload on cheap hardware.

I currently run workloads in the cloud because it is convenient. At scale though, the cost difference to run it on your own hardware is compelling. The cloud companies also benefit from a learned helplessness when it comes to physical infrastructure. Ironically, it has never been easier to do a custom infrastructure build, which companies used to do all the time, but most people act like it is deep magic now.

replies(1): >>39450050 #
87. arrakeenrevived ◴[] No.39449858{10}[source]
Clouds (like, the big fluffy things in the sky) are made up of many droplets of liquid. Using "droplet" to refer to the things that make up cloud computing is a pretty natural nickname for any cloud provider, not just DO. I do imagine that DO uses "droplet" as a public product branding because it works well with their "Ocean" brand, though.

...now I'm actually interested in knowing if "droplet" is derived from "ocean", or if "Digital Ocean" was derived from having many droplets (which was derived from cloud). Maybe neither.

replies(1): >>39455023 #
88. wmf ◴[] No.39449912{6}[source]
Generally each vCPU is a dedicated hardware thread, which has gotten significantly faster in the last 10 years. Only lambdas, micros, and nanos have shared vCPUs and those have probably also gotten faster although it's not guaranteed.
replies(1): >>39450872 #
89. ◴[] No.39449914{8}[source]
90. solardev ◴[] No.39450050{5}[source]
Thanks for the details!

Does this mean you're colocating your own server in a data center somewhere? Or do you have your own data center/running it off a bare metal server with a business connection?

Just wondering if the TCO included the same levels of redundancy and bandwidth, etc.

replies(1): >>39450828 #
91. SteveNuts ◴[] No.39450138{6}[source]
If that’s the game we’re going to play then technically my driveway is on the same road as the White House.
replies(1): >>39450275 #
92. pzb ◴[] No.39450226{6}[source]
AWS has stated that there is a "Nitro Card for Instance Storage"[0][1] which is a NVMe PCIe controller that implements transparent encryption[2].

I don't have access to an EC2 instance to check, but you should be able to see the PCIe topology to determine how many physical cards are likely in i4i and im4gn and their PCIe connections. i4i claims to have 8 x 3,750 AWS Nitro SSD, but it isn't clear how many PCIe lanes are used.

Also, AWS claims "Traditionally, SSDs maximize the peak read and write I/O performance. AWS Nitro SSDs are architected to minimize latency and latency variability of I/O intensive workloads [...] which continuously read and write from the SSDs in a sustained manner, for fast and more predictable performance. AWS Nitro SSDs deliver up to 60% lower storage I/O latency and up to 75% reduced storage I/O latency variability [...]"

This could explain the findings in the article - they only measured peak r/w, not predictability.

[0] https://perspectives.mvdirona.com/2019/02/aws-nitro-system/ [1] https://aws.amazon.com/ec2/nitro/ [2] https://d1.awsstatic.com/events/reinvent/2019/REPEAT_2_Power...

93. jsolson ◴[] No.39450240{4}[source]
In most cases, they're physically plugged into a PCIe CEM slot in the host.

There is no network in the way, you are either misinformed or thinking of a different product.

94. rwiggins ◴[] No.39450257{6}[source]
GCE local SSDs absolutely are on the same host as the VM. The docs [0] are pretty clear on this, I think:

> Local SSD disks are physically attached to the server that hosts your VM.

Disclosure: I work on GCE.

[0] https://cloud.google.com/compute/docs/disks/local-ssd

95. adgjlsfhk1 ◴[] No.39450275{7}[source]
Exactly. It's not about what's good for the consumer, it's about what they can do without losing a lawsuit for false advertising.
96. phanimahesh ◴[] No.39450390{5}[source]
The parent claims that although AWS uses better hardware, they bill in vCPUs whose benchmarks are from a few years ago, so that they can sell more vCPU units per performant physical CPU. This does not contradict your claim that AWS buys better hardware.
replies(1): >>39450513 #
97. stingraycharles ◴[] No.39450406{5}[source]
The parent thinks that AWS' i3 NVMe local instance storage is using a PCIe switch, which is not the case. EBS (and the AWS Nitro card) use a PCIe switch, and as such all EBS storage is exposed as e.g. /dev/nvmeXnY . But that's not the same as the i3 instances are offering, so the parent is confused.
98. flaminHotSpeedo ◴[] No.39450455{4}[source]
> there is a reason why vcpu performance is still locked to the typical core from 10 years ago

That is transparently nonsense.

You can disprove that claim in 5 minutes, and it makes literally zero sense for offerings that aren't oversubscribed

99. flaminHotSpeedo ◴[] No.39450483{5}[source]
They're claiming so, but they're wrong.
100. wmf ◴[] No.39450513{6}[source]
It's so obviously wrong that I can't really explain it. Maybe someone else can. To believe that requires a complete misunderstanding of IaaS.
replies(1): >>39462313 #
101. seedless-sensat ◴[] No.39450805{4}[source]
Why are you projecting Google's internal architecture onto AWS? Your Google mental model is not correct here.
102. jandrewrogers ◴[] No.39450828{6}[source]
We were colocated in large data centers right on the major IX with redundancy. All of this was accounted for in their TCO model. We had a better switch fabric than is typical for the cloud but that didn’t materially contribute to cost. We were using AWS for overflow capacity when we exceeded the capacity of our infrastructure at the time; they wanted us to move our primary workload there.

The difference in cost could be attributed mostly to the server hardware build, and to a lesser extent the better scalability with a better network. In this case, we ended up working with Quanta on servers that had everything we needed and nothing we didn’t, optimizing heavily for bandwidth/$. We worked directly with storage manufacturers to find SKUs that stripped out features we didn’t need and optimized for cost per byte given our device write throughput and durability requirements. They all have hundreds of custom SKUs that they don’t publicly list, you just have to ask. A hidden factor is that the software was designed to take advantage of hardware that most enterprises would not deign to use for high-performance applications. There was a bit of supply chain management but we did this as a startup buying not that many units. The final core server configuration cost us just under $8k each delivered, and it outperformed every off-the-shelf server for twice the price and essentially wasn’t something you could purchase in the cloud (and still isn’t). These servers were brilliant, bulletproof, and exceptionally performant for our use case. You can model out the economics of this and the zero-crossing shows up at a lower burn rate than I think many people imagine.

We were extremely effective at using storage, and we did not attach it to expensive, overly-powered servers where the CPUs would have been sitting idle anyway. The sweet spot was low-clock high-core CPUs, which are typically at a low-mid price point but optimal performance-per-dollar if you can effectively scale software to the core count. Since the software architecture was thread-per-core, the core count was not a bottleneck. The economics have not shifted much over time.

AWS uses the same pricing model as everyone else in the server leasing game. Roughly speaking, you model your prices to recover your CapEx in 6 months of utilization. Ignoring overhead, doing it ourselves pulled that closer to 1.5-2 months for the same burn. This moves a lot of the cost structure to things like power, space, and bandwidth. We definitely were paying more for space and power than AWS (usually less for bandwidth) but not nearly enough to offset our huge CapEx advantage relative to workload.

All of this can be modeled out in Excel. No one does it anymore but I am from a time when it was common, so I have that skill in my back pocket. It isn’t nearly as much work as it sounds like, much of the details are formulaic. You do need to have good data on how your workload uses hardware resources to know what to build.

replies(1): >>39451474 #
103. jandrewrogers ◴[] No.39450872{7}[source]
In fairness, there are a not insignificant number of workloads that do not benefit from hardware threads on CPUs [0], instead isolating processes along physical cores for optimal performance.

[0] Assertion not valid for barrel processors.

104. reactordev ◴[] No.39450882{3}[source]
AWS is so large, every concept of hardware is virtualized over a software layer. “Instance storage” is no different. It’s just closer to the edge with your node. It’s not some box in a rack where some AWS tech slots in an SSD. AWS has a hardware layer, but you’ll never see it.
105. rowanG077 ◴[] No.39450902{6}[source]
Why are you acting as if PCIe switches are some secret technology? It was extremely grating for me to read your comments.
replies(2): >>39453547 #>>39453766 #
106. fcsp ◴[] No.39451140{5}[source]
I see wiggle room in the statement you posted, in that the SSD storage that is physically inside the machine hosting the instance might still be mounted into the hypervised instance via some kind of network protocol, adding overhead.
replies(1): >>39459703 #
107. vidarh ◴[] No.39451474{7}[source]
> All of this can be modeled out in Excel. No one does it anymore but I am from a time when it was common, so I have that skill in my back pocket. It isn’t nearly as much work as it sounds like, much of the details are formulaic. You do need to have good data on how your workload uses hardware resources to know what to build.

And this is one of the big "secrets" of AWS's success: shifting a lot of resource allocation and power from people with budgeting responsibility to developers who have usually never seen the budget or accounts, don't keep track, and at most retrospectively get pulled in to explain line items in expenses, and obscuring it (to the point where I know people who've spent six-figure amounts worth of dev time building analytics to figure out where their cloud spend goes... tooling has gotten better but is still awful)

I believe a whole lot of tech stacks would look very different if developers and architects were more directly involved in budgeting, and bonuses etc. were linked at least in part to financial outcomes affected by their technical choices.

A whole lot of claims of low cloud costs come from people who have never done actual comparisons and who seem to have a pathological fear of hardware, even though for most people you don't ever need to touch a physical box yourself - you can get maybe 2/3 of the savings with managed hosting as well.

You don't get the super-customized server builds, but you do get far more choice than with cloud providers, and you can often make up for the lack of fine-grained control by renting/leasing servers somewhere the physical hosting is cheaper (e.g. at a previous employer, what finally made us switch to Hetzner for most new capacity was that while we didn't get exactly the hardware we wanted, we got "close enough", coupled with data centre space in their German locations costing far less than data centre space in London; it didn't make them much cheaper, but it did make them sufficiently cheaper to outweigh the hardware differences by a margin large enough for us to deploy new stuff there while keeping some of our colo footprint)

108. Rudisimo ◴[] No.39453545{4}[source]
That is more than likely a team-specific term being used outside of its context. FYI, the only place where you will find the term <droplet> used is in the public-facing AWS EC2 API documentation, under InstanceTopology:networkNodeSet[^1]. Even that reference seems like a slip of the tongue, but the GP did mention working on the Nitro team, which makes sense when you look at the EC2 instance topology[^2].

[^1]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_I... [^2]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/how-ec2-...

109. stingraycharles ◴[] No.39453547{7}[source]
Because the parent works/worked for Google, so obviously it must be super secret sauce that nobody has heard of. /s

Next up they're going to explain to us that iSCSI wants us to think it's SCSI but it's actually not!

110. the-rc ◴[] No.39453766{7}[source]
Although it used them for years, the first mention by Google of PCIe switches was probably in the 2022 Aquila paper, which doesn't really talk about storage anyway...
replies(1): >>39457173 #
111. rcarmo ◴[] No.39454138{5}[source]
This is the way Azure temporary volumes work as well. They are scrubbed off the hardware once the VM that accesses them is dead. Everything else is over the network.
112. driftnet ◴[] No.39455023{11}[source]
Clouds are water vapor, not droplets.
replies(2): >>39455368 #>>39518545 #
113. jethro_tell ◴[] No.39455138{6}[source]
Heh, you'd probably be surprised. There's some really cool cutting-edge stuff being done in those data centers, but a lot of what is done is just plain old standard server management without much in the way of tricks. It's just that someone else does it instead of you, and the billing department is counting milliseconds.
replies(1): >>39455931 #
114. MichaelZuo ◴[] No.39455152{6}[source]
The argument could also be resolved by just getting the latency numbers for both cases and compare them, on bare metal it shouldn't be more than a few hundred nanoseconds.
115. abadpoli ◴[] No.39455368{12}[source]
“Cloud: Visible mass of liquid droplets or frozen crystals suspended in the atmosphere“

https://en.wikipedia.org/wiki/Cloud

116. icedchai ◴[] No.39455931{7}[source]
Do cloud providers document these internals anywhere? I'd love to read about that sort of thing.
replies(1): >>39456966 #
117. jethro_tell ◴[] No.39456966{8}[source]
Not generally, especially not the super generic stuff. Where they really excel is having the guy that wrote the kernel driver or hypervisor on staff. But a lot of it is just an automated version of what you'd do on a smaller scale
118. rowanG077 ◴[] No.39457173{8}[source]
I don't understand why you would expect Google to state that. They have been standard technology for almost 2 decades. You don't see google claiming they use jtag or using SPI flash or whatever. It's just not special.
replies(1): >>39462921 #
119. dekhn ◴[] No.39457669{6}[source]
Like many other people in this thread, I think we disagree that a PCI switch means that an SSD "is connected over a network" to the host bus.

Now if you can show me two or more hosts connected to a box of SSDs through a PCI switch (and some sort of cool tech for coordinating between the hosts), that's interesting.

120. eek2121 ◴[] No.39459703{6}[source]
At minimum, the entire setup will be virtualized, which does add overhead.
121. asalahli ◴[] No.39462313{7}[source]
GP is probably referring to blurbs like these:

> Amazon SimpleDB measures the machine utilization of each request and charges based on the amount of machine capacity used to complete the particular request (SELECT, GET, PUT, etc.), normalized to the hourly capacity of a circa 2007 1.7 GHz Xeon processor. See below for a more detailed description of how machine utilization charges are calculated.

https://aws.amazon.com/simpledb/pricing/

replies(1): >>39463018 #
122. yolovoe ◴[] No.39462378{5}[source]
Yes, we internally call servers droplets. We have multiple hosts/mobos in the same server these days, so calling them hosts is confusing, and droplet is a really old term here from what I can tell.
123. yolovoe ◴[] No.39462407{7}[source]
It's an internal EC2 term too. We don't use it externally, and I shouldn't have used it, to avoid all this confusion.

Internally, we say droplet instead of host as there are multiple hosts/mobos per droplet these days. It’s no longer true that when you get a metal droplet, you get the entire droplet.

124. yolovoe ◴[] No.39462446{4}[source]
I shouldn’t have said droplet. Like sibling says, that’s our internal name for a “server” and not what we use externally.
125. the-rc ◴[] No.39462921{9}[source]
Google didn't invent the Clos network, either, but it took years before they started talking about its adoption and with what kind of proprietary twists. Same with power supplies. You're right, a PCIe switch is not special, unless maybe it's integrated in some unconventional way. It's in Google's DNA to be cagey by default on a lot of details, to avoid giving ideas to the competition. Or misleading others down rabbit holes, like with shipping container datacenters.
replies(1): >>39466137 #
126. wmf ◴[] No.39463018{8}[source]
SimpleDB is over 15 years old. I guess it's the only service still using "normalized" pricing. Newer services like RDS tell you exactly which processor you're getting and how many cores.
127. sitkack ◴[] No.39466137{10}[source]
No, it dismisses technology until it does a 180 and then pretends it innovated in ways everyone is too stupid to understand. Google exceptionalism 101.
128. singleshot_ ◴[] No.39518545{12}[source]
Clouds are condensed water droplets in the air. The air below the cloud has just about the same amount of water in it, but at the altitude of the bottom of the cloud, the atmosphere is cool enough for that water vapor to condense, forming the cloud.

Search terms include “lapse rate” if you would like to learn more.