Sounds like one more excuse for AWS to obfuscate any meaning in their billing structure and take control of the narrative.
How much are they getting away with through virtualization? (Think of how banks use your money for loans and such.)
You don't actually get to see the internals beyond IOPS, which doesn't help when even that is gatekept.
When I look at cloud, I get to think "finally! No more hardware to manage. No OS to manage". It's the best thing about the cloud, provided your workload is amenable to PaaS. It's great because I don't have to manage Windows or IIS. Microsoft does that part for me and significantly cheaper than it would be to employ me to do that work.
SSDs in the cloud are attached over a network, and fundamentally have to be. The problem is that this network is so large and slow that it can't give you anywhere near the performance of a local SSD. This wasn't a problem for hard drives, which were the backing technology when a lot of these network attached storage systems were invented, because they are fundamentally slow compared to networks, but it is a problem for SSDs.
I am but an end user, but I noticed that disk IO for a certain app was glacial compared to a local test deployment, and I chalked it up to networking/VM overhead
Literally depending on where things are in a data center... if you're closely coupled, on a 10G line on the same switch, going to the same server rack, I bet performance will be so much more consistent.
AWS marketing claims otherwise:
Up to 800K random write IOPS
Up to 1 million random read IOPS
Up to 5600 MB/second of sequential writes
Up to 8000 MB/second of sequential reads
https://aws.amazon.com/blogs/aws/new-storage-optimized-amazo...
Of course then you have a single point of failure, in the PCIe fabric of the machine you're running on if not the NVMe itself. But if you have good backups, which you should, then the juice really isn't worth the squeeze for NAS storage.
RAID helps you locally, but fundamentally relies on locality and low latency (and maybe custom hardware) to minimize the time window where you get true data corruption on a bad disk. That is insufficient for cloud storage.
It basically allows me to forego having to make a server for the CRUD operations so I can focus on the actual business implications. My REST API is automatically managed for me (mostly with lightweight views and functions) and all of my other core logic is either spread out through edge functions or in a separate data store (Redis) where I perform the more CPU intensive operations related to my business.
There are some rough edges around their documentation and DX, but I'm really loving it so far.
The fastest SSDs also tend to be MLC, which tends to have a much lower write life than other technologies. This isn't unusual; increasing data density generally also makes it easier to increase performance. However, it comes at the cost that writes are typically done per block/cell rather than per single bit. So if one cell goes bad, they all fail.
But even if that's not the problem, there is a problem of upgrading the fleet in a cost effective mechanism. When you start introducing new tech into the stack, replacing that tech now requires your datacenters to have 2 different types of hardware on hand AND for the techs swapping drives to have a way to identify and replace that stuff when it goes bad.
Also, the article isn't just about SSDs being no faster than a network. It's about SSDs being two orders of magnitude slower than datacenter networks.
10GbE is about the best you can hope for from a local network these days, but that's 1/5th the bandwidth and many times the latency. 100GbE would work, except the latency would still mean any read dependencies would be far slower than local storage, and I'm not sure there's much to be done about that; at these speeds the physical distance matters.
In practice I'm having to architect the entire system around the SSD just to not bottleneck it. So far ext4 is the only filesystem that even gets close to the SSD's limits, which is a bit of a pity.
What happens when one tenant needs 200TB attached to a server?
Cloud providers are starting to offer local SSD/NVMe, but you're renting the entire machine, and you're still limited to exactly what's installed in that server.
This is the theory that I would bet on because it lines up with their bottom line.
Along with that, ChatGPT has knocked down most of the remaining barriers I have had when permissions get confusing in one of the cloud services.
Samsung 990 Pro 2TB has a latency of 40 μs
DDR4-2133 with CAS 15 has a latency of 14 nanoseconds.
DDR4 latency is 0.035% of one of the fastest SSDs, or to put it another way, DDR4 is 2,857x faster than an SSD.
L1 cache is typically accessible in 4 clock cycles; on a 4.8 GHz CPU like the i7-10700, L1 cache latency is sub-1ns.
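For anyone who wants to sanity-check those ratios, here's a quick back-of-the-envelope calculation using only the figures quoted above (a sketch, not a benchmark):
# Latency comparison using the numbers quoted in this thread (assumed, not measured here)
ssd_ns  = 40_000    # Samsung 990 Pro: ~40 us random access
dram_ns = 14        # DDR4-2133 CAS 15: ~14 ns
l1_ns   = 4 / 4.8   # 4 cycles at 4.8 GHz: ~0.83 ns

print(f"DRAM vs SSD : {ssd_ns / dram_ns:,.0f}x faster")   # ~2,857x
print(f"L1 vs DRAM  : {dram_ns / l1_ns:,.1f}x faster")    # ~17x
print(f"L1 vs SSD   : {ssd_ns / l1_ns:,.0f}x faster")     # ~48,000x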
When you rent a bare metal server, you don't manage your hardware either. The failed parts are replaced for you. Unless you can't figure out what hardware configuration you need - which would be a really big red flag for your level of expertise.
It's no wonder that many people nowadays, esp. those who are so young that they've never experienced anything but cloud instances, seem to have little idea of how much performance you can actually pack in just one or two RUs today. Ultra-fast (I'm not parroting some marketing speak here - I just take a look at IOPS numbers, and compare them to those from highest-end storage some 10-12 years ago) NVMe storage is a big part of that astonishing magic.
- https://cloud.google.com/compute/docs/disks/local-ssd
- https://learn.microsoft.com/en-us/azure/virtual-machines/ena...
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-inst...
They sell these VMs at a higher cost because it requires more expensive components and is limited to host machines with certain configurations. In our experience, it's also harder to request quota increases to get more of these VMs -- some of the public clouds have a limited supply of these specific types of configurations in some regions/zones.
As others have noted, instance storage isn't as dependable. But it can be the most performant way to do IO-intense processing or to power one node of a distributed database.
Networked storage negates that significantly, absolutely killing performance for certain applications. You could have a 100Gbps network and it still won't match a direct-attached SSD in terms of latency (it can only match it in terms of sequential access throughput).
For many applications such as databases, random access is crucial, which is why nowadays mid-range consumer hardware often outperforms hosted databases such as RDS unless they're so overprovisioned on RAM that the dataset is effectively always in there.
We offer faster NVMe drives in instances. Our E4 Dense shapes ship with the SAMSUNG MZWLJ7T6HALA-00AU3, which supports sequential reads of 7,000 MB/s and sequential writes of 3,800 MB/s.
From a general perspective, I would say the likely answer to why AWS doesn't have faster NVMes at the moment is lack of specific demand. That's a guess, but that's generally how things go. If there's not enough specific demand being fed in through TAMs and the like for faster disks, upgrades tend to be more of an afterthought, or to reflect the supply chain.
I know there's a tendency when you engineer things to just work around, or work with, the constraints and grumble amongst your team, but it's incredibly valuable if you can make sure your account manager knows what shortcomings you've had to work around.
Tangent: I remember reading some post called something like "Latency numbers every programmer should know" and being slightly ashamed when I could not internalize it.
However, individual servers may still operate at 10, 25, or 40 Gbps to save cost on the thousands of NICs in a row of racks. Alternatively, servers with multiple 100G connections split that bandwidth allocation up among dozens of VMs so each one gets 1 or 10G.
Also, if I am on an 8-core machine and need 16, network storage can be detached from host A and attached to host B. With dedicated storage it must be fully copied over first.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Instance...
Could that have to do with every operation requiring a round trip, rather than being able to queue up operations in a buffer to saturate throughput?
It seems plausible if the interface protocol was built for a device it assumed was physically local and so waited for confirmation after each operation before performing the next.
In this case it's not so much the throughput rate that matters, but the latency -- which can also be heavily affected by buffering of other network traffic.
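If so, queue depth is exactly the knob: throughput is roughly queue depth divided by per-operation latency (Little's law). A minimal sketch with assumed latencies (40 us local, 500 us networked; both numbers are illustrative):
# Little's law sketch: achievable IOPS ~= queue_depth / latency (illustrative numbers)
def iops(queue_depth, latency_us):
    return queue_depth / (latency_us / 1_000_000)

local_nvme_us = 40     # assumed local NVMe round trip
networked_us  = 500    # assumed network-attached volume round trip

for qd in (1, 32, 256):
    print(f"qd={qd:3d}  local={iops(qd, local_nvme_us):>10,.0f} IOPS"
          f"  networked={iops(qd, networked_us):>10,.0f} IOPS")
# At qd=1 (each operation waits on the previous one) the networked volume is ~12x slower;
# at high queue depth both can saturate the device or link, which is why pure throughput
# benchmarks can look fine while latency-sensitive workloads suffer.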
> Many Amazon EC2 instances can also include storage from devices that are located inside the host computer, referred to as instance storage.
There might be some wiggle room in "physically attached", but there's none in "storage devices located inside the host computer". It's not some kind of AWS-only thing either. GCP has "local SSD disks"[1], which I'm going to claim are likewise local, not over the network block storage. (Though the language isn't as explicit as for AWS.)
Are you referring to PD-SSD? Internal storage usage?
“”
Each storage volume can deliver the following performance (all measured using 4 KiB blocks):
* Up to 8000 MB/second of sequential reads
“”
> However, this does not explain why the read bandwidth is stuck at 2 GB/s.
Faster read speeds would give them a more enticing product without wearing drives out.
There are a bunch of supporting social trends toward this as well. Renewed emphasis on privacy. Big Tech canceling beloved products, bricking devices, and generally enshittifying everything - a lot of people want locally-controlled software that isn't going to get worse at the next update. Ever-rising prices which make people want to lock in a price for the device and not deal with increasing rents for computing power.
I thought cloud was supposed to abstract this away? That's a bit of a sarcastic question from a long-time cloud skeptic, but... wasn't it?
If one CPU attached to storage dies, another can take over and reattach -- or vice-versa. If one network link dies, it can be rerouted around.
The problem is that ultimately your application often requires the outcome of a given IO operation to decide which operation to perform next - let's say when it comes to a database, it should first read the index (and wait for that to complete) before it knows the on-disk location of the actual row data which it needs to be able to issue the next IO operation.
In this case, there's no other solution than to move that application closer to the data itself. Instead of the networked storage node being a dumb blob storage returning bytes, the networked "storage" node is your database itself, returning query results. I believe that's what RDS Aurora does for example, every storage node can itself understand query predicates.
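To put rough numbers on why that matters, here's a small sketch of latency amplification for dependent reads (the per-read latencies are assumptions for illustration, not measurements):
# Dependent reads multiply per-read latency: a lookup touching 4 pages, each read
# depending on the previous one (illustrative latencies)
pages          = 4
local_ssd_us   = 40     # assumed local NVMe read
network_ssd_us = 600    # assumed network-attached volume read

print("local :", pages * local_ssd_us, "us per lookup")     # 160 us
print("remote:", pages * network_ssd_us, "us per lookup")   # 2400 us
# Pushing the query down to the storage node (as Aurora-style designs do) collapses the
# chain into one network round trip plus local reads on the storage side.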
Here's referral link with free credits: https://upcloud.com/signup/?promo=J3JYWZ
Otherwise it'd stay on the same physical piece of hardware it was allocated to when new.
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
loop0 7:0 0 24.9M 1 loop /snap/amazon-ssm-agent/7628
loop1 7:1 0 55.7M 1 loop /snap/core18/2812
loop2 7:2 0 63.5M 1 loop /snap/core20/2015
loop3 7:3 0 111.9M 1 loop /snap/lxd/24322
loop4 7:4 0 40.9M 1 loop /snap/snapd/20290
nvme0n1 259:0 0 8G 0 disk
├─nvme0n1p1 259:1 0 7.9G 0 part /
├─nvme0n1p14 259:2 0 4M 0 part
└─nvme0n1p15 259:3 0 106M 0 part /boot/efi
nvme2n1 259:4 0 3.4T 0 disk
nvme4n1 259:5 0 3.4T 0 disk
nvme1n1 259:6 0 3.4T 0 disk
nvme5n1 259:7 0 3.4T 0 disk
nvme7n1 259:8 0 3.4T 0 disk
nvme6n1 259:9 0 3.4T 0 disk
nvme3n1 259:10 0 3.4T 0 disk
nvme8n1 259:11 0 3.4T 0 disk
Since nvme0n1 is the EBS boot volume, we have 8 SSDs. And here's the read bandwidth for one of them:
$ sudo fio --name=bla --filename=/dev/nvme2n1 --rw=read --iodepth=128 --ioengine=libaio --direct=1 --blocksize=16m
bla: (g=0): rw=read, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=libaio, iodepth=128
fio-3.28
Starting 1 process
^Cbs: 1 (f=1): [R(1)][0.5%][r=2704MiB/s][r=169 IOPS][eta 20m:17s]
So we should have a total bandwidth of 2.7*8 = 21 GB/s. Not that great for 2024.
"Hardware degradation detected, please turn it off and back on again"
I could do a migration with zero downtime in VMware for a decade but they can't seamlessly move my VM to a machine that works in 2024? Great, thanks. Amusing.
Unfortunately very few actually think about failure modes, set realistic targets, and actually test the process. Everyone thinks they need 100% uptime and consistency, few actually achieve it in practice (many think they do, but when shit hits the fan it uncovers an edge-case they haven't thought of), but it turns out that in most cases it doesn't matter and they could've saved themselves a lot of trouble and complexity.
*As implemented in the public cloud providers.
You can absolutely get better than local disk speeds from SAN devices and we've been doing it for decades. To do it on-prem with flash devices will require NVMe over FC or Ethernet and an appropriate storage array. Modern all-flash array performance is measured in millions of IOPS.
Will there be a slight uptick in latency? Sure, but it's well worth it for the data services and capacity of an external array for nearly every workload.
So not only do you spend time on the wrong thing, you don't even know how it works. And the provider's goals are not aligned either, as all they care about is locking you in.
How is that better?
> we are still stuck with 2 GB/s per SSD
Versus the ~2.7 GiB/s your benchmark shows (bit hard to know where to look on mobile with all that line-wrapped output, and when not familiar with the fio tool; not your fault but that's why I'm double checking my conclusion)
You need to migrate that data if you replace an entire server, but this usually isn’t a very big deal.
But they also tell you how many IOPS you get: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/stora...
Search for i3en API names and you'll see:
i3en.large, 2x CPU, 1250GB SSD
i3en.xlarge, 4x CPU, 2500GB SSD
i3en.2xlarge, 8x CPU, 2x2500GB SSD
i3en.3xlarge, 12x CPU, 7500GB SSD
i3en.6xlarge, 24x CPU, 2x7500GB SSD
i3en.12xlarge, 48x CPU, 4x7500GB SSD
i3en.24xlarge, 96x CPU, 8x7500GB SSD
i3en.metal, 96x CPU, 8x7500GB SSD
So they've got servers with 96 CPUs and 8x7500GB SSDs. You can get a slice of one, or you can get the whole one. All of these have the same ratio: 625GB of local SSD per CPU core.
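A quick check of that ratio from the sizes listed above (just arithmetic on the published numbers):
# SSD-per-vCPU ratio across the i3en sizes listed above
sizes = {
    "i3en.large":    (2,  1250),
    "i3en.xlarge":   (4,  2500),
    "i3en.2xlarge":  (8,  2 * 2500),
    "i3en.3xlarge":  (12, 7500),
    "i3en.24xlarge": (96, 8 * 7500),
}
for name, (vcpu, ssd_gb) in sizes.items():
    print(f"{name:15s} {ssd_gb / vcpu:.0f} GB of local SSD per vCPU")   # 625 for every size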
On GCP you can get a 2-core N2 instance type and attach multiple local SSDs. I doubt they have many physical 2-core Xeons in their datacenters.
Yeah, you can skip a lot of that if your goal is to get a server online as cheaply as possible, reliability be damned. As soon as you start caring about keeping it in a business-ready state, costs start to skyrocket.
I've worn the sysadmin hat. If AWS burned down, I'd be ready and willing to recreate the important parts locally so that my company could stay in business. But wow, would they ever be in for some sticker shock.
So there is demand, but I'm certainly not interested in paying many multiples of 50 euros over an expected lifespan of a few years, so it may not make economic sense for them to offer it to users like me at least. On the other hand, for the couple hours this should have taken (rather than the days it initially did), I'd certainly have been willing to pay that cloud premium and that's why I tried to get me one of these allegedly SSD-backed VPSes... but now that I have a fast system permanently, I don't think that was a wise decision of past me
The self-hosted versions of this are GlusterFS and Ceph, which have the same dynamics as EBS and its equivalents in other cloud providers.
The same VM is not allocated for a variety of reasons: scheduled maintenance, proximity to other hosts on the VPC, balancing quiet and noisy neighbors, and so on.
It is not that the disk will always be wiped; sometimes the data is still there on reboot. It's just that there is no guarantee, which allows them to freely move VMs between hosts.
You should also see how they enforce similar things for their other products and APIs, for example, most of their services have encrypted pagination tokens.
The amount of complexity the architecture has because of those constraints is insane.
When I worked at my previous job, management kept asking for that scale of designs for less than 1/1000 of the throughput and I was constantly pushing back. There's real costs to building for more scale than you need. It's not as simple as just tweaking a few things.
To me there's a couple of big breakpoints in scale:
* When you can run on a single server
* When you need to run on a single server, but with HA redundancies
* When you have to scale beyond a single server
* When you have to adapt your scale to deal with the limits of a distributed system, i.e. designing for DynamoDB's partition limits.
Each step in that chain adds irrevocable complexity, adds to OE, adds to cost to run and cost to build. Be sure you have to take those steps before you decide to.
Distributed clusters of things can work just fine on ephemeral local storage (aka instance storage). A Kafka cluster or an OpenSearch cluster will be fine using instance-local storage, for instance.
As with everything else.... "it depends"
Certainly true that SSD bandwidth and latency improvements are hard to match, but I don't understand why intra-datacenter network latency in particular is so bad. This ~2020-I-think version of the "Latency Numbers Everyone Should Know" says 0.5 ms round trip (and mentions "10 Gbps network" on another line). [1] It was the same thing in a 2012 version (that only mentions "1 Gbps network"). [2] Why no improvement? I think that 2020 version might have been a bit conservative on this line, and nice datacenters may even have multiple 100 Gbit/sec NICs per machine in 2024, but still I think the round trip actually is strangely bad.
I've seen experimental networking stuff (e.g. RDMA) that claims significantly better latency, so I don't think it's a physical limitation of the networking gear but rather something at the machine/OS interaction area. I would design large distributed systems significantly differently (be much more excited about extra tiers in my stack) if the standard RPC system offered say 10 µs typical round trip latency.
[1] https://static.googleusercontent.com/media/sre.google/en//st...
There probably should be more local instance storage types for use with instances that can be recreated without loss. But it is simpler for them to have a single way of doing things.
At work, someone used fast NVMe instance storage for Clickhouse which is a database. It was a huge hassle to copy data when instances were going to be restarted because the data would be lost.
At least I have the impression they are lagging, e.g., still offering things like:
z1d: Skylake (2017) https://aws.amazon.com/ec2/instance-types/z1d/
x2i: Cascade Lake (2019) and Ice Lake (2021) https://aws.amazon.com/ec2/instance-types/x2i/
I have not been able to find instances powered by the 4th (Q1 2023) or 5th generation (Q4 2023) Xeons.
We solve large capacity expansion power market models that need the fastest single-threaded performance possible coupled with lots of RAM (a 32:1 ratio or higher is ideal). One model may take 256-512 GB of RAM while not being able to use more than 4 threads effectively (interior point algorithms have sharply diminishing returns past this point).
Our dispatch models do not have the same RAM requirement, but you still wish to have the fastest single-threaded processors available (and then parallelize)
No they don't. I work for a cloud provider and I can guarantee that your SSD is local to your VM.
I suspect you're thinking of SSD-PD. If "local" SSDs are not actually local and go through a network, I need to have a discussion with my GCS TAM about truth in advertising.
There's a middle-ground between cloud and colocation. There are plenty of providers such as OVH, Hetzner, Equinix, etc which will do all of the above for you.
I suspect the answer is something to do with their manufacturing processes/rack designs. When I worked there (pre GCP) machines had only a tiny disk used for booting and they wanted to get rid of that. Storage was handled by "diskful" machines that had dedicated trays of HDDs connected to their motherboards. If your datacenters and manufacturing processes are optimized for building machines that are either compute or storage but not both, perhaps the more normal cloud model is hard to support and that pushes you towards trying to aggregate storage even for "local" SSD or something.
At the same time, Ethernet networks with layered network protocols on top typically have a fair amount of latency overhead, which makes them much slower than bus-based direct-host-attached storage. I was definitely impressed at how quickly SSDs reached and then exceeded SATA bandwidth. NVMe has made a HUGE difference here.
> nice datacenters may even have multiple 100 Gbit/sec NICs per machine in 2024,
Google exceeded 100Gbps per machine long before 2024. IIRC it had been 400Gbps for a while.
As a hint for you, I said "a network", not "the network." You can also look at public presentations about how Nitro works.
Actual latency with standard Stubby-over-TCP and warmed channels...it's been a while, so I don't remember the number I observed, but I remember it wasn't that much better than 0.5 ms. It was still bad enough that I didn't want to add a tier that would have helped with isolation in a particularly high-reliability system.
you could easily have like 40GB/s with some over provisioning / bucketing
We must be using different clouds.
For some of the much higher-level services … maybe some semblance of that statement holds. But for VMs? Definitely not "no OS to manage" … the OS is usually on the customer. There might be OS-level agents from your cloud of choice that make certain operations easier … but I'm still on the hook for updates.
Even "No machine" is a stretch, though I've found this is much more dependent on cloud. AWS typically notices failures before I do, and by the time I notice something is up, the VM has been migrated to a new host and I'm none the wiser sans the reboot that cost. But other clouds I've been less lucky with: we've caught host failures well before the cloud provider, to an extent where I've wished there was a "vote of no confidence" API call I could make to say "give me new HW, and I personally think this HW is suss".
Even on higher level services like RDS, or S3, I've noticed failures prior to AWS … or even to the extent that I don't know that AWS would have noticed those failures unless we had opened the support ticket. (E.g., in the S3 case, even though we clearly reported the problem, and the problem was occurring on basically every request, we still had to provide example request IDs before they'd believe us. The service was basically in an outage as far as we could tell … though I think AWS ended up claiming it was "just us".)
That said, S3 in particular is still an excellent service, and I'd happily use it again. But cloud == 0 time on my part? How much time it takes depends heavily on the cloud, and less heavily on the service; and sometimes, it is still worthwhile.
> The Nitro Cards are physically connected to the system main board and its processors via PCIe, but are otherwise logically isolated from the system main board that runs customer workloads.
https://docs.aws.amazon.com/whitepapers/latest/security-desi...
> In order to make the [SSD] devices last as long as possible, the firmware is responsible for a process known as wear leveling.... There’s some housekeeping (a form of garbage collection) involved in this process, and garden-variety SSDs can slow down (creating latency spikes) at unpredictable times when dealing with a barrage of writes. We also took advantage of our database expertise and built a very sophisticated, power-fail-safe journal-based database into the SSD firmware.
https://aws.amazon.com/blogs/aws/aws-nitro-ssd-high-performa...
This firmware layer seems like a good candidate for the slowdown.
SANs can still be quite fast, and instance storage is fast, both of which are available in cloud providers
If you want long term local storage you'll have to reserve an instance host.
Amazon offers both locally-attached storage devices as well as instance-attached storage devices. The article is about the latter kind.
When you say RAID, what level? Software-raid or hardware raid? What controller?
Let's take best-case:
RAID10, small enough (but many) NVMe drives, and LVM/software RAID like ZFS, which is data-aware so it only rebuilds actual data: rebuilds can still degrade performance enough that your application becomes unavailable if your IOPS are at 70%+ of maximum.
That's an ideal scenario. If you use hardware RAID, which is not data-aware, then your rebuild times depend entirely on the size of the drive being rebuilt, and it can punish IOPS even more during the rebuild. But it will affect your CPU less.
There's no panacea. Most people opt for higher latency distributed storage where the RAID is spread across an enormous amount of drives, which makes rebuilds much less painful.
This said the problem can get more complex than this really fast. Write barriers for example and dirty caches. Any application that forces writes and the writes are enforced by the kernel are going to suffer.
The same is true for SSD settings. There are a number of tweakable values on SSDs when it comes to write commit and cache usage which can affect performance. Desktop OSes tend to play more fast and loose with these settings, and server defaults tend to be more conservative.
If that's what's meant, it will be stated in some fine print; if it's not stated anywhere, then there is no guarantee of what the term means, except I would guess they may want people to infer things that may not necessarily be true.
Then stuff goes away, or routes get congested, etc, etc, etc.
Link to this mythical hosting service that expects far less than 200TB of data per client but just pulls a sad face and takes the extra cost on board when a client demands it. :D
The other thing to note is that big inter-DC links are heavily QoS'd and contended, because they are both expensive and a bollock to maintain.
Also, from what I recall, 40gig links are just parallel 10 gig links, so they have no lower latency. I'm not sure if 100/400 gig links are ten/forty lanes of ten gigs in parallel or actually able to issue packets at 10/40 times the rate of a ten gig link. I've been away from networking too long.
If GitHub can afford the amount of downtime they have, it's likely that your business can afford 15 minutes of downtime every once in a while due to a failing server.
Also, the fewer servers you have overall, the less common a failure will be.
Backups and a cold failover server are mandatory, but anything past that should be weighed on a rational cost/benefit analysis, and for most people the cost/benefit ratio just isn't enough to justify the infrastructure complexity.
This post on how Discord RAIDed local NVMe volumes with slower remote volumes might be of interest: https://discord.com/blog/how-discord-supercharges-network-di...
Even a very unoptimized application running on a dev laptop can serve 1Gbps nowadays without issues.
So what are the constraints that demand a complex architecture?
- Azure network latency is about 85 microseconds.
- AWS network latency is about 55 microseconds.
- Both can do better, but only in special circumstances such as RDMA NICs in HPC clusters.
- Cross-VPC or cross-VNET is basically identical. Some people were saying it's terribly slow, but I didn't see that in my tests.
- Cross-zone is 300-1200 microseconds due to the inescapable speed of light delay.
- VM-to-VM bandwidth is over 10 Gbps (>1 GB/s) for both clouds, even for the smallest two vCPU VMs!
- Azure Premium SSD v1 latency varies between about 800 to 3,000 microseconds, which is many times worse than the network latency.
- Azure Premium SSD v2 latency is about 400 to 2,000 microseconds, which isn't that much better, because:
- Local SSD caches in Azure are so much faster than remote disk that we found that Premium SSD v1 is almost always faster than Premium SSD v2 because the latter doesn't support caching.
- Again in Azure, the local SSD "cache" and also the local "temp disks" both have latency as low as 40 microseconds, on par with a modern laptop NVMe drive. We found that switching to the latest-gen VM SKU and turning on "read caching" for the data disks was the magic "go-fast" button for databases... without the risk of losing our data.
We investigated the various local-SSD VM SKUs in both clouds such as the Lasv3 series, and as the article mentioned, the performance delta didn't blow my skirt up, but the data loss risk made these not worth the hassle.
Cloud does not do anything else.
None of these latency/speed problems are cloud-specific. If you have on-premise servers and you are storing your data on network-attached storage, you have the exact same problems (and also the same advantages).
Unfortunately the gap between local and network storage is wide. You win some, you lose some.
> but I remember it wasn't that much better than 0.5 ms.
If you and I still worked at Google I'd just give you an automon dashboard link showing latency an order of magnitude better than that to prove myself…
You've provided cryptic hints and a suggestion to watch some unnamed presentation.
At this point I really think the burden of proof is on you.
In my experience this is also orders of magnitude slower than true direct access, i.e. PCIe pass-through, as all access has to pass through the VM storage driver, and so could explain what is happening.
Physically attached for servers, for the past 20+ years, has meant a direct electrical connection to a host bus (such as the PCI bus attached to the front-side bus). I'd like to see some alternative examples that violate that convention.
The demand for five-nines is greatly exaggerated.
HDD was 10ms, which was noticeable for a cached network request that needs to go back out on the wire. It was also bottlenecked by IOPS; after 100-150 IOPS you were done. You could do a bit better with RAID, but not the 2-3 orders of magnitude you really needed to be an effective cache. So it just couldn't work as a serious cache; the next step up was RAM. This is the operational environment in which Redis and similar memory caches evolved.
40 us latency is fine for caching. Even the high load 500-600us latency is fine for the network request cache purpose. You can buy individual drives with > 1 million read IOPS. Plenty for a good cache. HDD couldn't fit the bill for the above reasons. RAM is faster, no question, but the lower latency of the RAM over the SSD isn't really helping performance here as the network latency is dominating.
Rails conference 2023 has a talk that mentions this. They moved from a memory based cache system to an SSD based cache system. The Redis RAM based system latency was 0.8ms and the SSD based system was 1.2ms for some known system. Which is fine. It saves you a couple of orders of magnitude on cost and you can do much much larger and more aggressive caching with the extra space.
Often times these RAM caching servers are a network hop away anyway, or at least a loopback TCP request. Making the question of comparing SSD latency to RAM totally irrelevant.
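Rough arithmetic on why the network hop swamps the media difference (the round-trip and lookup times below are assumptions chosen to line up with the 0.8 ms / 1.2 ms figures above):
# Why RAM-vs-SSD stops mattering once a network hop is involved (assumed figures)
rpc_us     = 500   # network round trip to the cache server
ram_hit_us = 300   # in-memory (Redis-style) lookup, give or take
ssd_hit_us = 700   # SSD-backed cache lookup

print("RAM cache over network:", rpc_us + ram_hit_us, "us")   # ~0.8 ms
print("SSD cache over network:", rpc_us + ssd_hit_us, "us")   # ~1.2 ms
# The end-to-end delta is ~1.5x, not the ~2,800x raw-media difference, while the SSD
# tier is orders of magnitude cheaper per GB.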
Can you expound?
> If you and I still worked at Google I'd just give you an automon dashboard link showing latency an order of magnitude better than that to prove myself…
I believe you, and I think in principle we should all be getting the 50 µs latency you're describing within a datacenter with no special effort.
...but it doesn't match what I observed, and I'm not sure why. Maybe difference of a couple years. Maybe I was checking somewhere with older equipment, or some important config difference in our tests. And obviously my memory's a bit fuzzy by now but I know I didn't like the result I got.
Nope, some require a full backup and restore.
I'm not an insider and don't have any exclusive knowledge, but from reading a lot about the topic my impression is that the issue in both clouds is the virtualization overheads.
That is, having the networking or storage go through any hypervisor software layer is what kills the performance. I've seen similar numbers with on-prem VMware, Xen, and Nutanix setups as well.
Both clouds appear to be working on next-generation VM SKUs where the hypervisor network and storage functions are offloaded into 100% hardware, either into FPGAs or custom ASICs.
"Azure Boost" is Microsoft's marketing name for this, and it basically amounts to both local and remote disks going through an NVMe controller directly mapped into the memory space of the VM. That is, the VM OS kernel talks directly to the hardware, bypassing the hypervisor completely. This is shown in their documentation diagrams: https://learn.microsoft.com/en-us/azure/azure-boost/overview
They're claiming up to 3.8M IOPS for a single VM, which is 3-10x what you'd get out of a single NVMe SSD stick, so... not too shabby at all!
Similarly, Microsoft Azure Network Adapter (MANA) is the equivalent for the NIC, which will similarly connect the VM OS directly into the network, bypassing the hypervisor software.
I'm not an AWS expert, but from what I've seen they've been working on similar tech (Nitro) for years.
Sigh. This old trope from ancient history in internet time.
> Yes, you can probably buy a server for less than the yearly rent on the equivalent EC2 instance.
Or a monthly bill... I can oft times buy a higher performing server for the cost of a rental for a single month.
> But then you've got to put that server somewhere, with reliable power and probably redundant Internet connections
Power:
The power problem is a lot lower with modern systems because they can use a lot less of it per unit of compute/memory/disk performance. Idle power has improved a lot too. You don't need 700 watts of server power anymore for a 2 socket 8 core monster that is outclassed by a modern $400 mini-pc that maxes out at 45 watts.
You can buy server rack batteries now in a modern chemistry that'll go 20 years with zero maintenance. A 4U-sized 5 kWh unit costs $1,000-1,500. EVs have pushed battery costs down a LOT. How much do you really need? Do you even need a generator if your battery just carries the day? Even if your power reliability totally sucks?
Network:
Never been easier to buy network transfer. Fiber is available in many places, even cable speeds are well beyond the past, and there's starlink if you want to be fully resistant to local power issues. Sure, get two vendors for redundancy. Then you can hit cloud-style uptimes out of your closet.
Overlay networks like tailscale make the networking issues within the reach of almost anyone.
> Yeah, you can skip a lot of that if your goal is to get a server online as cheaply as possible, reliability be damned
Google cut its teeth with cheap consumer class white box computers when "best practice" of the day was to buy expensive server class hardware. It's a tried and true method of bootstrapping.
> You have to maintain an inventory of spares, and pay someone to swap it out if it breaks. You have to pay to put its backups somewhere.
Have you seen the size of M.2 sticks? Memory sticks? They aren't very big... I happened to like opening up systems and actually touching the hardware I use.
But yeah, if you just can't make it work or can't be bothered in the modern era of computing, then stick with the cloud and the 10-100x premium they charge for their services.
> I've worn the sysadmin hat. If AWS burned down, I'd be ready and willing to recreate the important parts locally so that my company could stay in business. But wow, would they ever be in for some sticker shock.
Nice. But I don't think it costs as much as you think. If you run apps on the stuff you rent and then compare it to your own hardware, it's night and day.
There is a reason why vCPU performance is still pegged to the typical core from 10 years ago, when every core on a machine in those data centers today is 3-5x faster or more. It's because they can charge you for 5x the cores to get that gain.
Last time (about a year ago) I ran a couple of random IO benchmarks against storage optimized instances, and the random IOPS behavior was closer to a large spinning RAID array than to SSDs once the disk size is over some threshold.
IIRC, what it looked like is that there is a fast local SSD cache with a couple hundred GB of storage, and the rest is backed by remote spinning media.
It's one of the many reasons I have a hard time taking cloud optimization seriously; the lack of direct tiering controls means that database-style workloads are not going to optimize well, and that will end up costing a lot of $$$$$.
So, maybe it was the instance types/configuration I was using, but <shrug> it was just something I was testing in passing.
# nvme format /dev/nvme1 -n1 -f
NVMe status: INVALID_OPCODE: The associated command opcode field is not valid(0x2001)
# nvme id-ctrl /dev/nvme1 | grep oacs
oacs : 0
but the LBA format indeed is sus:
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)
1: Which is really a cloud without a certain degree of software defined networking/compute/storage/whatever.
LVM/block like you suggest is a good idea. You'd be surprised how much access time is trimmed by skipping another filesystem like you'd have with a raw image file
https://aws.amazon.com/about-aws/whats-new/2022/11/introduci...
# fio --name=read_iops_test --filename=/dev/nvme1n1 --filesize=1500G --time_based --ramp_time=1s --runtime=15s --ioengine=io_uring --fixedbufs --direct=1 --verify=0 --randrepeat=0 --bs=4K --iodepth=256 --rw=randread --iodepth_batch_submit=256 --iodepth_batch_complete_max=256
read_iops_test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=256
fio-3.32
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=2082MiB/s][r=533k IOPS][eta 00m:00s]
read_iops_test: (groupid=0, jobs=1): err= 0: pid=34235: Tue Feb 20 22:57:00 2024
read: IOPS=534k, BW=2086MiB/s (2187MB/s)(30.6GiB/15001msec)
slat (nsec): min=713, max=255840, avg=31174.74, stdev=16248.45
clat (nsec): min=1419, max=1175.6k, avg=443782.26, stdev=277389.66
lat (usec): min=133, max=1240, avg=474.96, stdev=274.50
clat percentiles (usec):
| 1.00th=[ 169], 5.00th=[ 198], 10.00th=[ 217], 20.00th=[ 243],
| 30.00th=[ 265], 40.00th=[ 285], 50.00th=[ 306], 60.00th=[ 334],
| 70.00th=[ 396], 80.00th=[ 865], 90.00th=[ 922], 95.00th=[ 947],
| 99.00th=[ 996], 99.50th=[ 1012], 99.90th=[ 1045], 99.95th=[ 1057],
| 99.99th=[ 1074]
bw ( MiB/s): min= 2080, max= 2092, per=100.00%, avg=2086.72, stdev= 2.35, samples=30
iops : min=532548, max=535738, avg=534199.13, stdev=601.82, samples=30
lat (usec) : 2=0.01%, 100=0.01%, 250=23.06%, 500=50.90%, 750=0.28%
lat (usec) : 1000=24.90%
lat (msec) : 2=0.87%
cpu : usr=14.17%, sys=67.83%, ctx=156851, majf=0, minf=37
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=7.8%, 8=11.3%, 16=39.7%, 32=30.6%, 64=10.5%, >=64=0.1%
complete : 0=0.0%, 4=5.3%, 8=9.5%, 16=40.3%, 32=32.4%, 64=12.4%, >=64=0.1%
issued rwts: total=8010661,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=2086MiB/s (2187MB/s), 2086MiB/s-2086MiB/s (2187MB/s-2187MB/s), io=30.6GiB (32.8GB), run=15001-15001msec
Disk stats (read/write):
nvme1n1: ios=8542481/0, merge=0/0, ticks=3822266/0, in_queue=3822266, util=99.37%
tldr: random 4k reads pretty much saturate the available 2GB/s bandwidth (this is on m6id)
On the contrary, young people often show up having learned on their super fast Apple SSD or a top of the line gaming machine with an NVMe SSD.
Many know what hardware can do. There’s no need to dunk on young people.
Anyway, the cloud performance realities are well known to anyone who works in cloud performance. It's part of the game and it's learned by anyone scaling a system. It doesn't really matter what you could do if you built a couple of RUs yourself and hauled them down to the data center, because beyond simple single-purpose applications with flexible uptime requirements, that's not a realistic option.
Thanks for the info!
The problem is that this is not possible when the next IO request depends on the result of a previous one, like in a database where you must first read the index to know the location of the row data itself.
It might not help much with IOPS though. Amazing that we have PCIe 5.0 at 16GB/s and are already so near the theoretical max (some lost to overhead), even on consumer cards.
Going enterprise for the drive-writes-per-day (DWPD) is 100% worth it for most folks, but I am morbidly curious how different the performance profile would be running enterprise vs non-enterprise these days. But reciprocally, the high DWPD drives (Kioxia CD8P-V for example has a DWPD of 3) seem to often come with somewhat milder sustained 4k write IOPS, making me think maybe there's a speed vs reliability tradeoff that could be taken advantage of from consumer drives in some cases; not sure who wants tons of IOPS but doesn't actually intend to hit their Total Drive Writes, but it saves you some IOPS/$ if so. That said, I'm shocked to see the enterprise premium is a lot less absurd than it used to be! (If you can find stock.)
* Reading/fetching the data - usernames, phone number, message, etc.
* Generating the content for each message - it might be custom per person
* This is using a 3rd party API that might take anywhere from 100ms to 2s to respond, and you need to leave a connection open.
* Retries on errors, rescheduling, backoffs
* At least once or at most once sends? Each has tradeoffs
* Stopping/starting that many messages at any time
* Rate limits on some services you might be using alongside your service (network gateway, database, etc)
* Recordkeeping - did the message send? When?
A problem in applications such as databases is when the outcome of an IO operation is required to initiate the next one - for example, you must first read an index to know the on-disk location of the actual row data. This is where the higher latency absolutely tanks performance.
A solution could be to make the storage drives smarter - have an NVME command that could say like "search in between this range for this byte pattern" and one that can say "use the outcome of the previous command as a the start address and read N bytes from there". This could help speed up the aforementioned scenario (effectively your drive will do the index scan & row retrieval for you), but would require cooperation between the application, the filesystem and the encryption system (typical, current FDE would break this).
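A toy simulation of the difference (purely illustrative; no such NVMe pushdown commands exist today, and the latencies are assumptions):
# "device" stands in for a network-attached block device; latencies are assumed
NETWORK_US, DEVICE_US = 500, 40

device = {0: b"key -> lba 7", 7: b"row data"}   # block 0 = index page, block 7 = row

def read_block(lba):
    # one host-visible round trip: network + device
    return device[lba], NETWORK_US + DEVICE_US

# Today: the host chases the pointer itself and pays two dependent round trips
index_page, t1 = read_block(0)
row, t2        = read_block(7)                  # LBA 7 was learned from the index page
print("host-side pointer chase :", t1 + t2, "us")

# Hypothetical pushdown: the drive resolves the index internally, so one network
# round trip plus two internal device reads
print("on-device pointer chase :", NETWORK_US + 2 * DEVICE_US, "us")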
Maybe TCO still favors bare-metal but you have to spend a lot of time on configuration.
[1] https://www.reddit.com/r/hetzner/comments/rjuzcs/securing_ne...
The success that VCs are after is when your customer base doubles every month. Better yet, every week. Having a reasonably scalable infra at the start ensures that a success won't kill you.
Of course, the chances of a runaway success like this are slim, so 99% or more startups overbuild, given their resulting customer base. But it's like 99% or more pilots who put on a parachute don't end up using it; the whole point is the small minority who do, and you never know.
For a stable, predictable, medium-scale business it may make total sense to have a few dedicated physical boxes and run their whole operation from them comfortably, for a fraction of cloud costs. But starting with it is more expensive than starting with a cloud, because you immediately need an SRE, or two.
You can get similar results by looking at comparisons between DPDK and kernel networking. Most of the usual gap comes from not needing to context-switch for kernel interrupt handling, zero-copy abstractions, and busy polling (wherein you trade CPU for lower latency instead of sleeping between iterations if there's no work to be done).
https://talawah.io/blog/linux-kernel-vs-dpdk-http-performanc... goes into some amount of detail comparing request throughput of an unoptimized kernel networking stack, optimized kernel networking stack, and DPDK. I'm not aware of any benchmarks (public or private) comparing Snap vs DPDK vs Linux, so that's probably as close as you'll get.
Yes, this is often a big surprise. You can test out some disk-heavy app locally on your laptop and observe decent performance, and then have your day completely ruined when you provision a slice of an NVMe SSD instance type (like, i4i.2xlarge) and discover you're only paying for SATA SSD performance.
- 2011 X-25E 64GB (2W write and almost nothing read/idle) at 100,000 writes per bit for the OS
- 2021 PM897 3.7TB (2.3 Watt (read) ¦ 3 Watt (write) ¦ 1.4 Watt (idle) down from the PM983 (8.7 Watt (read) ¦ 10.6 Watt (write) ¦ 4 Watt (idle)) for DB.
This way I can get the most robust solution, with the largest DB at the lowest power. They are both on an 8-core Atom Mini-ITX board at 25W TDP.
Which is still not bad, when I was shopping around in 2018 no money could buy performance comparable to a locally-attached NVMe in a more professional/datacenter-ready form.
[1] https://media-www.micron.com/-/media/client/global/documents...
[2] https://www.marvell.com/content/dam/marvell/en/public-collat...
I have no idea how AWS run their VMs, was just saying a slow storage driver could give such results.
Ah: Oracle cloud infra
https://blogs.oracle.com/cloud-infrastructure/post/announcin...
I keep forgetting Oracle is in the cloud business too.
"Make it rain", I guess :)
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
Oh, absolutely - not to contest that! There's a whole lot of academia on 'para-virtualized' and so on in this light.
That's interesting to hear about FreeBSD; basically all of my experience has been with Linux/Windows.
This matters when the DB calls a sync and it's expecting the data to be written safely to disk before it returns.
A consumer drive basically stops everything until it can report success, and your IOPS fall to like 1/100th of what the drive is capable of if this is happening a lot.
An enterprise drive with plp will just report success knowing it has the power to finish the pending writes. Full speed ahead.
You can "lie" to the process at the VPS level by enabling unsafe write back cache. You can do it at the OS level by launching the DB with "eatmydata". You will get the full performance of your SSD.
In the event of power loss you may well end up in an unrecoverable corrupted condition with these enabled.
I believe that if you buy all consumer parts, an enterprise drive is the best place to spend your money profitably on an enterprise bit.
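If you want to see how large the sync penalty described above is on your own hardware, here's a minimal sketch (run it against a scratch file only; results will differ wildly between consumer and PLP-equipped drives):
# Compare fsync-per-write (what a database commit path effectively does) vs buffered writes
import os, time

def writes_per_sec(path, n=200, sync=True):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, b"x" * 4096)
        if sync:
            os.fsync(fd)   # force the drive to persist before continuing
    os.close(fd)
    return n / (time.perf_counter() - start)

print("fsync every write :", int(writes_per_sec("scratch.bin", sync=True)), "writes/s")
print("no fsync (unsafe) :", int(writes_per_sec("scratch.bin", sync=False)), "writes/s")
# On a consumer drive without power-loss protection the first number collapses;
# a PLP-equipped enterprise drive can acknowledge the flush almost immediately.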
I frequently hear this point expressed in cloud vs colo debates. The notion that you can't achieve high availability with simple colo deploys is just nonsense.
Two colo deploys in two geographically distinct datacenters, two active physical servers with identical builds (RAIDed drives, dual NICs, A+B power) in both datacenters, a third server racked up just sitting as a cold spare, pick your favorite container orchestration scheme, rig up your database replication, script the database failover activation process, add HAProxy (or use whatever built-in scheme your orchestration system offers), sprinkle in a cloud service for DNS load balancing/failover (Cloudflare or AWS Route 53), automate and store backups off-site and you're done.
Yes it's a lot of work, but so is configuring a similar level of redundancy and high availability in AWS. I've done it both ways and I prefer the bare metal colo approach. With colo you get vastly more bang for your buck and when things go wrong, you have a greater ability to get hands on, understand exactly what's going on and fix it immediately.
Look at the big successes such as youtube, twitter, facebook, airbnb, lyft, google, yahoo - exactly zero of them did this preventatively. Even altavista and babelfish, done by DEC and running on Alphas, which they had plenty of, had to be redone multiple times due to growth. Heck, look at the first 5 years of Amazon. AWS was initially ideated in a contract job for Target.
Address the immediate and real needs and business cases, not pie in the sky aspirations of global dominance - wait until it becomes a need and then do it.
The chances of getting there are only reasonable if you move instead of plan, otherwise you'll miss the window and product opportunity.
I know it ruffles your engineering feathers - that's one of the reasons most attempts at building these things fails. The best ways feel wrong, are counterintuitive and are incidentally often executed by young college kids who don't know any better. It's why successful tech founders tend to be inexperienced; it can actually be advantageous if they make the right "mistakes".
Forget about any supposedly inevitable disaster until it's actually affecting your numbers. I know it's hard but the most controllable difference between success and failure in the startup space is in the behavioral patterns of the stakeholders.
Those i3 instances lose your data whenever you stop and start them again (i.e. migrate to a different host machine); there's absolutely no reason they would use the network.
EBS itself uses a different network than the “normal” internet, if I were to guess it’s a converged Ethernet network optimized for iSCSI. Which is what Nitro optimizes for as well. But it’s not relevant for the local NVMe storage.
This has an enormous economic impact. I once did a TCO study with AWS to run data-intensive workload running on purpose-built infrastructure on their cloud. AWS would have been 3x more expensive per their own numbers, they didn’t even argue it. The main difference is that we had highly optimized our storage configuration to provide exceptional throughput for our workload on cheap hardware.
I currently run workloads in the cloud because it is convenient. At scale though, the cost difference to run it on your own hardware is compelling. The cloud companies also benefit from a learned helplessness when it comes to physical infrastructure. Ironically, it has never been easier to do a custom infrastructure build, which companies used to do all the time, but most people act like it is deep magic now.
...now I'm actually interested in knowing if "droplet" is derived from "ocean", or if "Digital Ocean" was derived from having many droplets (which was derived from cloud). Maybe neither.
So the converse argument might be: don't bungle it up because you failed to plan. Provision for at least 10x growth with every (re-)implementation.
https://highscalability.com/friendster-lost-lead-because-of-...
I’m sure this is configurable in general though?
nvm, found it: https://dev.37signals.com/solid-cache/
Google's disks perform quite poorly.
And how Discord worked around it: https://discord.com/blog/how-discord-supercharges-network-di...
Does this mean you're colocating your own server in a data center somewhere? Or do you have your own data center/running it off a bare metal server with a business connection?
Just wondering if the TCO included the same levels of redundancy and bandwidth, etc.
There are also multiple Persistent Disk (https://cloud.google.com/persistent-disk) offerings that are backed by SSDs over the network.
(I'm an engineer on GCE. I work directly on the physical hardware that backs our virtualization platform.)
I don't have access to an EC2 instance to check, but you should be able to see the PCIe topology to determine how many physical cards are likely in i4i and im4gn and their PCIe connections. i4i claims to have 8 x 3,750 AWS Nitro SSD, but it isn't clear how many PCIe lanes are used.
Also, AWS claims "Traditionally, SSDs maximize the peak read and write I/O performance. AWS Nitro SSDs are architected to minimize latency and latency variability of I/O intensive workloads [...] which continuously read and write from the SSDs in a sustained manner, for fast and more predictable performance. AWS Nitro SSDs deliver up to 60% lower storage I/O latency and up to 75% reduced storage I/O latency variability [...]"
This could explain the findings in the article - they only measured peak r/w, not predictability.
[0] https://perspectives.mvdirona.com/2019/02/aws-nitro-system/ [1] https://aws.amazon.com/ec2/nitro/ [2] https://d1.awsstatic.com/events/reinvent/2019/REPEAT_2_Power...
The GP is incorrect.
> Local SSD disks are physically attached to the server that hosts your VM.
Disclosure: I work on GCE.
MySpace was the one that took the lead over Friendster and it withered after it got acquired for $500 million by news corp because that was the liquidity event. That's when Facebook gained ground. Your timeline is wrong.
The MySpace switch was because of themes and other features the users found more appealing. Twitter had similar crashes with its fail whale for a long time and they survived it fine. The teen exodus of Friendster wasn't because of TTLB waterfall graphs.
Also MySpace did everything on cheap Microsoft IIS 6 servers in ASP 2.0 after switching from ColdFusion in Macromedia HomeSite; they weren't geniuses. It was a knockoff created by amateurs with a couple of new twists. (A modern clone has 2.5 mil users: see https://spacehey.com/browse still mostly teenagers)
Besides, when the final Friendster holdout of the Asian market had exponential decline in 2008, the scaling problems of 5 years ago had long been fixed. Faster load times did not make up for a product consumers no longer found compelling.
Also Facebook initially was running literally out of Mark's dorm room. In 2007, after they had won the war, their code got leaked because they were deploying the .svn directory in their deploy strategy. Their code was widely mocked. So there we are again.
I don't care if you can find someone who agrees with you on the Friendster scaling thing, almost every collapsed startup has someone that says "we were just too successful and couldn't keep up" because thinking you were just too awesome is the gentler on the ego than realizing a bunch of scrappy hackers just gave people more of what they wanted and either you didn't realize it or you thought your lack of adaption was a virtue.
Unless there is an on-prem movement, I expect cloud to be the future, as maintaining the tech stack on-prem is difficult and we would need to make decisions down to the hardware we order.
That is transparently nonsense.
You can disprove that claim in 5 minutes, and it makes literally zero sense for offerings that aren't oversubscribed
If anything I'd guess it's a procurement issue, parity between regions is a big thing and it's hard to supply dozens of regions around the world with the latest hardware hotness
Yes, it's costly having the hardware to do that, as it mostly meant multiple machines; I always wanted to be able to rebuild one whilst having at least two machines online.
IOPS indeed matters a lot, but so does latency! For our use case, it was much easier to saturate those disks than the old i3s, and we attribute it to the better latencies, making IO scheduling a lot more accurate.
Spin up an E2 VM in Google Cloud and there's a good chance you'll get a nearly 9-year-old Broadwell architecture chip running your workload!
The difference in cost could be attributed mostly to the server hardware build, and to a lesser extent the better scalability with a better network. In this case, we ended up working with Quanta on servers that had everything we needed and nothing we didn’t, optimizing heavily for bandwidth/$. We worked directly with storage manufacturers to find SKUs that stripped out features we didn’t need and optimized for cost per byte given our device write throughput and durability requirements. They all have hundreds of custom SKUs that they don’t publicly list, you just have to ask. A hidden factor is that the software was designed to take advantage of hardware that most enterprises would not deign to use for high-performance applications. There was a bit of supply chain management but we did this as a startup buying not that many units. The final core server configuration cost us just under $8k each delivered, and it outperformed every off-the-shelf server for twice the price and essentially wasn’t something you could purchase in the cloud (and still isn’t). These servers were brilliant, bulletproof, and exceptionally performant for our use case. You can model out the economics of this and the zero-crossing shows up at a lower burn rate than I think many people imagine.
We were extremely effective at using storage, and we did not attach it to expensive, overly-powered servers where the CPUs would have been sitting idle anyway. The sweet spot was low-clock high-core CPUs, which are typically at a low-mid price point but optimal performance-per-dollar if you can effectively scale software to the core count. Since the software architecture was thread-per-core, the core count was not a bottleneck. The economics have not shifted much over time.
AWS uses the same pricing model as everyone else in the server leasing game. Roughly speaking, you model your prices to recover your CapEx in 6 months of utilization. Ignoring overhead, doing it ourselves pulled that closer to 1.5-2 months for the same burn. This moves a lot of the cost structure to things like power, space, and bandwidth. We definitely were paying more for space and power than AWS (usually less for bandwidth) but not nearly enough to offset our huge CapEx advantage relative to workload.
All of this can be modeled out in Excel. No one does it anymore but I am from a time when it was common, so I have that skill in my back pocket. It isn’t nearly as much work as it sounds like, much of the details are formulaic. You do need to have good data on how your workload uses hardware resources to know what to build.
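That kind of model is simple enough to sketch in a few lines of code too; here's a minimal version with illustrative numbers (the ~$8k server cost is from above, while the cloud rate, opex, and service life are assumptions purely for the example):

    # Back-of-the-envelope owned-vs-cloud model (illustrative numbers only)
    server_capex = 8_000     # $ per server delivered (figure quoted above)
    monthly_opex = 350       # $ per server for space/power/bandwidth (assumption)
    cloud_rent   = 4_000     # $ per month for comparable rented capacity (assumption)
    lifetime     = 36        # months of planned service life (assumption)

    owned_total = server_capex + monthly_opex * lifetime
    cloud_total = cloud_rent * lifetime
    payback = server_capex / (cloud_rent - monthly_opex)   # months until owning wins

    print(f"owned over {lifetime} mo : ${owned_total:,}")
    print(f"cloud over {lifetime} mo : ${cloud_total:,}")
    print(f"payback after            : {payback:.1f} months")

The point isn't the particular numbers; it's that the zero-crossing falls out of a handful of inputs you can measure for your own workload.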
For comparison, a single 1 TB consumer SSD can deliver comparable numbers (lower IOPS but higher throughput).
If I plugged 24 consumer SSDs into a box, I would expect over 30M IOPS and near the memory bus limit for throughput (>50 GB/s).
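Working backwards, that estimate implies roughly these per-drive figures (my assumptions, in the ballpark of current high-end consumer NVMe drives):

    # Naive linear scaling across 24 drives, ignoring PCIe lane and CPU limits
    drives = 24
    iops_per_drive = 1_300_000     # assumed 4K random read IOPS per drive
    seq_per_drive  = 7.0           # assumed sequential read GB/s per drive

    print(f"aggregate IOPS : {drives * iops_per_drive:,}")    # ~31 million
    print(f"aggregate GB/s : {drives * seq_per_drive:.0f}")   # 168, so the memory bus caps it first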
[0] Assertion not valid for barrel processors.
Cloud is approx 5x the sticker cost for compute if it's sustained.
Your discounts may vary, rue the day those discounts are taken away because we are all sufficiently locked in.
I think the answer could be mclock scheduler.
https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...
Persistent Disk is not backed by single devices (even for a single NVMe attachment), but by multiple redundant copies spread across power and network failure domains. Those volumes will survive the failure of the VM to which they are attached as well as the failure of any individual volume or host.
In big data AI space, this is exactly what's happening with the top 20th to 100th companies in the world right now.
You're a highly technical user. Non-technical people are weird - part of the MySpace exodus was the belief that it spread "computer viruses", really
There was more to the switches, but I'd have to dredge it up, probably through archive sites these days. The reasons the surveys supported I considered ridiculous, but it doesn't matter; it's better to understand consumer behavior, because we can't easily change it.
Especially these days. It was not possible for me to be a teenager with high speed wi-fi when I was one 30 years ago. I've got near zero understanding of the modern consumer youth market or what they think. Against all my expectations I've become an old person.
Anyways, the freeform HTML was a major driver - it was GeoCities with less effort, which had also exited through a liquidity event and likewise has a clone these days: https://neocities.org/browse
Which is ironic given that when building on-prem/colo'ed setups you'll replicate things to be prepared for unknown lengths of downtime while equipment is repaired or replaced, so this was largely cloud marketing coming back to bite cloud providers' collective asses. Not wanting instances to die "randomly" for no good reason does not always mean wanting performance sacrifices for the sake of more resilient instances.
But AWS at least still offers plenty of instances with instance storage.
If I'm setting up my own database cluster, while I don't want it running on cheap consumer-grade hardware without dual power supplies and RAID, I also don't want to sacrifice SSD speed for something network-attached to survive a catastrophic failure when I'm going to have both backups, archived shipped logs and at least one replica anyway.
I'd happily pick network-attached storage for many things if it gets me increased resilience, but selling me a network-attached SSD, unless it replicates local SSD performance characteristics, is not competitive for applications where performance matters and I'm set up to easily handle system-wide failures anyway.
And this is one of the big "secrets" of AWS's success: shifting a lot of resource allocation and power from people with budgeting responsibility to developers who have usually never seen the budget or accounts, don't keep track, and at most get pulled in retrospectively to explain line items in expenses, all while obscuring the spend itself (to the point where I know people who've spent six-figure amounts worth of dev time building analytics to figure out where their cloud spend goes... tooling has gotten better but is still awful)
I believe a whole lot of tech stacks would look very different if developers and architects were more directly involved in budgeting, and bonuses etc. were linked at least in part to financial outcomes affected by their technical choices.
A whole lot of claims to low cloud costs come from people who have never done actual comparisons and who seem to have a pathological fear of hardware, even when for most people you don't need to ever touch a physical box yourself - you can get maybe 2/3's of the savings with managed hosting as well.
You don't get the super-customized server builds, but you do get far more choice than with cloud providers, and you can often make up for the lack of fine-grained control by renting or leasing somewhere where the physical hosting is cheaper. For example, at a previous employer what finally made us switch to Hetzner for most new capacity was that while we didn't get exactly the hardware we wanted, we got "close enough", coupled with data centre space in their German locations costing far less than data centre space in London. It didn't make them much cheaper, but it did make them sufficiently cheaper to outweigh the hardware differences, with enough margin for us to deploy new stuff there while still keeping some of our colo footprint.
-edit: this comment was purely focused at your first sentence:
>The fastest SSDs tend to also be MLC which tend to have much lower write life vs other technologies.
I'm not sure what you mean with "other technologies" in this case, SLC is indeed truly expensive for a significantly higher write life, and HDDs are debatable for their lifespan.
Are you hiring?
Cloud is great for prototyping or randomly elastic workloads, but it feels like people are pushing highly static workloads from on-prem to cloud. I'd love to be part of the change going the other way. Especially since the skills for doing so seem to have dried up completely.
I buy Samsung drives relatively exclusively if that makes any difference.
All that to say though: this is why things like journalling and write-ahead systems exist. OS design is mostly about working around physical (often physics related) limitations of hardware and one of those is what to do if you get caught in a situation where something is incomplete.
The prevailing methodology is to paper over it with some atomic actions. For example: Copy-on-Write or POSIX move semantics (rename(2)).
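As a minimal sketch of the rename(2) trick (POSIX-only, and the function name is just a placeholder): write the new contents to a temp file in the same directory, fsync it, then atomically swap it into place so a reader or a crash never sees a half-written file.

    import os, tempfile

    def atomic_write(path: str, data: bytes) -> None:
        # Readers see either the old or the new contents, never a torn write.
        dirname = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=dirname)   # temp file on the same filesystem
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())              # make the new contents durable first
            os.rename(tmp, path)                  # the atomic swap
        except BaseException:
            try:
                os.unlink(tmp)
            except FileNotFoundError:
                pass
            raise
        # fsync the directory so the rename itself survives power loss
        dfd = os.open(dirname, os.O_DIRECTORY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)

Journalling and write-ahead logging are more elaborate versions of the same idea: always leave a consistent state to recover to.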
Then some spiffy young dev comes along and turns off all of those guarantees and says they made something ultra fast (*cough*mongodb*cough*) then maybe claims those guarantees are somewhere up the stack instead. This is almost always a lie.
Also: Beware any database that only syncs to VFS.
Also, sometimes it is poor communication. Just yesterday I saw some code that requests an auth token before every request, even though each bearer token comes with an expires-in of about twelve hours.
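A minimal sketch of what that code could have done instead, caching the token until shortly before it expires (the endpoint URL and field names here are hypothetical; the pattern is the point):

    import time
    import requests

    TOKEN_URL = "https://auth.example.com/oauth/token"   # hypothetical endpoint
    _cache = {"token": None, "expires_at": 0.0}

    def get_token() -> str:
        # Only hit the auth server when the cached token is about to expire.
        now = time.monotonic()
        if _cache["token"] is None or now >= _cache["expires_at"]:
            resp = requests.post(TOKEN_URL, data={"grant_type": "client_credentials"})
            resp.raise_for_status()
            body = resp.json()
            _cache["token"] = body["access_token"]
            # Refresh a minute early so in-flight requests never carry a stale token.
            _cache["expires_at"] = now + body["expires_in"] - 60
        return _cache["token"]

One extra round trip per request adds up fast when the token is good for twelve hours.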
The difference is setting up all of that and maintaining it/debugging when something goes wrong is not a small task IMHO.
For some companies with that experience in-house I can understand doing it all yourself. As a solo founder and an employee of a small company we don’t have the bandwidth to do all of that without hiring 1+ more people which are more expensive than the cloud costs.
If we were drive-speed-constrained and getting that speed just wasn’t possible then maybe the math would shift further in favor of colo but we aren’t. Also upgrading the hardware our servers run on is fairly straightforward vs replacing a server on a rack or dealing with failing/older hardware.
The third party API is the part that has the potential to turn this straightforward task into a byzantine mess, though, so I suspect that's the missing piece of information.
I'm comparing this to my own experience with IRC, where handling the same or larger streams of messages is common. And that's with receiving this in real time, storing the messages, matching and potentially reacting to them, and doing all that while running on a raspberry pi.
The metrics you mention have to be pagecache hits. Basically all MLC NAND is in the double digit microseconds for uncontended random reads.
Consultants were brought in to move our apps (some of which were Excel macros, others SAS scripts running on an old desktop) to Azure. The Azure architects identified Postgres as the best tool. The consultants attempted to create a Postgres index in a small Azure instance, but their tests would fail without completing (they were using string concatenation rather than the native indexing function).
Consultants' conclusion: file too big for Postgres.
I disputed this. There's plenty of literature out there on Pg handling bigger files. The Postgres (for Windows!) instance on my Core i7 laptop with an NVMe drive could index the file in about an hour. As an experiment I spun up a bare-metal NVMe instance on a Ryzen 7600 (lowest-power, 6-core) Zen 4 CPU PC with a 1TB Samsung PCIe 4 NVMe drive.
Got my index in 10 minutes.
I then tried to replicate this in Azure, upping the CPUs, memory, and moving to the NVMe Azure CPU family (Ebsv5). Even at a $2000/mo level, I could not get the Azure instance any faster than about one fifth the speed of my bare-metal experiment (roughly an hour vs. ten minutes). I probably could have matched it eventually with more cores, but did not want to get called on the carpet for a ten-grand Azure bill.
All this happened while I was working from home (one can't spin up an experimental bare metal system at a drop-in spot in the communal workroom).
What happened next I don't know, because I left in the midst of RTO fever. I was given the option of moving 1000 miles to commute to a hub office, or retire "voluntarily with severance." I chose the latter.
The random access latency for L1 is indeed ~4 cycles, but RAM is more like 70 ns+ (so you are half an order of magnitude off, how dare you?).
These are my notes from an AnandTech article (a quick ns-to-cycles conversion follows the list):
RAM:
5950x: 79 ns (400 cycles)
3950x: 86 ns (400 cycles)
10900k: 71 ns
Apple M1: 97 ns
Apple M1 Max: 111 ns
Apple A15P: 105 ns
Apple A15E: 141 ns
Apple A14P: 101 ns
Apple A14E: 201 ns
888 X1: 114 ns
L1 Cache:
5950x: 4 cycles (0.8 ns @ 5 GHz)
Apple M1: 3 cycles
Apple M1 Max: 3 cycles
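The cycle counts are just nanoseconds times clock speed; the clock figures here are nominal assumptions:

    # latency in cycles ~= latency in ns * clock in GHz
    for name, ns, ghz in [("5950X", 79, 5.0), ("M1 Max", 111, 3.2)]:
        print(f"{name}: {ns} ns ~= {ns * ghz:.0f} cycles")

which is where the ~400-cycle figure for the Ryzen parts comes from.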
[^1]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_I... [^2]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/how-ec2-...
Next up they’re going to explain to us that iSCSI wants us to think it’s SCSI but it’s actually not!
Elastic compute means you want to be able to treat compute hardware as fungible. Persistent local storage makes that a lot harder because the Cloud provider wants to hand out that compute to someone else after shutdown, so the local storage needs to be wiped.
So you either get ephemeral local SSDs (and have to handle rebuild on restart yourself) or network-attached SSDs with much higher reliability and persistence, but a fraction of the performance.
Active instances can be migrated, of course, with sufficient cleverness in the I/O stack.
sudo fio --name=read_iops_test --filename=/dev/nvme0n1 --filesize=1500G --time_based --ramp_time=1s --runtime=15s --ioengine=io_uring --fixedbufs --direct=1 --verify=0 --randrepeat=0 --bs=4K --iodepth=256 --rw=randread --iodepth_batch_submit=256 --iodepth_batch_complete_max=256 --cpus_allowed=0-7
read_iops_test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=256
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=6078MiB/s][r=1556k IOPS][eta 00m:00s]
read_iops_test: (groupid=0, jobs=1): err= 0: pid=11085: Wed Feb 21 08:57:35 2024
read: IOPS=1555k, BW=6073MiB/s (6368MB/s)(89.0GiB/15001msec)
slat (nsec): min=401, max=93168, avg=7547.42, stdev=4396.47
clat (nsec): min=1426, max=1958.2k, avg=154599.19, stdev=92730.02
lat (usec): min=56, max=1963, avg=162.15, stdev=92.68
clat percentiles (usec):
| 1.00th=[ 71], 5.00th=[ 78], 10.00th=[ 83], 20.00th=[ 92],
| 30.00th=[ 100], 40.00th=[ 111], 50.00th=[ 124], 60.00th=[ 141],
| 70.00th=[ 165], 80.00th=[ 200], 90.00th=[ 265], 95.00th=[ 334],
| 99.00th=[ 519], 99.50th=[ 603], 99.90th=[ 807], 99.95th=[ 898],
| 99.99th=[ 1106]
bw ( MiB/s): min= 5823, max= 6091, per=100.00%, avg=6073.70, stdev=47.56, samples=30
iops : min=1490727, max=1559332, avg=1554866.87, stdev=12174.38, samples=30
lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=30.18%
lat (usec) : 250=58.12%, 500=10.55%, 750=1.00%, 1000=0.13%
lat (msec) : 2=0.02%
cpu : usr=25.41%, sys=74.57%, ctx=2395, majf=0, minf=58
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=5.7%, 8=14.8%, 16=54.8%, 32=24.3%, 64=0.3%, >=64=0.1%
complete : 0=0.0%, 4=2.9%, 8=13.0%, 16=56.9%, 32=26.8%, 64=0.3%, >=64=0.1%
issued rwts: total=23320075,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=6073MiB/s (6368MB/s), 6073MiB/s-6073MiB/s (6368MB/s-6368MB/s), io=89.0GiB (95.5GB), run=15001-15001msec
Disk stats (read/write):
nvme0n1: ios=24547748/0, merge=1/0, ticks=3702834/0, in_queue=3702835, util=99.35%
And then again with IOPS limited to ~2GB/s:
sudo fio --name=read_iops_test --filename=/dev/nvme0n1 --filesize=1500G --time_based --ramp_time=1s --runtime=15s --ioengine=io_uring --fixedbufs --direct=1 --verify=0 --randrepeat=0 --bs=4K --iodepth=256 --rw=randread --iodepth_batch_submit=256 --iodepth_batch_complete_max=256 --cpus_allowed=0-7 --rate_iops=534000
read_iops_test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=256
fio-3.28
Starting 1 process
Jobs: 1 (f=1), 0-534000 IOPS: [r(1)][100.0%][r=2086MiB/s][r=534k IOPS][eta 00m:00s]
read_iops_test: (groupid=0, jobs=1): err= 0: pid=11114: Wed Feb 21 08:59:30 2024
read: IOPS=534k, BW=2086MiB/s (2187MB/s)(30.6GiB/15001msec)
slat (nsec): min=817, max=88336, avg=41533.20, stdev=7711.33
clat (usec): min=7, max=485, avg=93.19, stdev=39.73
lat (usec): min=65, max=536, avg=134.72, stdev=37.83
clat percentiles (usec):
| 1.00th=[ 32], 5.00th=[ 41], 10.00th=[ 47], 20.00th=[ 59],
| 30.00th=[ 70], 40.00th=[ 79], 50.00th=[ 89], 60.00th=[ 98],
| 70.00th=[ 110], 80.00th=[ 122], 90.00th=[ 145], 95.00th=[ 167],
| 99.00th=[ 217], 99.50th=[ 235], 99.90th=[ 277], 99.95th=[ 293],
| 99.99th=[ 334]
bw ( MiB/s): min= 2084, max= 2086, per=100.00%, avg=2086.08, stdev= 0.38, samples=30
iops : min=533715, max=534204, avg=534037.57, stdev=97.91, samples=30
lat (usec) : 10=0.01%, 20=0.04%, 50=12.42%, 100=49.30%, 250=37.97%
lat (usec) : 500=0.28%
cpu : usr=11.48%, sys=27.35%, ctx=2278177, majf=0, minf=58
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=0.4%, 8=0.2%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=99.3%
complete : 0=0.0%, 4=95.4%, 8=4.5%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.0%
issued rwts: total=8009924,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=2086MiB/s (2187MB/s), 2086MiB/s-2086MiB/s (2187MB/s-2187MB/s), io=30.6GiB (32.8GB), run=15001-15001msec
Disk stats (read/write):
nvme0n1: ios=8543389/0, merge=0/0, ticks=934147/0, in_queue=934148, util=99.33%
edit: formatting...
I fire up vCPU or dedicated or bare metal in the cloud, it doesn't matter: I simply cannot match the equivalent compute of real hardware, and it's not even close.
You should not have needed an Ebsv5 (memory-optimised) instance. For that kind of thing, you should only have needed a D-series VM with a premium storage data disk (or, if you wanted a hypervisor-adjacent, very low latency volume, a temp volume in another SKU).
Anyway, many people fail to understand that Azure Storage works more like a SAN than a directly attached disk--when you attach a disk volume to the VM, you are actually attaching a _replica set_ of that storage that is at least three-way replicated and distributed across the datacenter to avoid data loss. You get RAID for free, if you will.
That is inherently slower than a hypervisor-adjacent (i.e., on-board) volume.
I've said this a bit more sarcastically elsewhere in this thread, but basically, why would you expect people to understand this? Cloud is sold as abstracting away hardware details and giving performance SLAs billed by the hour (or minute, second, whatever). If you need to know significant details of their implementation, then you're getting to the point where you might as well buy your own hardware and save a bunch of money (which seems to be gaining some steam in a minor but noticeable cloud repatriation movement).
What I chose ultimately was definitely "NVMe attached" and definitely pricey. The "hypervisor-adjacent, very low latency volume" was not an obvious choice.
The best performing configuration did come from me--the db admin learning Azure on the fly--and not the four Azure architects nor the half dozen consultants with Azure credentials brought onto the project.
In theory the ATX PSU reports imminent power loss with a mandatory notice of no less than 1ms; this would easily be enough to finish in-flight writes and record the zone state.
- VMs with SSDs can (in general -- there are exceptions for things like GPUs and exceptionally large instances) live migrate with contents preserved.
- GCE supports a timeboxed "restart in place" feature where the VM stays in limbo ("REPAIRING") for some amount of time waiting for the host to return to service: https://cloud.google.com/compute/docs/instances/host-mainten.... This mostly only applies to transient failures like power-loss beyond battery/generator sustaining thresholds, software crashes, etc.
- There is a related feature, also controlled by the `--discard-local-ssd=` flag, which allows preservation of local SSD data on a customer initiated VM stop.
And they absolutely must understand this to avoid mis-designing things. Failure to do so is just bad engineering, and a LOT of time is spent educating customers on these differences.
A case in point that aligns with this: I used to work with Hadoop clusters, where you would use data replication for both redundancy and distributed processing. Moving Hadoop to Azure and maintaining conventional design rules (i.e., tripling the amount of disks) is the wrong way to do things, because it isn't required for either redundancy or performance (both are catered for by the storage resources).
(Of course there are better solutions than Hadoop these days - Spark being one that is very nice from a cloud resource perspective - but many people have nine times the storage they need allocated in their cloud Hadoop clusters because of lack of understanding...)
It's really easy to reproduce (at least for me?) and I'm pretty sure anyone can do it if they try to on purpose.
Now if you can show me two or more hosts connected to a box of SSDs through a PCI switch (and some sort of cool tech for coordinating between the hosts), that's interesting.
Yes, the cloud is _different_ to manage and has some of the same fundamentals to overcome such as security and networking, but lacks some of the very large pain points of managing an OS, like updates, ancillary local services, local accounts, and so on.
I'm not sure why you would state that it doesn't solve the problem I'm invested in -- namely operating websites. It is the perfect cloud workload.
To actually deliver on that promise while maintaining abstraction of just “dump your data on C:/ as you are used to”, there are compromises in performance that need to be taken. This is one of the biggest pitfalls of the cloud if you care more about performance than resiliency. Finding disks that don’t have such guarantees is still possible, just be aware of it.
I would vaguely expect it not to match my workstation, sure, but all throughout this thread (and others) people have cited outrageous disparities, i.e. 5x less performance than you'd expect even if you'd already managed your expectations down to, say, 2x less because the cloud compute isn't a bare-metal machine.
In other words, and to illustrate this with a bad example: I'd be fine paying for an i7 CPU and ending up at i5 speeds... but I'm absolutely not fine with ending up at Celeron speeds.
But that isn't the delta I'm seeing; it's a 5-10x performance delta, not a 30-50% one.
I'd expect that most of the work from a SSD read is offloaded to the disk controller, which presumably uses DMA, and you don't have nearly as many round trips (a sequential read can be done with a single SCSI command).
I'm inclined to agree with the explanation given by other commenters that the limiting factor for SSD r/w speeds in the cloud is due to throttling in the hypervisor to provide users with predictable performance as well as isolation in a multitenant environment.
However, the disks are still remote replica sets, as someone else mentioned. They’re not flash drives plugged into the host, despite appearances.
Something to try is (specifically) the Ebdsv5 series with the ‘d’ meaning it has local SSD cache and temp disks. Configure Postgres to use the temp disk for its scratch space and turn on read/write caching for the data disks.
You should see better performance, but still not as good as a laptop… that will have to wait for the v6 generation of VMs.
> Amazon SimpleDB measures the machine utilization of each request and charges based on the amount of machine capacity used to complete the particular request (SELECT, GET, PUT, etc.), normalized to the hourly capacity of a circa 2007 1.7 GHz Xeon processor. See below for a more detailed description of how machine utilization charges are calculated.
Internally, we say droplet instead of host as there are multiple hosts/mobos per droplet these days. It’s no longer true that when you get a metal droplet, you get the entire droplet.
None came within an order of magnitude of a Ryzen 7600/NVMe mobo sitting in my son's old gaming case.
An option I did not try was Ultra disk, which I recall being significantly more expensive and was not part of the standard corporate offering. I wasn't itching to get dragged in front of the architecture review board again, anyway.
OpenChannelIO: stop making such absurd totalizing systems that hide the actual flash! We don't need your fancy controllers! Just give us a bunch of flash that we can individually talk to & control ourselves from the host! You leave so much performance on the table! There's so many access patterns that wouldn't suck & would involve random rewrites of blocks!
Then after years of squabbling, eventually: ok we have negotiated for years. We will build new standards to allow these better access patterns that require nearly no drive controller intermediation, that have exponentially higher degrees of mechanistic sympathy of what flash actually does. Then we will never release drives mainstream & only have some hard to get unbelievably expensive enterprise drives that support it.
You are so f-ing spot on. This ZNS would be perfect for lower reliability consumer drives. But: market segmentation.
The situation is so fucked. We are so impeded from excelling, as a civilization, case number nine billion three hundred forty two.
From Google's perspective, if the hardware is paid for, still reliable, and they can still make money on it, they can put new hardware in new racks rather than replacing the old hardware. This suggests Google's DC's aren't space constrained but I'm not surprised after looking at a few via satellite images!
I think you may be conflating the fact that across two VPCs you may be slightly more likely to be doing a cross availability zone or potentially even cross region network hop? I just think it's important to be on the pulse of what's really going on here
I don't think this is true, because the old chips don't use more power outright[1][2][3]. In fact in many cases new chips use more power due to the higher core density. The new chips are way more efficient because they do more work per watt, but like I said in my previous comment you aren't paying for a unit of work. The billing model for the cloud providers is that of a rental: you pay per minute for the instance.
There's complexity here, like being able to pack more "instances" (VMs) onto a physical host with the higher core count machines, but I don't think simply saying the new hardware is cheaper to run is clear cut.
[1]: https://cloud.google.com/compute/docs/cpu-platforms#intel_pr...
[2]: https://www.intel.com/content/www/us/en/products/sku/93792/i...
[3]: https://www.intel.com/content/www/us/en/products/sku/231746/...
Search terms include “lapse rate” if you would like to learn more.