
SSDs have become fast, except in the cloud

(databasearchitects.blogspot.com)
589 points | greghn
jiggawatts No.39446520
There's a lot of talk about cloud network and disk performance in this thread. I recently benchmarked both Azure and AWS and found that:

- Azure network latency is about 85 microseconds.

- AWS network latency is about 55 microseconds.

- Both can do better, but only in special circumstances such as RDMA NICs in HPC clusters.

- Cross-VPC or cross-VNET is basically identical. Some people were saying it's terribly slow, but I didn't see that in my tests.

- Cross-zone latency is 300-1,200 microseconds, due to the inescapable speed-of-light delay.

- VM-to-VM bandwidth is over 10 Gbps (>1 GB/s) for both clouds, even for the smallest two vCPU VMs!

- Azure Premium SSD v1 latency varies between about 800 and 3,000 microseconds, which is many times worse than the network latency.

- Azure Premium SSD v2 latency is about 400 to 2,000 microseconds, which isn't that much better, because:

- Local SSD caches in Azure are so much faster than remote disks that Premium SSD v1 almost always ends up faster than Premium SSD v2, because the latter doesn't support caching.

- Again in Azure, the local SSD "cache" and the local "temp disks" both have latency as low as 40 microseconds, on par with a modern laptop NVMe drive. We found that switching to the latest-gen VM SKU and turning on "read caching" for the data disks was the magic "go-fast" button for databases... without the risk of losing our data.

We investigated the various local-SSD VM SKUs in both clouds, such as the Lasv3 series, and as the article mentioned, the performance delta didn't blow my skirt up, but the data-loss risk made them not worth the hassle.
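
For anyone who wants to reproduce rough numbers like these, here's a minimal sketch of the kind of probe that produces them — not the exact harness behind the figures above. It measures TCP round-trip latency between two VMs and fsync'd 4 KiB write latency against whatever path you point it at; the port, sample count, and paths are arbitrary placeholders.

    # Minimal latency probe sketch -- not the exact harness used for the
    # numbers above. Run "server" on one VM and "client <host>" on another
    # for network RTT; run "disk <path>" against a mounted data disk for
    # write latency. Port, sample count and paths are placeholders.
    import os, socket, sys, time

    PORT, N = 5001, 1000

    def report(name, lat_us):
        lat_us.sort()
        p50 = lat_us[len(lat_us) // 2]
        p99 = lat_us[int(len(lat_us) * 0.99)]
        print(f"{name}: p50={p50:.0f}us p99={p99:.0f}us")

    def net_server():
        with socket.create_server(("0.0.0.0", PORT)) as srv:
            conn, _ = srv.accept()
            with conn:
                while data := conn.recv(64):
                    conn.sendall(data)          # echo back for round-trip timing

    def net_client(host):
        lat = []
        with socket.create_connection((host, PORT)) as s:
            s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            for _ in range(N):
                t0 = time.perf_counter_ns()
                s.sendall(b"x" * 64)
                s.recv(64)
                lat.append((time.perf_counter_ns() - t0) / 1000)
        report("TCP round trip", lat)

    def disk(path):
        lat, buf = [], os.urandom(4096)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        for _ in range(N):
            t0 = time.perf_counter_ns()
            os.pwrite(fd, buf, 0)
            os.fsync(fd)                        # push the write down to the device
            lat.append((time.perf_counter_ns() - t0) / 1000)
        os.close(fd)
        report("4 KiB write + fsync", lat)

    if __name__ == "__main__":
        if sys.argv[1] == "server":
            net_server()
        elif sys.argv[1] == "client":
            net_client(sys.argv[2])
        else:
            disk(sys.argv[2])

It's crude (single connection, synchronous writes), but it's enough to see the order-of-magnitude gap between a local temp disk and a remote premium disk.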

replies(1): >>39446865 #
computerdork No.39446865
Interesting. And would you happen to have numbers on the performance of the local SSD? Is its read and write throughput up to the level of modern SSDs?
replies(1): >>39447325 #
jiggawatts No.39447325
It's pretty much as the article said. The cloud local SSDs are notably slower than what you'd get in an ordinary laptop, let alone a high-end server.

I'm not an insider and don't have any exclusive knowledge, but from reading a lot about the topic, my impression is that the issue in both clouds is virtualization overhead.

That is, having the networking or storage go through any hypervisor software layer is what kills the performance. I've seen similar numbers with on-prem VMware, Xen, and Nutanix setups as well.

Both clouds appear to be working on next-generation VM SKUs where the hypervisor network and storage functions are offloaded into 100% hardware, either into FPGAs or custom ASICs.

"Azure Boost" is Microsoft's marketing name for this, and it basically amounts to both local and remote disks going through an NVMe controller directly mapped into the memory space of the VM. That is, the VM OS kernel talks directly to the hardware, bypassing the hypervisor completely. This is shown in their documentation diagrams: https://learn.microsoft.com/en-us/azure/azure-boost/overview

They're claiming up to 3.8M IOPS for a single VM, which is 3-10x what you'd get out of a single NVMe SSD stick, so... not too shabby at all!
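
One way to sanity-check whether a given VM is actually on that NVMe-offloaded path is to see whether the guest kernel enumerates the disks as real NVMe controllers rather than paravirtual SCSI devices. The sketch below just walks the Linux sysfs tree; the interpretation (NVMe = offloaded, hv_storvsc/virtio = paravirtual) is my own reading of the docs, not an official check.

    # Sketch: list the NVMe controllers the guest kernel can see (Linux sysfs).
    # Disks surfaced via the hardware-offload path show up here; classic
    # paravirtual disks appear as SCSI/virtio devices instead.
    import glob, os

    for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
        with open(os.path.join(ctrl, "model")) as f:
            model = f.read().strip()
        print(f"{os.path.basename(ctrl)}: {model}")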

Similarly, Microsoft Azure Network Adapter (MANA) is the equivalent for the NIC: it connects the VM OS directly to the network, again bypassing the hypervisor software.

I'm not an AWS expert, but from what I've seen they've been working on similar tech (Nitro) for years.
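
The same sort of check works for the NIC: look at which driver the guest has bound to each interface. A paravirtual NIC typically binds hv_netvsc (Hyper-V) or virtio_net (KVM-based hosts), while an offloaded adapter binds its own driver. The specific driver names vary by platform and kernel version, so treat the comments below as illustrative.

    # Sketch: show which kernel driver each network interface is bound to.
    # hv_netvsc / virtio_net suggest a paravirtual path; an adapter-specific
    # driver suggests the hardware-offloaded path. Driver names vary by
    # platform and kernel -- illustrative only.
    import glob, os

    for nic in sorted(glob.glob("/sys/class/net/*")):
        drv_link = os.path.join(nic, "device", "driver")
        if os.path.islink(drv_link):
            driver = os.path.basename(os.readlink(drv_link))
            print(f"{os.path.basename(nic)}: {driver}")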

replies(2): >>39447827 #>>39448240 #
computerdork No.39448240
Makes a lot of sense! Yeah, it seems like for the OP's performance issue, you pretty much have the reason it's happening (VM overhead) and a solution for it (bypassing the software layer with custom hardware like Azure Boost).

Thanks for the info!