
283 points ghuntley | 4 comments
ayende No.45135399
This is wrong, because your mmap code is being stalled on page faults (including soft page faults, which you get when the data is already in memory but not yet mapped into your process).

The io_uring code looks like it is doing all the fetch work in the background (with 6 threads), then just handing the completed buffers to the counter.

Do the same with 6 threads that first read one byte from each page and then hand that page section to the counter, and you'll find similar performance.

And you can use madvise and huge pages to control the mmap behavior.
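
Roughly, the mmap side of that would look like the sketch below: madvise hints plus a handful of threads that each touch one byte per page to pre-fault their slice before the counting pass runs. This is a minimal illustration on Linux, not the article's code; the file name, thread count, and trivial newline "counter" are placeholders, and a real version would pipeline the pre-faulting with the counting instead of running them back to back.

```c
/* Sketch: pre-fault an mmap'd file from worker threads so the counting
 * loop is not stalled on page faults. Compile with -lpthread. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 6  /* matches the 6 background threads mentioned above */

struct slice { const unsigned char *base; size_t len; size_t page; };

/* Touch one byte per page so the kernel maps it before the counter runs. */
static void *prefault(void *arg)
{
    struct slice *s = arg;
    volatile unsigned char sink = 0;
    for (size_t off = 0; off < s->len; off += s->page)
        sink += s->base[off];
    return NULL;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "data.bin";  /* placeholder */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { perror("fstat"); return 1; }
    size_t fsize = (size_t)st.st_size;

    unsigned char *map = mmap(NULL, fsize, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint the kernel: sequential access, start readahead now. */
    madvise(map, fsize, MADV_SEQUENTIAL);
    madvise(map, fsize, MADV_WILLNEED);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t chunk = ((fsize / NTHREADS) + page) / page * page;  /* page-aligned */
    pthread_t tid[NTHREADS];
    struct slice sl[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        size_t start = (size_t)i * chunk;
        size_t len = start < fsize ? fsize - start : 0;
        if (len > chunk) len = chunk;
        sl[i].base = map + (len ? start : 0);
        sl[i].len  = len;
        sl[i].page = page;
        pthread_create(&tid[i], NULL, prefault, &sl[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    /* The "counter" stage: here it just counts newlines in the mapped file. */
    size_t count = 0;
    for (size_t i = 0; i < fsize; i++)
        count += (map[i] == '\n');
    printf("%zu newlines\n", count);

    munmap(map, fsize);
    close(fd);
    return 0;
}
```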

mrlongroots No.45138707
Yes, it doesn't take a benchmark to find out that storage cannot be faster than memory.

Even if you had a million SSDs and somehow connected them all to a single machine, you would not outperform memory, because the data needs to be read into memory first, and can only then be processed by the CPU.

A basic `perf stat` run and its minor/major fault counts should be a first-line diagnostic.
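
For instance, a minimal sketch of pulling the same minor/major fault counters from inside the program itself with getrusage() (the workload in the middle is a placeholder):

```c
/* Sketch: read the process's own minor/major fault counters, the same
 * numbers `perf stat` reports, before and after a workload. */
#include <stdio.h>
#include <sys/resource.h>

static void report(const char *label)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("%s: minor faults=%ld, major faults=%ld\n",
               label, ru.ru_minflt, ru.ru_majflt);
}

int main(void)
{
    report("before");
    /* ... run the mmap or io_uring workload here (placeholder) ... */
    report("after");
    return 0;
}
```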

alphazard No.45139067
> storage cannot be faster than memory

This is an oversimplification. It depends on what you mean by memory. It may be true when using NVMe on modern architectures in a consumer use case, but it's not true of computer architecture in general.

External devices can have their memory mapped to virtual memory addresses; some network cards do this, for example. The CPU can load from these virtual addresses directly into registers, without needing to make a copy into the general-purpose fast-but-volatile memory first. In theory a storage device could also be implemented this way.
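
On Linux that can be sketched roughly like this: mmap a PCIe device's BAR through sysfs and load from it directly. The device path, mapping size, and register offset below are made up for illustration, and whether the load means anything depends entirely on the device.

```c
/* Illustrative sketch only: map a PCIe device's BAR0 into user space via
 * sysfs and load from it directly. The device address and offset are
 * hypothetical; real access depends on the device and typically needs root. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *bar = "/sys/bus/pci/devices/0000:03:00.0/resource0";  /* hypothetical */
    size_t len = 4096;

    int fd = open(bar, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *regs = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    /* This CPU load goes to device memory over PCIe; the data is never
     * staged in DRAM first. */
    uint32_t value = regs[0];
    printf("register 0: 0x%08x\n", (unsigned)value);

    munmap((void *)regs, len);
    close(fd);
    return 0;
}
```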

Thaxll No.45140329
I mean, even with their memory mapped, the physical access to that memory (NVMe) will always be slower than the physical access to RAM, right?

Which external device has memory access as fast as or faster than generic RAM?

usefulcat No.45140586
I have heard that some Intel NICs can put received data directly into L3 cache. That would definitely make it faster to access than if it were in main RAM.

If a NIC can do that over PCIe, probably other PCIe devices could do the same, at least in theory.

wahern No.45145204
For the curious,

> When a 100G NIC is fully utilized with 64B packets and 20B Ethernet overhead, a new packet arrives every 6.72 nanoseconds on average. If any component on the packet path takes longer than this time to process the individual packet, a packet loss occurs. For a core running at 3GHz, 6.72 nanoseconds only accounts for 20 clock cycles, while the DRAM latency is 5-10 times higher, on average. This is the main bottleneck of the traditional DMA approach.

> The Intel® DDIO technology in Intel® Xeon® processors eliminates this bottleneck. Intel® DDIO technology allows PCIe devices to perform read and write operations directly to and from the L3 cache, or the last level cache (LLC).

https://www.intel.com/content/www/us/en/docs/vtune-profiler/...
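
(Spelling out the arithmetic behind that figure: 64 B + 20 B = 84 B = 672 bits per packet, and 672 bits / 100 Gb/s = 6.72 ns; at 3 GHz that's 6.72 × 3 ≈ 20 cycles.)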

mrlongroots No.45147879
From a response elsewhere in this thread (my current understanding, could be wrong):

> You can temporarily stream directly to the L3 cache with DDIO, but as it fills up the cache will get flushed back to main memory anyway and you will ultimately be memory-bound. I don't think there is some non-temporal magic here that circumvents main memory entirely.

wahern No.45148562
Ah. I guess it's only useful for an optimized router/firewall or a network storage appliance, perhaps with a bespoke stack carefully tuned to process the data quickly and hand it back to the controller (via DDIO) before it gets flushed.

EDIT: Here's an interesting writeup about trying to make use of it with FreeBSD+netmap+ipfw: https://adrianchadd.blogspot.com/2015/04/intel-ddio-llc-cach... So it can work as advertised; it's just very constraining, requiring careful tuning if not outright rearchitecting your processing pipeline with those constraints in mind.