283 points ghuntley | 3 comments

ayende No.45135399
This is wrong, because your mmap code is being stalled by page faults (including soft page faults, which you get when the data is already in memory but not yet mapped into your process).

The io_uring code looks like it is doing all the fetch work in the background (with 6 threads), then just handing the completed buffers to the counter.

Do the same with 6 threads that first read the first byte of each page and then hand that page section to the counter, and you'll find similar performance.

And you can use both madvise and huge pages to control the mmap behavior.
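
Roughly, a minimal single-threaded sketch of that pre-touch + madvise approach might look like this (Linux-specific; error handling and the actual word-count step are elided, and the threaded version would just split the touch loop across workers):

    /* Sketch (Linux): mmap the file, hint the kernel with madvise, and
     * pre-touch one byte per page so the soft faults happen up front
     * instead of stalling the counting loop. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
        size_t len = (size_t)st.st_size;

        char *buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* Hints, not guarantees: start readahead, prefer huge pages. */
        madvise(buf, len, MADV_WILLNEED);
        madvise(buf, len, MADV_HUGEPAGE);

        /* Fault every page in now; a threaded variant would split this
         * loop across workers. */
        long page = sysconf(_SC_PAGESIZE);
        volatile char sink = 0;
        for (size_t off = 0; off < len; off += (size_t)page)
            sink += buf[off];
        (void)sink;

        /* ... hand buf/len to the word counter here ... */

        munmap(buf, len);
        close(fd);
        return 0;
    }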

replies(4): >>45135629 #>>45138707 #>>45140052 #>>45147766 #
mrlongroots No.45138707
Yes, it doesn't take a benchmark to find out that storage cannot be faster than memory.

Even if you had a million SSDs and were somehow able to connect them all to a single machine, you would not outperform memory, because the data needs to be read into memory first and can only then be processed by the CPU.

Basic `perf stat` and minor/major faults should be a first-line diagnostic.
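
E.g. something like this (the binary and input names are placeholders):

    perf stat -e task-clock,minor-faults,major-faults ./wordcount big.txt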

replies(3): >>45139067 #>>45143065 #>>45152315 #
1. johncolanduoni No.45143065
This was a comparison of two methods of moving data from the VFS to application memory. Depending on cache status, this would run the whole gamut of mapping existing memory pages, kernel-to-userspace memory copies, and actual disk access.

Also, while we’re being annoyingly technical, a lot of server CPUs can DMA straight to the L3 cache, so your proof of impossibility is not correct.

replies(1): >>45147872 #
2. mrlongroots No.45147872
> This was a comparison of two methods of moving data from the VFS to application memory. Depending on cache status, this would run the whole gamut of mapping existing memory pages, kernel-to-userspace memory copies, and actual disk access.

Yes, I think maybe a reasonable statement is that a benchmark is supposed to isolate a meaningful effect. This benchmark was not set up correctly to isolate a meaningful effect IMO.

> Also, while we’re being annoyingly technical, a lot of server CPUs can DMA straight to the L3 cache so your proof of impossibility is not correct.

Interesting, didn't know that, thanks!

I think this does not invalidate the point, though. You can temporarily stream directly into the L3 cache with DDIO, but as the cache fills up it will get flushed back to main memory anyway, and you will ultimately be memory-bound. I don't think there is some non-temporal magic here that circumvents main memory entirely.

replies(1): >>45155285 #
3. johncolanduoni No.45155285
I agree, not a well-designed benchmark. The mmap side did not use any of the mechanisms for paging data in before faulting on each page, which are table stakes for this use case.
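
For example, on Linux the per-page soft faults can be paid up front by asking mmap itself to populate the mapping (a fragment; fd and len are assumed to come from the usual open/fstat setup):

    /* MAP_POPULATE faults the whole file in at mmap time, so the
     * counting loop doesn't take a soft fault on every page. */
    char *buf = mmap(NULL, len, PROT_READ,
                     MAP_PRIVATE | MAP_POPULATE, fd, 0);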

The point of DDIO is that the data is frequently processed fast enough that it can get pulled from L3 to L1 and be finished with before it needs to be flushed to main memory. For a system with a coherent L3 or appropriate NUMA placement, this isn’t really non-temporal; it’s more like the L3 cache being shared with the device.