283 points ghuntley | 29 comments
1. ayende ◴[] No.45135399[source]
This is wrong, because your mmap code is being stalled for page faults (including soft page faults that you have when the data is in memory, but not mapped to your process).

The io_uring code looks like it is doing all the fetch work in the background (with 6 threads), then just handing the completed buffers to the counter.

Do the same with 6 threads that first read the first byte of each page and then hand that page section to the counter, and you'll find similar performance.

And you can use both madvise / huge pages to control the mmap behavior.
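
A minimal, untested sketch of the madvise route (assumes Linux; "data.bin" and the newline counter are placeholders):

    /* Sketch: hint the kernel about the access pattern before the serial scan.
       MADV_WILLNEED starts readahead, MADV_HUGEPAGE asks for THP (Linux-only). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);             /* placeholder path */
        if (fd < 0) return 1;
        struct stat st;
        if (fstat(fd, &st) != 0) return 1;
        const char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        madvise((void *)p, st.st_size, MADV_SEQUENTIAL); /* aggressive readahead */
        madvise((void *)p, st.st_size, MADV_WILLNEED);   /* start paging in now */
        madvise((void *)p, st.st_size, MADV_HUGEPAGE);   /* transparent huge pages */

        long newlines = 0;                               /* stand-in for the counter */
        for (off_t i = 0; i < st.st_size; i++)
            newlines += (p[i] == '\n');
        printf("%ld\n", newlines);

        munmap((void *)p, st.st_size);
        close(fd);
        return 0;
    }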

replies(4): >>45135629 #>>45138707 #>>45140052 #>>45147766 #
2. lucketone ◴[] No.45135629[source]
It would seem you summarised the whole post.

That’s the point: “mmap” is slow because it is serial.

replies(1): >>45136283 #
3. arghwhat ◴[] No.45136283[source]
mmap isn't "serial"; the code that was using the mapping was. The kernel will happily fill different portions of the mapping in parallel if you have multiple threads fault on different pages.

(That doesn't undermine that io_uring and disk access can be fast, but it's comparing a lazy implementation using approach A with a quite optimized one using approach B, which does not make sense.)
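
For illustration, an untested sketch of that parallel-fault pattern (the 6-thread count and file name are just placeholders):

    /* Sketch: N threads touch one byte per page in disjoint slices of a shared
       mapping, so the kernel services page faults (and readahead) in parallel. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NTHREADS 6   /* placeholder: match it to your drives/cores */

    struct slice { const char *base; size_t len; size_t pagesz; };

    static void *prefault(void *arg) {
        struct slice *s = arg;
        unsigned long sum = 0;
        for (size_t off = 0; off < s->len; off += s->pagesz)
            sum += (unsigned char)s->base[off];   /* the read faults the page in */
        return (void *)sum;                       /* keep the loads from being elided */
    }

    int main(void) {
        int fd = open("data.bin", O_RDONLY);      /* placeholder path */
        if (fd < 0) return 1;
        struct stat st;
        if (fstat(fd, &st) != 0) return 1;
        const char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
        size_t chunk  = ((size_t)st.st_size + NTHREADS - 1) / NTHREADS;
        pthread_t tid[NTHREADS];
        struct slice sl[NTHREADS];

        for (int i = 0; i < NTHREADS; i++) {
            size_t start = (size_t)i * chunk;
            size_t len   = start < (size_t)st.st_size ? (size_t)st.st_size - start : 0;
            if (len > chunk) len = chunk;
            sl[i] = (struct slice){ p + start, len, pagesz };
            pthread_create(&tid[i], NULL, prefault, &sl[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        /* ...hand the now-resident buffer to the single-threaded counter here... */
        munmap((void *)p, st.st_size);
        close(fd);
        return 0;
    }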

replies(4): >>45136633 #>>45136749 #>>45136761 #>>45136970 #
4. amelius ◴[] No.45136633{3}[source]
OK, so we need a comparison between a multi-threaded mmap approach and io_uring. Which would be faster?
replies(1): >>45136885 #
5. immibis ◴[] No.45136749{3}[source]
How do you do embarrassingly async memory access with mmap?
replies(1): >>45137619 #
6. ◴[] No.45136761{3}[source]
7. nabla9 ◴[] No.45136885{4}[source]
If the memory access pattern is the same, there are no significant differences.
replies(1): >>45145423 #
8. rafaelmn ◴[] No.45136970{3}[source]
OK, this hasn't been my level of the stack for over a decade now, but writing multithreaded code that generates the same page faults on a shared mmap buffer, as opposed to something the kernel I/O scheduler does on your behalf (and presumably tries to schedule optimally for your machine/workload), does not sound comparable.

That's like arguing Python is not slower than C++ because you could technically write a specialized AOT compiler for your Python code that would generate equivalent assembly, so in the end it is the same?

9. lordgilman ◴[] No.45137619{4}[source]
You dereference a pointer.
replies(1): >>45138631 #
10. icedchai ◴[] No.45138631{5}[source]
From the application perspective, it's not truly async. On a dereference, your app may be blocked indefinitely as data is paged into memory. In the early 2000s I worked on systems that made heavy use of mmap. In constrained ("dev") environments with slow disks, you could be blocked for several seconds...
replies(1): >>45138968 #
11. mrlongroots ◴[] No.45138707[source]
Yes, it doesn't take a benchmark to find out that storage cannot be faster than memory.

Even if you had a million SSDs and were somehow able to connect them all to a single machine, you would not outperform memory, because the data needs to be read into memory first, and can only then be processed by the CPU.

Basic `perf stat` and minor/major faults should be a first-line diagnostic.

replies(3): >>45139067 #>>45143065 #>>45152315 #
12. jlokier ◴[] No.45138968{6}[source]
This branch of the discussion is about dereferencing on multiple threads concurrently. That doesn't block the application; each mmap'd dereference only blocks its own thread (same as doing read()).

In my own measurements with NVMe RAID, doing this works very well on Linux for storage I/O.

I was getting similar performance to io_uring with O_DIRECT, and faster performance when the data is likely to be in the page cache on multiple runs, because the multi-threaded mmap method shares the kernel page cache without copying data.

To measure this, replace the read() calls in the libuv thread pool function with single-byte dereferences, mmap a file, and call a lot of libuv async reads. That will make libuv do the dereferences in its thread pool and return to the main application thread having fetched the relevant pages. Make sure libuv is configured to use enough threads, as it doesn't use enough by default.
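
Not exactly that libuv change (which patches the read() inside libuv's thread-pool worker), but as a rough, untested approximation of the same idea, a sketch that uses uv_queue_work so the single-byte dereferences run on libuv's thread pool (placeholder file name; set UV_THREADPOOL_SIZE in the environment before the pool starts):

    /* Sketch: mmap the file, then queue chunks to the libuv thread pool; each
       worker touches one byte per page so the pages are resident by the time
       the completion callback runs on the loop thread. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <uv.h>

    #define NCHUNKS 64   /* placeholder: number of async "reads" to queue */

    struct chunk { const char *base; size_t len; size_t pagesz; };

    static void touch_pages(uv_work_t *req) {      /* runs on the thread pool */
        struct chunk *c = req->data;
        unsigned long sum = 0;
        for (size_t off = 0; off < c->len; off += c->pagesz)
            sum += (unsigned char)c->base[off];    /* fault the page in */
        (void)sum;
    }

    static void chunk_ready(uv_work_t *req, int status) {  /* loop thread */
        (void)status;
        /* the chunk is now resident: hand it to the counter here */
        free(req->data);
        free(req);
    }

    int main(void) {
        int fd = open("data.bin", O_RDONLY);       /* placeholder path */
        if (fd < 0) return 1;
        struct stat st;
        if (fstat(fd, &st) != 0) return 1;
        const char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        size_t pagesz    = (size_t)sysconf(_SC_PAGESIZE);
        size_t chunk_len = ((size_t)st.st_size + NCHUNKS - 1) / NCHUNKS;
        uv_loop_t *loop  = uv_default_loop();

        for (int i = 0; i < NCHUNKS; i++) {
            size_t start = (size_t)i * chunk_len;
            if (start >= (size_t)st.st_size) break;
            size_t len = (size_t)st.st_size - start;
            if (len > chunk_len) len = chunk_len;
            struct chunk *c = malloc(sizeof *c);
            uv_work_t *req  = malloc(sizeof *req);
            *c = (struct chunk){ p + start, len, pagesz };
            req->data = c;
            uv_queue_work(loop, req, touch_pages, chunk_ready);
        }
        uv_run(loop, UV_RUN_DEFAULT);              /* drain the work queue */
        munmap((void *)p, st.st_size);
        close(fd);
        return 0;
    }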

replies(1): >>45143826 #
13. alphazard ◴[] No.45139067[source]
> storage can not be faster than memory

This is an oversimplification. It depends on what you mean by memory. It may be true when using NVMe on modern architectures in a consumer use case, but it's not true of computer architecture in general.

External devices can have their memory mapped to virtual memory addresses. There are some network cards that do this for example. The CPU can load from these virtual addresses directly into registers, without needing to make a copy to the general purpose fast-but-volatile memory. In theory a storage device could also be implemented in this way.

replies(3): >>45140329 #>>45143170 #>>45147924 #
14. arunc ◴[] No.45140052[source]
Indeed. Use mmap with MAP_POPULATE, which will pre-populate the mapping.
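
For reference, an untested sketch of where the flag goes (MAP_POPULATE is Linux-specific; the path and helper name are placeholders):

    /* Sketch: MAP_POPULATE asks the kernel to fault the whole file in before
       mmap() returns, so the later scan doesn't stall on page faults. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static const char *map_populated(const char *path, size_t *len_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;
        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return NULL; }
        const char *p = mmap(NULL, st.st_size, PROT_READ,
                             MAP_PRIVATE | MAP_POPULATE, fd, 0);
        close(fd);                       /* the mapping stays valid after close */
        if (p == MAP_FAILED) return NULL;
        *len_out = (size_t)st.st_size;
        return p;
    }
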
replies(1): >>45143586 #
15. Thaxll ◴[] No.45140329{3}[source]
I mean, even with their memory mapped, physical access to that memory (NVMe) will always be slower than physical access to RAM, right?

Which external device has memory access as fast as, or faster than, generic RAM?

replies(1): >>45140586 #
16. usefulcat ◴[] No.45140586{4}[source]
I have heard that some Intel NICs can put received data directly into L3 cache. That would definitely make it faster to access than if it were in main RAM.

If a NIC can do that over PCI, probably other PCI devices could do the same, at least in theory.

replies(1): >>45145204 #
17. johncolanduoni ◴[] No.45143065[source]
This was a comparison of two methods of moving data from the VFS to application memory. Depending on cache status, this would run the whole gamut of mapping existing memory pages, kernel-to-userspace memory copies, and actual disk access.

Also, while we’re being annoyingly technical, a lot of server CPUs can DMA straight to the L3 cache so your proof of impossibility is not correct.

replies(1): >>45147872 #
18. johncolanduoni ◴[] No.45143170{3}[source]
On a modern desktop/server CPU, RAM and PCIe device-mapped memory do not share a bus. The equivalence is a fiction maintained by the MMU. Some chips (e.g. Apple Silicon) have unified memory, such that RAM is accessible from the CPU and devices (GPU) on a shared bus, but this is a little different.

Also, direct access to device memory is quite slow. High-throughput use cases like storage or networking have relied entirely on DMA from the device to system RAM for decades.

19. jared_hulbert ◴[] No.45143586[source]
Someone else suggested this; the results are even worse, by 2.5s.
20. ozgrakkurt ◴[] No.45143826{7}[source]
Off topic, but are you able to get a performance benefit out of using RAID with NVMe disks?
21. wahern ◴[] No.45145204{5}[source]
For the curious,

> When a 100G NIC is fully utilized with 64B packets and 20B Ethernet overhead, a new packet arrives every 6.72 nanoseconds on average. If any component on the packet path takes longer than this time to process the individual packet, a packet loss occurs. For a core running at 3GHz, 6.72 nanoseconds only accounts for 20 clock cycles, while the DRAM latency is 5-10 times higher, on average. This is the main bottleneck of the traditional DMA approach.

> The Intel® DDIO technology in Intel® Xeon® processors eliminates this bottleneck. Intel® DDIO technology allows PCIe devices to perform read and write operations directly to and from the L3 cache, or the last level cache (LLC).

https://www.intel.com/content/www/us/en/docs/vtune-profiler/...

replies(1): >>45147879 #
22. jared_hulbert ◴[] No.45145423{5}[source]
Just ran a version with 6 prefetching threads. I get 5.81GB/s. Same as io_uring with 2 drives, but still a lot slower than the in-memory solution.
23. guenthert ◴[] No.45147766[source]
Well, yes, but isn't one motivation of io_uring to make user-space programming simpler and (hence) less error-prone? I mean, I/O error handling on mmap isn't exactly trivial.
24. mrlongroots ◴[] No.45147872{3}[source]
> This was a comparison of two methods of moving data from the VFS to application memory. Depending on cache status, this would run the whole gamut of mapping existing memory pages, kernel-to-userspace memory copies, and actual disk access.

Yes, I think a reasonable statement is that a benchmark is supposed to isolate a meaningful effect, and this benchmark was not set up to do that, IMO.

> Also, while we’re being annoyingly technical, a lot of server CPUs can DMA straight to the L3 cache so your proof of impossibility is not correct.

Interesting, didn't know that, thanks!

I think this does not invalidate the point though. You can temporarily stream directly into the L3 cache with DDIO, but as the cache fills up, data will get flushed back to main memory anyway and you will ultimately be memory-bound. I don't think there is some non-temporal magic here that circumvents main memory entirely.

replies(1): >>45155285 #
25. mrlongroots ◴[] No.45147879{6}[source]
From a response elsewhere in this thread (my current understanding, could be wrong):

> You can temporarily stream directly into the L3 cache with DDIO, but as the cache fills up, data will get flushed back to main memory anyway and you will ultimately be memory-bound. I don't think there is some non-temporal magic here that circumvents main memory entirely.

replies(1): >>45148562 #
26. mrlongroots ◴[] No.45147924{3}[source]
> The CPU can load from these virtual addresses directly into registers

This requires the device to bring a meaningful amount of its own memory. GPUs do that with VRAM. A storage device does not come with its own RAM, but interesting point!

27. wahern ◴[] No.45148562{7}[source]
Ah. I guess it's only useful for an optimized router/firewall or network storage appliance, perhaps with a bespoke stack carefully tuned to quickly process and then hand the data back to the controller (via DDIO) before it flushes.

EDIT: Here's an interesting writeup about trying to make use of it with FreeBSD+netmap+ipfw: https://adrianchadd.blogspot.com/2015/04/intel-ddio-llc-cach... So it can work as advertised, it's just very constraining, requiring careful tuning if not outright rearchitecting your processing pipeline with the constraints in mind.

28. hinkley ◴[] No.45152315[source]
I’m pretty sure that as of PCI-E 2 this is not true.

It’s only true if you need to process the data before passing it on. You can do direct DMA transfers between devices.

In which case one needs to remember that memory isn’t on the CPU. It has to beg for data just about as much as any peripheral. It uses registers and L1, which are behind two other layers of cache and an MMU.

29. johncolanduoni ◴[] No.45155285{4}[source]
I agree, not a well-designed benchmark. The mmap side did not use any of the mechanisms for paging in data before faulting on each page, which are table stakes for this use case.

The point of DDIO is that the data is frequently processed fast enough that it can get pulled from L3 to L1 and be finished with before it needs to be flushed to main memory. For a system with a coherent L3 or appropriate NUMA, this isn't really non-temporal; it's more like the L3 cache is shared with the device.