
283 points ghuntley | 3 comments
1. kragen No.45136753
This is pretty great. I only learned to use perf_events to see annotated disassembly a few weeks ago, although I don't know how to interpret what I see there yet.

I suspect the mmap() slowness identified here is at least partly fixable, for example by mapping already-in-RAM pages more eagerly. So it wouldn't surprise me (though see above for how much I'm not an expert) if next year mmap were faster than io_uring again.
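A minimal sketch of what eager mapping already looks like from userspace, using Linux's MAP_POPULATE flag (with MADV_WILLNEED as the asynchronous variant); the kernel-side fix I have in mind would do something like this automatically, and error handling is elided here:

    #define _GNU_SOURCE  /* for MAP_POPULATE */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* MAP_POPULATE pre-faults the whole mapping at mmap() time, so
           already-in-page-cache pages land in the page tables up front
           instead of costing one fault per page during the scan. */
        char *p = mmap(NULL, st.st_size, PROT_READ,
                       MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Asynchronous alternative: map lazily, then hint readahead with
           madvise(p, st.st_size, MADV_WILLNEED); */

        long newlines = 0;
        for (off_t i = 0; i < st.st_size; i++)
            newlines += (p[i] == '\n');
        printf("%ld\n", newlines);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }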

replies(1): >>45141994
2. jared_hulbert No.45141994
The io_uring solution avoids this whole mapping effort: it doesn't have to map the already-in-RAM pages at all, it just reuses a small set of buffers. So there is a lot of random, cache-miss-prone work that mmap() has to do and the io_uring solution avoids. If mmap() did that work in the background it would catch up with io_uring, and I'd then have to get a couple more drives to get io_uring to catch back up. With enough drives I'd bet they'd be closer than you think.

I still think I could get the io_uring version to be faster than the mmap() one even if the count never faulted, mostly because the io_uring solution has a smaller TLB footprint and its buffers can fit in L3 cache. But it'd be tough.
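To make the buffer-reuse point concrete, here's a minimal sketch (not my actual benchmark code; error and short-read handling elided) of the shape of the io_uring loop using liburing. It keeps QD reads in flight into a fixed set of buffers, and each buffer is handed straight back for the next chunk once it's been counted:

    /* Build with -luring. QD and BUF_SZ are arbitrary choices. */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define QD     8
    #define BUF_SZ (256 * 1024)

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        struct io_uring ring;
        io_uring_queue_init(QD, &ring, 0);

        off_t next = 0;
        long newlines = 0;
        int inflight = 0;

        /* Prime the ring: one read in flight per buffer. */
        for (int i = 0; i < QD && next < st.st_size; i++) {
            char *buf = malloc(BUF_SZ);
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buf, BUF_SZ, next);
            io_uring_sqe_set_data(sqe, buf);
            next += BUF_SZ;
            inflight++;
        }
        io_uring_submit(&ring);

        while (inflight > 0) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            char *buf = io_uring_cqe_get_data(cqe);
            int n = cqe->res;                 /* bytes read (<0 = error) */
            io_uring_cqe_seen(&ring, cqe);
            inflight--;

            for (int i = 0; i < n; i++)       /* consume the data */
                newlines += (buf[i] == '\n');

            if (next < st.st_size) {          /* reuse the same buffer */
                struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                io_uring_prep_read(sqe, fd, buf, BUF_SZ, next);
                io_uring_sqe_set_data(sqe, buf);
                next += BUF_SZ;
                inflight++;
                io_uring_submit(&ring);
            }
        }
        printf("%ld\n", newlines);
        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }

The working set here is QD * BUF_SZ = 2 MiB of buffers touched over and over, which is why the TLB and L3 behavior is so much friendlier than walking a mapping the size of the whole file.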
replies(1): >>45143282
3. kragen No.45143282
I agree that io_uring is a fundamentally more efficient approach, but I think the performance limits you're currently measuring with mmap() aren't the fundamental ones imposed by the mmap() API, and I think that's what you're saying too?