This is pretty great. I only learned to use perf_events to see annotated disassembly a few weeks ago, although I don't know how to interpret what I see there yet.
I suspect the mmap() slowness identified here is at least partly fixable, for example by having the kernel map pages that are already in RAM more eagerly. So it wouldn't surprise me (though see above for how much I'm not an expert) if next year mmap were faster than io_uring again.
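For what it's worth, userspace can already ask for something like this today. Below is a rough sketch, not a benchmark and not the kernel-side change I'm speculating about: it maps a file read-only and, when `eager` is set, passes Linux's `MAP_POPULATE` flag and an `madvise(MADV_WILLNEED)` hint so pages get faulted in up front rather than one page fault at a time. The function name `sum_bytes_mmap` is made up for illustration, and I'm assuming Linux and a recent Python (`mmap.MAP_POPULATE` only shows up in 3.10+, hence the `getattr` fallback).

```python
import mmap
import os


def sum_bytes_mmap(path, eager=True):
    """Map a file read-only and sum its bytes.

    With eager=True, ask the kernel to prefault the mapping
    (MAP_POPULATE / MADV_WILLNEED) where the platform supports it.
    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return 0  # mmap cannot map an empty file

        flags = mmap.MAP_SHARED
        if eager:
            # MAP_POPULATE is Linux-specific; fall back to 0 elsewhere.
            flags |= getattr(mmap, "MAP_POPULATE", 0)

        with mmap.mmap(f.fileno(), size, flags=flags, prot=mmap.PROT_READ) as m:
            if eager and hasattr(m, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
                try:
                    # Hint only: the kernel is free to ignore it.
                    m.madvise(mmap.MADV_WILLNEED)
                except OSError:
                    pass
            # Slicing copies the mapped bytes, touching every page.
            return sum(m[:])
```

Timing `eager=True` against `eager=False` on a file that is already in the page cache would be one way to see how much of the cost is page-fault overhead, though I haven't measured it myself.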