Generic readahead, which is what the mmap case is relying on, benefits from at least one async thread running in parallel, but I suspect that for any particular file you effectively get at most one thread filling the page cache in parallel.
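As a rough sketch of that mmap path (the filename "data.bin" is a placeholder and error handling is trimmed; this is not the code being discussed, just an illustration): each touched page can trigger a minor fault, and madvise(MADV_SEQUENTIAL) merely hints the kernel to widen its readahead window.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);   /* placeholder filename */
        if (fd < 0) return 1;
        struct stat st;
        fstat(fd, &st);

        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        /* Hint sequential access so generic readahead runs ahead of us,
         * but each 4K page still costs a fault and a PTE fixup. */
        madvise(p, st.st_size, MADV_SEQUENTIAL);

        uint64_t sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];               /* faults pull pages in on demand */

        printf("%llu\n", (unsigned long long)sum);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }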
What may also be important is the VM management. The splice and vmsplice syscalls came about because someone requested that Linux adopt a FreeBSD optimization: for sufficiently sized write calls (i.e. page size or larger), the OS would mark the page(s) CoW and zero-copy the data to disk or the network. But Linus measured that fiddling with VM page attributes on each call was so costly that it erased most of the zero-copy benefit.

So another thing to take note of is that the io_uring case doesn't induce any page faults or require any costly VM fiddling (the shared io_uring buffers are installed upfront), whereas the mmap case incurs many page faults and fixups, possibly as many as one for every 4K page. The io_uring case may even result in additional data copies, but they cost less than the VM fiddling, whose relative cost is even greater now than it was 20 years ago.
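To illustrate the "buffers installed upfront" point, here's a minimal liburing sketch (again with a placeholder filename and buffer size, error handling trimmed): io_uring_register_buffers pins the pages once, so subsequent fixed reads avoid per-I/O page faults and page-table fiddling entirely.

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define BUF_SZ (1 << 16)           /* placeholder buffer size */

    int main(void) {
        int fd = open("data.bin", O_RDONLY);   /* placeholder filename */
        if (fd < 0) return 1;

        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

        /* Register the buffer once: the kernel pins these pages upfront,
         * so there's no VM attribute work on each individual read. */
        void *buf = NULL;
        if (posix_memalign(&buf, 4096, BUF_SZ)) return 1;
        struct iovec iov = { .iov_base = buf, .iov_len = BUF_SZ };
        if (io_uring_register_buffers(&ring, &iov, 1) < 0) return 1;

        /* Read the first chunk into the pre-registered buffer (index 0). */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read_fixed(sqe, fd, buf, BUF_SZ, 0, 0);
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }

The data still gets copied out of the kernel into the registered buffer, but that copy is cheaper than faulting and fixing up a PTE per 4K page, which is the trade-off described above.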