This branch of the discussion is about dereferencing on multiple threads concurrently. That doesn't block the application: each mmap'd dereference only blocks its own thread, the same as a read() would.
In my own measurements with NVMe RAID, doing this works very well on Linux for storage I/O.
I was getting performance similar to io_uring with O_DIRECT, and faster performance when the data was likely to be in the page cache from previous runs, because the multi-threaded mmap method reads straight from the kernel page cache without copying the data.
To measure this, replace the read() calls in the libuv thread pool function with single-byte dereferences, mmap a file, and issue a lot of libuv async reads. That makes libuv perform the dereferences in its thread pool and return to the main application thread having faulted in the relevant pages. Make sure libuv is configured with enough threads: the default thread pool size is only 4, and you can raise it with the UV_THREADPOOL_SIZE environment variable.