The io_uring code looks like it is doing all the fetch work in the background (with 6 threads), then just handing the completed buffers to the counter.
Do the same for mmap: have 6 threads first read the first byte of each page and then hand that section of pages to the counter, and you should find similar performance.
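A rough sketch of what that could look like, not taken from the benchmark itself. The names `prefetch_ctx`, `prefetch_worker` and `count_with_prefetch`, the 1 MiB hand-off size, and the byte-matching "counter" are all placeholders I made up for illustration; the real counter would go where the inner loop is.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

#define PREFETCH_THREADS 6                  /* same thread count as the io_uring side */
#define CHUNK_BYTES ((size_t)1 << 20)       /* assumed hand-off granularity: 1 MiB */

typedef struct {
    const unsigned char *base;  /* start of the mapped region */
    size_t len;                 /* region length in bytes */
    size_t page;                /* system page size */
    size_t nchunks;             /* number of CHUNK_BYTES sections */
    atomic_size_t next;         /* next section index to claim */
    atomic_bool *ready;         /* ready[i] is set once section i has been touched */
} prefetch_ctx;

/* Worker: claim sections and read one byte per page to trigger the page faults. */
static void *prefetch_worker(void *arg) {
    prefetch_ctx *c = arg;
    for (;;) {
        size_t i = atomic_fetch_add(&c->next, 1);
        if (i >= c->nchunks) return NULL;
        size_t start = i * CHUNK_BYTES;
        size_t end = (c->len - start > CHUNK_BYTES) ? start + CHUNK_BYTES : c->len;
        volatile unsigned char sink = 0;    /* volatile so the reads are not optimized away */
        for (size_t off = start; off < end; off += c->page)
            sink = c->base[off];
        (void)sink;
        atomic_store_explicit(&c->ready[i], true, memory_order_release);
    }
}

/* Placeholder counter: counts bytes equal to `needle`, consuming sections in
 * order while the workers fault in the sections ahead of it. */
size_t count_with_prefetch(const unsigned char *base, size_t len, unsigned char needle) {
    if (len == 0) return 0;
    prefetch_ctx c = { .base = base, .len = len };
    c.page = (size_t)sysconf(_SC_PAGESIZE);
    c.nchunks = (len + CHUNK_BYTES - 1) / CHUNK_BYTES;
    c.ready = calloc(c.nchunks, sizeof *c.ready);
    if (c.ready == NULL) return 0;
    atomic_init(&c.next, 0);

    pthread_t tid[PREFETCH_THREADS];
    int started = 0;
    while (started < PREFETCH_THREADS &&
           pthread_create(&tid[started], NULL, prefetch_worker, &c) == 0)
        started++;
    if (started == 0) prefetch_worker(&c);  /* no threads available: touch pages inline */

    size_t total = 0;
    for (size_t i = 0; i < c.nchunks; i++) {
        while (!atomic_load_explicit(&c.ready[i], memory_order_acquire))
            sched_yield();                  /* wait for the section to be handed over */
        size_t start = i * CHUNK_BYTES;
        size_t end = (len - start > CHUNK_BYTES) ? start + CHUNK_BYTES : len;
        for (size_t off = start; off < end; off++)
            total += (base[off] == needle);
    }

    for (int t = 0; t < started; t++) pthread_join(tid[t], NULL);
    free(c.ready);
    return total;
}
```

The point is only that the page faults happen on the 6 worker threads while the counter runs on already-resident memory, which is roughly the division of labor the io_uring version gets. Whether it actually matches the io_uring numbers would need to be measured.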
And you can use both madvise and huge pages to control the mmap behavior.
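For the madvise part, something along these lines is what I have in mind. `map_with_hints` is a hypothetical helper name, and the choice of hints is my assumption about what suits a sequential scan. `MADV_HUGEPAGE` is Linux-specific and depends on transparent huge page support; for ordinary file-backed mappings it may do nothing or fail, so the sketch ignores its result.

```c
#define _DEFAULT_SOURCE   /* exposes madvise() and the MADV_* constants on glibc */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file read-only and apply access-pattern hints.
 * Returns MAP_FAILED on error; on success *len_out receives the mapped length. */
void *map_with_hints(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return MAP_FAILED;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return MAP_FAILED;
    }
    size_t len = (size_t)st.st_size;

    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);            /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) return MAP_FAILED;

    /* The hints are advisory, so failures are deliberately ignored. */
    (void)madvise(p, len, MADV_SEQUENTIAL);  /* expect a front-to-back scan */
    (void)madvise(p, len, MADV_WILLNEED);    /* ask the kernel to start reading ahead */
#ifdef MADV_HUGEPAGE
    (void)madvise(p, len, MADV_HUGEPAGE);    /* Linux only; may be a no-op for file mappings */
#endif

    *len_out = len;
    return p;
}
```

The caller is expected to `munmap` the region when done. Which hints help, if any, depends on the kernel, the filesystem and the access pattern, so treat the combination above as a starting point to benchmark rather than a recommendation.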