What about direct-access hardware with DMA into CPU caches (NUMA-aware / cache-pinned)? PCIe NVMe controllers can DMA into user-space buffers (O_DIRECT or SPDK), and on platforms with direct-to-cache I/O (e.g. Intel DDIO) the data can land in the last-level cache instead of taking a round trip through main memory. IMO this is likely faster than io_uring + O_DIRECT, especially on large datasets; the catch is that it needs SPDK (Storage Performance Development Kit) or a similar user-space NVMe stack, plus careful buffer alignment.
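Here's a minimal sketch of just the O_DIRECT side (an aligned user-space buffer that the NVMe driver DMAs into, bypassing the page cache) — not the SPDK path, and the file name and buffer size are made up for illustration:

```c
// Sketch: O_DIRECT read into an aligned user-space buffer.
// The kernel NVMe driver DMAs straight into buf, skipping the page cache.
// Error handling trimmed for brevity.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    const size_t BUF_SZ = 1 << 20;             /* 1 MiB, multiple of block size */
    void *buf;
    if (posix_memalign(&buf, 4096, BUF_SZ))    /* O_DIRECT wants aligned buffers */
        return 1;

    int fd = open("data.bin", O_RDONLY | O_DIRECT);   /* hypothetical file name */
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n;
    off_t off = 0;
    while ((n = pread(fd, buf, BUF_SZ, off)) > 0) {
        /* ... count values in buf here (see the AVX-512 sketch below) ... */
        off += n;
    }
    close(fd);
    free(buf);
    return 0;
}
```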
You could also go SIMD: AVX-512 vectorization with prefetching, i.e. aggressively unroll the loop and use prefetch instructions to hide memory latency. The blog post used AVX2 (256-bit), but on a modern CPU AVX-512 processes 512 bits (16 x 32-bit integers) per instruction, and manual prefetching cuts down L1/L2 cache misses. I think this could beat the AVX2 counting from the blog.
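Something like this, assuming AVX-512BW support (compile with e.g. `gcc -O3 -mavx512bw`). It counts occurrences of a single byte value with a 2x unroll and a prefetch; the kernel shape is illustrative, not the blog's exact code:

```c
// Sketch: count occurrences of one byte value with AVX-512BW + prefetch.
// Assumes len is a multiple of 128; remainder handling omitted.
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

uint64_t count_byte_avx512(const uint8_t *buf, size_t len, uint8_t target) {
    const __m512i needle = _mm512_set1_epi8((char)target);
    uint64_t count = 0;

    for (size_t i = 0; i < len; i += 128) {          /* unrolled 2x (2 x 64 B) */
        /* prefetch a few iterations ahead to hide memory latency */
        _mm_prefetch((const char *)(buf + i + 512), _MM_HINT_T0);

        __m512i a = _mm512_loadu_si512(buf + i);
        __m512i b = _mm512_loadu_si512(buf + i + 64);

        /* cmpeq yields a 64-bit mask per vector; popcount gives the hit count */
        count += (uint64_t)__builtin_popcountll(_mm512_cmpeq_epi8_mask(a, needle));
        count += (uint64_t)__builtin_popcountll(_mm512_cmpeq_epi8_mask(b, needle));
    }
    return count;
}
```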
As for MAP_HUGETLB: with a normal mmap, 4 KB page faults add overhead. Mapping large pages (2 MB / 1 GB) means fewer page-table walks and fewer TLB misses, so on a 50 GB dataset hugepages can cut kernel overhead and speed up mmap-based counting. It's still likely slightly slower than O_DIRECT + vectorized counting, though, so I'd disregard it.
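For reference, a minimal hugepage mapping looks like this. Caveat: file-backed MAP_HUGETLB only works for files on hugetlbfs, so this sketch only shows the anonymous case (you'd read into it, or rely on THP for regular file mappings), and it assumes 2 MB hugepages have been reserved:

```c
// Sketch: anonymous 2 MB hugepage mapping on Linux.
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define LEN (1UL << 30)   /* 1 GiB worth of 2 MB pages */

int main(void) {
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {          /* fails if no hugepages are reserved */
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    /* ... fill and scan p with the vectorized counter ... */
    munmap(p, LEN);
    return 0;
}
```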
TL;DR:
I think the fastest practical CPU method is: O_DIRECT + aligned buffers + AVX-512 unrolled counting + NUMA-aware allocation (see the NUMA sketch after this list).
Theoretically fastest (requires specialized setup): SPDK / DMA -> CPU caches + AVX512, bypassing main memory.
Mmap or page-cache approaches are neat but generally slower for large, single-pass sequential workloads.
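For the NUMA-aware allocation bit, a minimal sketch with libnuma (link with `-lnuma`); node 0 is just an example, you'd pick the node local to the core doing the counting:

```c
// Sketch: NUMA-aware buffer placement with libnuma.
// Keeps the scan buffer on one node so the counting thread reads local memory.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return 1;
    }
    size_t len = 1UL << 30;                  /* 1 GiB buffer */
    void *buf = numa_alloc_onnode(len, 0);   /* place on node 0 (example) */
    if (!buf) return 1;

    /* ... pin the counting thread to a node-0 core and scan buf ... */

    numa_free(buf, len);
    return 0;
}
```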
Just my two cents. Thoughts?