What about direct-access hardware with DMA into CPU caches (NUMA-aware / cache-pinned)? PCIe NVMe controllers can DMA into user-space buffers (O_DIRECT or SPDK), and on platforms with direct-to-cache I/O (e.g. Intel DDIO) the data can land in the last-level cache instead of taking a round trip through main memory. IMO this is likely faster than io_uring + O_DIRECT, especially on large datasets; the catch is that it needs SPDK (Storage Performance Development Kit) or a similar user-space NVMe stack, plus careful buffer alignment.
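Here's a minimal sketch of just the O_DIRECT side (an aligned user-space buffer that the NVMe driver DMAs into, bypassing the page cache) — not the SPDK path, and the file name and buffer size are made up for illustration:

```c
// Sketch: O_DIRECT read into an aligned user-space buffer.
// The kernel NVMe driver DMAs straight into buf, skipping the page cache.
// Error handling trimmed for brevity.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    const size_t BUF_SZ = 1 << 20;             /* 1 MiB, multiple of block size */
    void *buf;
    if (posix_memalign(&buf, 4096, BUF_SZ))    /* O_DIRECT wants aligned buffers */
        return 1;

    int fd = open("data.bin", O_RDONLY | O_DIRECT);   /* hypothetical file name */
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n;
    off_t off = 0;
    while ((n = pread(fd, buf, BUF_SZ, off)) > 0) {
        /* ... count values in buf here (see the AVX-512 sketch below) ... */
        off += n;
    }
    close(fd);
    free(buf);
    return 0;
}
```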
You could also go SIMD: AVX-512 vectorization with prefetching, i.e. aggressively unroll the loop and use prefetch instructions to hide memory latency. The blog post used AVX2 (256-bit), but on a modern CPU AVX-512 processes 512 bits (16 x 32-bit integers) per instruction, and manual prefetching cuts down L1/L2 cache misses. I think this could beat the AVX2 counting from the blog.
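Something like this, assuming AVX-512BW support (compile with e.g. `gcc -O3 -mavx512bw`). It counts occurrences of a single byte value with a 2x unroll and a prefetch; the kernel shape is illustrative, not the blog's exact code:

```c
// Sketch: count occurrences of one byte value with AVX-512BW + prefetch.
// Assumes len is a multiple of 128; remainder handling omitted.
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

uint64_t count_byte_avx512(const uint8_t *buf, size_t len, uint8_t target) {
    const __m512i needle = _mm512_set1_epi8((char)target);
    uint64_t count = 0;

    for (size_t i = 0; i < len; i += 128) {          /* unrolled 2x (2 x 64 B) */
        /* prefetch a few iterations ahead to hide memory latency */
        _mm_prefetch((const char *)(buf + i + 512), _MM_HINT_T0);

        __m512i a = _mm512_loadu_si512(buf + i);
        __m512i b = _mm512_loadu_si512(buf + i + 64);

        /* cmpeq yields a 64-bit mask per vector; popcount gives the hit count */
        count += (uint64_t)__builtin_popcountll(_mm512_cmpeq_epi8_mask(a, needle));
        count += (uint64_t)__builtin_popcountll(_mm512_cmpeq_epi8_mask(b, needle));
    }
    return count;
}
```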
As for MAP_HUGETLB: with a normal mmap, 4 KB page faults add overhead. Mapping large pages (2 MB / 1 GB) means fewer page-table walks and fewer TLB misses, so on a 50 GB dataset hugepages can cut kernel overhead and speed up mmap-based counting. It's still likely slightly slower than O_DIRECT + vectorized counting, though, so I'd disregard it.
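For reference, a minimal hugepage mapping looks like this. Caveat: file-backed MAP_HUGETLB only works for files on hugetlbfs, so this sketch only shows the anonymous case (you'd read into it, or rely on THP for regular file mappings), and it assumes 2 MB hugepages have been reserved:

```c
// Sketch: anonymous 2 MB hugepage mapping on Linux.
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define LEN (1UL << 30)   /* 1 GiB worth of 2 MB pages */

int main(void) {
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {          /* fails if no hugepages are reserved */
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    /* ... fill and scan p with the vectorized counter ... */
    munmap(p, LEN);
    return 0;
}
```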
TL;DR:
I think the fastest practical CPU method is: O_DIRECT + aligned buffers + AVX-512 unrolled counting + NUMA-aware allocation (see the NUMA sketch after this list).
Theoretically fastest (requires specialized setup): SPDK / DMA -> CPU caches + AVX512, bypassing main memory.
Mmap or page-cache approaches are neat but generally slower for large, single-pass sequential workloads.
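For the NUMA-aware allocation bit, a minimal sketch with libnuma (link with `-lnuma`); node 0 is just an example, you'd pick the node local to the core doing the counting:

```c
// Sketch: NUMA-aware buffer placement with libnuma.
// Keeps the scan buffer on one node so the counting thread reads local memory.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return 1;
    }
    size_t len = 1UL << 30;                  /* 1 GiB buffer */
    void *buf = numa_alloc_onnode(len, 0);   /* place on node 0 (example) */
    if (!buf) return 1;

    /* ... pin the counting thread to a node-0 core and scan buf ... */

    numa_free(buf, len);
    return 0;
}
```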
Just my two cents. Thoughts?