
283 points ghuntley | 1 comment
inetknght ◴[] No.45133560[source]
Nice write-up with good information, but not the best. Comments below.

Are you using Linux? I assume so, since you mention using mmap() and EPYC hardware (which rules out macOS). I suppose you could be using any other *nix, though.

> We'll use a 50GB dataset for most benchmarking here, because when I started this I thought the test system only had 64GB and it stuck.

So the OS will (or could) prefetch the file into memory. OK.

> Our expectation is that the second run will be faster because the data is already in memory and as everyone knows, memory is fast.

Indeed.

> We're gonna make it very obvious to the compiler that it's safe to use vector instructions which could process our integers up to 8x faster.

There are even wider vector instructions, by the way. But you do address this about a page further down:

> NOTE: These are 128-bit vector instructions, but I expected 256-bit. I dug deeper here and found claims that Gen1 EPYC had unoptimized 256-bit instructions. I forced the compiler to use 256-bit instructions and found it was actually slower. Looks like the compiler was smart enough to know that here.

Yup, indeed :)

Also note that AVX2 and/or AVX-512 instructions are notorious for causing thermal throttling on certain (older by now?) CPUs.
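For reference, a hypothetical loop like the one below (my example, not the article's code) is the shape compilers auto-vectorize readily; with GCC or Clang at -O3 -march=native, the vectorizer report (-fopt-info-vec for GCC, -Rpass=loop-vectorize for Clang) will tell you which vector width it actually picked:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical example: with `restrict` and a simple trip count, GCC/Clang
     * at -O3 will usually vectorize this into SSE/AVX adds. Build with
     * -march=native and read the vectorizer report to see the chosen width. */
    uint64_t sum_u32(const uint32_t *restrict data, size_t n)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += data[i];
        return total;
    }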

> Consider how the default mmap() mechanism works, it is a background IO pipeline to transparently fetch the data from disk. When you read the empty buffer from userspace it triggers a fault, the kernel handles the fault by reading the data from the filesystem, which then queues up IO from disk. Unfortunately these legacy mechanisms just aren't set up for serious high performance IO. Note that at 610MB/s it's faster than what a SATA disk can do. On the other hand, it only managed 10% of our disk's potential. Clearly we're going to have to do something else.

In the worst case, that's true. But you can also get the kernel to prefetch the data.

See the man page for the various flags [0]; if you're doing sequential reading you can use MAP_POPULATE, which tells the kernel to prefault the mapping (for a file mapping, that kicks off read-ahead).
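For example, something like this (my sketch, not the article's code) maps a file read-only and asks the kernel to populate it up front:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Sketch: map a whole file read-only and prefault it with MAP_POPULATE,
     * so the read loop doesn't pay for demand faults one 4K page at a time. */
    void *map_file_populated(const char *path, size_t *len_out)
    {
        *len_out = 0;
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return MAP_FAILED;

        struct stat st;
        if (fstat(fd, &st) != 0) {
            close(fd);
            return MAP_FAILED;
        }

        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                       MAP_SHARED | MAP_POPULATE, fd, 0);
        close(fd);            /* the mapping keeps the file referenced */
        if (p != MAP_FAILED)
            *len_out = (size_t)st.st_size;
        return p;             /* MAP_FAILED on error */
    }

madvise() with MADV_SEQUENTIAL or MADV_WILLNEED after the mmap() gets you similar read-ahead behavior if you'd rather not block inside mmap() itself.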

You also mention 4K page table entries. Walking page tables can get very expensive in CPU time. I had that happen at a previous employer with an 800GB file; most of the CPU time was spent walking page tables. I fixed it by using (MAP_HUGETLB | MAP_HUGE_1GB) [0], which drastically reduces the number of page table entries needed to memory map huge files.
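Roughly like this, in sketch form (mine, and only the anonymous-mapping case since that's the simplest to show; it assumes 1 GiB huge pages were reserved beforehand, e.g. hugepagesz=1G hugepages=N on the kernel command line, and note that file-backed MAP_HUGETLB mappings want the file on hugetlbfs):

    #include <sys/mman.h>
    #include <linux/mman.h>   /* MAP_HUGE_1GB on older glibc */
    #include <stddef.h>

    /* Sketch: anonymous 1 GiB huge-page mapping. One page table entry then
     * covers 1 GiB instead of 4K. Returns MAP_FAILED if no 1 GiB huge pages
     * have been reserved on the system. */
    void *map_huge_1g(size_t n_gib)
    {
        return mmap(NULL, n_gib << 30, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                    -1, 0);
    }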

Importantly: when the OS realizes that you're accessing the same file a lot, it will just keep that file in the page cache. If you're only mapping it with PROT_READ and MAP_SHARED, then it won't even need to duplicate the physical memory to a new page: it can just re-use the existing physical memory with a new process-specific page table entry. This often ends up caching the file on first access.
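You can even watch that happen with mincore(), which reports which pages of a mapping are resident; a quick sketch (mine, not from the article):

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Sketch: count how many pages of a mapping are already resident in the
     * page cache. On a second pass over a recently-read file this should be
     * close to 100%. */
    size_t resident_pages(void *addr, size_t length)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t npages = (length + page - 1) / page;
        unsigned char *vec = malloc(npages);
        size_t resident = 0;

        if (vec && mincore(addr, length, vec) == 0)
            for (size_t i = 0; i < npages; i++)
                resident += vec[i] & 1;

        free(vec);
        return resident;
    }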

I had done some DNA calculations with fairly trivial 4-bit-wide data, each bit representing one of the DNA bases (ACGT). The calculation was pure bitwise operations: or, and, shift, etc. When I reached the memory bus throughput limit, I decided I was done optimizing. The system had 1.5TB of RAM, so I'd cache the file just by reading it once at boot. Initially caching the file would take 10-15 minutes, but then the calculations would run across the whole 800GB file in about 30 seconds. There were about 2000-4000 DNA samples to calculate three or four times a day. Before all of this was optimized, the daily inputs would take close to 10-16 hours to run. By the time I was done, the server was mostly idle.
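Something in the spirit of this, from memory (an illustration, not the actual code; it assumes a one-hot nibble per base and GCC/Clang's __builtin_popcountll):

    #include <stddef.h>
    #include <stdint.h>

    /* Illustration: each base is a 4-bit one-hot nibble (A=1, C=2, G=4, T=8),
     * 16 bases packed per uint64_t. Counting positions where two sequences
     * carry the same base is then just AND + popcount: pure bitwise work,
     * which is why the memory bus, not the ALU, becomes the bottleneck. */
    static size_t count_matching_bases(const uint64_t *a, const uint64_t *b,
                                       size_t nwords)
    {
        size_t matches = 0;
        for (size_t i = 0; i < nwords; i++)
            matches += (size_t)__builtin_popcountll(a[i] & b[i]);
        return matches;
    }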

[0]: https://www.man7.org/linux/man-pages/man2/mmap.2.html

replies(1): >>45134043 #
jared_hulbert ◴[] No.45134043[source]
    int fd = open(filename, O_RDONLY);
    void* buffer = mmap(NULL, size_bytes, PROT_READ, (MAP_HUGETLB | MAP_HUGE_1GB), fd, 0);

This doesn't work with a file on my ext4 volume. What am I missing?

replies(2): >>45134429 #>>45140871 #
1. inetknght ◴[] No.45140871[source]
My bad, don't use `MAP_HUGETLB`, just use `MAP_HUGE_1GB`.

See a quick example I whipped up here: https://github.com/inetknght/mmap-hugetlb