Is the manual loop unrolling really necessary to get vectorized machine code? I would have guessed that the highest optimization levels in LLVM would be able to figure it out from the basic code. That's a very uneducated guess, though.
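To make the question concrete, here is a rough sketch of the kind of "basic code" I have in mind. `count_equal` is a hypothetical stand-in for the article's hot loop, not the author's actual code. My understanding is that clang and gcc will usually auto-vectorize a branch-free reduction like this at `-O3`, but that depends on compiler version, target flags, and how the loop is written, so treat it as something to check rather than a claim.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hot loop: count elements equal to `target`.
 * No manual unrolling; the comparison result (0 or 1) is added directly,
 * which keeps the loop body branch-free and easy for a vectorizer to handle. */
size_t count_equal(const int32_t *data, size_t n, int32_t target) {
    assert(data != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (data[i] == target);
    }
    return count;
}
```

Compiling with `clang -O3 -Rpass=loop-vectorize` or `gcc -O3 -fopt-info-vec` should report whether the loop was actually vectorized on a given setup.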
Also, curious if you tried passing the MAP_POPULATE flag to mmap. Could that improve the bandwidth of the naive in-memory solution?
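Roughly what I mean, as a minimal Linux-only sketch. `sum_file_populated` is a made-up helper and the byte-sum loop is just a placeholder workload, not the article's benchmark. Per the mmap man page, `MAP_POPULATE` prefaults the page tables and triggers read-ahead for file mappings, so the scan shouldn't take a page fault per page; whether that helps bandwidth here I can't say without measuring.

```c
#define _GNU_SOURCE /* MAP_POPULATE is a Linux extension */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file with MAP_POPULATE and sum its bytes.
 * Returns -1 on any error (including an empty file, which cannot be mapped). */
int64_t sum_file_populated(const char *path) {
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;

    /* MAP_POPULATE asks the kernel to prefault the mapping up front
     * instead of faulting pages in lazily during the scan below. */
    const uint8_t *p = mmap(NULL, len, PROT_READ,
                            MAP_PRIVATE | MAP_POPULATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    int64_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum += p[i];
    }

    munmap((void *)p, len);
    return sum;
}
```

The mmap call itself takes longer with the flag, since the prefaulting happens there, so a fair comparison would time the map and the scan together.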
> humanity doesn't have the silicon fabs or the power plants to support this for every moron vibe coder out there making an app.
lol. I bet if someone took the time to make a high-quality, well-documented, fast I/O library based on your io_uring solution, it would get used.