But the arm64 systems with 16K or 64K native pages would have fewer faults.
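If you want to sanity-check that on your own box, something like this (a rough sketch, assuming Linux and an anonymous mapping) shows the minor-fault count tracking size / page size, so a 64K-page kernel should report roughly 16x fewer faults than a 4K one for the same region:

    /* Count minor faults for first-touch of an anonymous mapping.
       Sketch only: fault count should be roughly size / page size. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(void) {
        size_t size = 1UL << 30;            /* 1 GiB region */
        long page = sysconf(_SC_PAGESIZE);  /* 4K, 16K or 64K depending on the kernel */

        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;

        struct rusage before, after;
        getrusage(RUSAGE_SELF, &before);
        for (size_t off = 0; off < size; off += (size_t)page)
            p[off] = 1;                     /* first touch => one minor fault per page */
        getrusage(RUSAGE_SELF, &after);

        printf("page size %ld, minor faults %ld (expect ~%zu)\n",
               page, after.ru_minflt - before.ru_minflt, size / (size_t)page);
        munmap(p, size);
        return 0;
    }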
And, even better, put all the lines on the same chart, or at least with the same y-axis scale (perhaps make them all relative to their base on the left), so that we can see the relative rate of growth?
And, you can poke around in the Linux kernel's source code to determine how it works. I had a related issue that I ended up digging around to find the answer to: what happens if you use mremap() to expand the mapping and it fails; is the old mapping still valid or not? Answer: it's still valid. I found that it was actually fairly easy to read Linux kernel C code, compared to a lot (!) of other C libraries I've tried to understand.
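If anyone wants to see it for themselves, here's roughly the experiment (a minimal sketch, not the code I was actually debugging): grow an anonymous mapping in place without MREMAP_MAYMOVE, and if mremap() returns MAP_FAILED the original pointer still works.

    /* Sketch: when mremap() fails, the original mapping stays valid. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t old_size = 1UL << 20;  /* 1 MiB */
        char *p = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;
        strcpy(p, "still here");

        /* Try to grow in place (no MREMAP_MAYMOVE); this fails with ENOMEM
           if the adjacent virtual address range is already taken. */
        char *q = mremap(p, old_size, 1UL << 30, 0);
        if (q == MAP_FAILED) {
            /* The old mapping is untouched and still usable. */
            printf("mremap failed (%s), old data: %s\n", strerror(errno), p);
            munmap(p, old_size);
        } else {
            munmap(q, 1UL << 30);  /* expansion succeeded */
        }
        return 0;
    }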
When I put the lines on the same chart it made the y axis impossible to understand. The units are so different. Maybe I'll revisit that.
Yeah, around 2000-2010 the doubling is noticeable. Interestingly it's also when a lot of factors started to stagnate.
AMD has something similar.
The PCIe bus and memory bus both originate from the processor or I/O die of the "CPU". When you use an NVMe drive you are really just sending it a bunch of structured DMA requests. Normally you are telling the drive to DMA to an address that maps to memory, so the hardware can direct it into the cache and bypass sending it out on the DRAM bus.
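To make "a bunch of structured DMA requests" concrete: each command the host queues is a 64-byte submission queue entry that names the LBA range and the host physical address the drive should DMA into. A simplified sketch of that layout (field names are mine, trimmed from the NVMe spec); DDIO only changes where that inbound write lands, LLC instead of DRAM:

    /* Simplified sketch of a 64-byte NVMe I/O submission queue entry.
       Field names are mine; see the NVMe spec for the real layout. */
    #include <stdint.h>

    struct nvme_sqe {
        uint8_t  opcode;       /* e.g. 0x02 = Read, 0x01 = Write */
        uint8_t  flags;
        uint16_t command_id;   /* echoed back in the completion entry */
        uint32_t nsid;         /* namespace ID */
        uint64_t reserved;
        uint64_t metadata_ptr;
        uint64_t prp1;         /* host physical address the drive DMAs to/from */
        uint64_t prp2;         /* second PRP entry, or pointer to a PRP list */
        uint64_t slba;         /* starting logical block address */
        uint16_t nlb;          /* number of logical blocks, zero-based */
        uint16_t io_flags;
        uint32_t cdw13_15[3];  /* dataset management, etc. */
    };

    _Static_assert(sizeof(struct nvme_sqe) == 64, "SQE is 64 bytes");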
In theory... the specifics of what is supported exactly? I can't vouch for that.
You might be able to set up SPDK to send data directly into the cpu cache? It’s one of those things I’ve wanted to play with for years but honestly I don’t know enough about it.
With the Intel connection they might have explicit support for DDIO. Good idea.
A log axis solves this, and generally turns meaningless hockey sticks into a straightish line that you can actually parse. If it still deviates from straight, then you really know there are true changes in the trendline.
Lines on the same chart can all be divided by their initial value, anchoring them all at 1. Sometimes they're still a mess, but it's always worth a try.
You're enormously knowledgeable and the posts were fascinating. But this is stats 101. Not doing this sort of thing, especially explicitly in favour of showing a hockey stick, undermines the fantastic analysis.
This is in contrast with the io_uring worker method, where you keep the thread busy by submitting requests and letting the kernel do the work without expensive user/kernel crossings.
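Roughly like this (a minimal liburing sketch, not the benchmark code from the article): queue a batch of reads, submit them with a single syscall, and reap completions on the same thread.

    /* Minimal liburing loop: batch submissions, then reap completions.
       Sketch only; error handling trimmed, file name is made up. */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define QD  32
    #define BLK 4096

    int main(void) {
        struct io_uring ring;
        if (io_uring_queue_init(QD, &ring, 0) < 0) return 1;

        int fd = open("datafile", O_RDONLY);
        if (fd < 0) return 1;

        for (int i = 0; i < QD; i++) {
            char *buf = malloc(BLK);
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buf, BLK, (off_t)i * BLK);
            io_uring_sqe_set_data(sqe, buf);
        }
        io_uring_submit(&ring);          /* one syscall for the whole batch */

        for (int i = 0; i < QD; i++) {   /* completions arrive as the kernel finishes */
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            if (cqe->res < 0)
                fprintf(stderr, "read failed: %d\n", cqe->res);
            free(io_uring_cqe_get_data(cqe));
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
    }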
The 2 GB fully in-memory run shows the CPU's real perf. The dip at 50 GB is interesting; perhaps when going over 50% of memory the Linux kernel evicts pages or something similar that hurts perf. Maybe plot a graph of perf vs test size to see if there is an obvious cliff.
But with SPDK you'll be talking to the disk, not to files. If you changed io_uring to read from the disk directly with O_DIRECT, you wouldn't have those extra 6 threads either. SPDK would still be considerably more CPU efficient but not 6x.
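For reference, the O_DIRECT change is mostly just the open flag plus alignment (a sketch, assuming the device is happy with 4 KiB alignment):

    /* Sketch: O_DIRECT bypasses the page cache, so no kernel workers are
       needed to fault pages in -- but the buffer, offset and length must
       all be aligned (512 B or 4 KiB depending on the device). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("datafile", O_RDONLY | O_DIRECT);
        if (fd < 0) return 1;

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;  /* aligned buffer */

        ssize_t n = pread(fd, buf, 4096, 0);  /* aligned length and offset */

        free(buf);
        close(fd);
        return n < 0;
    }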
DDIO is a pure hardware feature. Software doesn't need to do anything to support it.
Source: SPDK co-creator
I think I'm crossing the NUMA boundary, which means some percentage of the accesses are higher latency.
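If that's it, pinning the thread and its buffers to one node should make the dip go away. A sketch using libnuma (link with -lnuma; node 0 picked arbitrarily):

    /* Sketch: keep the thread and its memory on the same NUMA node so no
       accesses cross the interconnect. Link with -lnuma. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }
        int node = 0;
        numa_run_on_node(node);                      /* pin this thread to node 0's CPUs */

        size_t size = 1UL << 30;
        void *buf = numa_alloc_onnode(size, node);   /* allocate from node 0's memory */
        if (!buf) return 1;

        /* ... run the benchmark against buf ... */

        numa_free(buf, size);
        return 0;
    }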