283 points ghuntley | 20 comments
1. jared_hulbert No.45133330
Cool. Original author here. AMA.
replies(5): >>45133433 #>>45133597 #>>45133666 #>>45133764 #>>45135337 #
2. Jap2-0 No.45133433
Would huge pages help with the mmap case?
replies(2): >>45133546 #>>45133572 #
3. jared_hulbert No.45133546
Oh man... I'd have to look into that. Off the top of my head I don't know how you'd make that happen. Way back when I'd have said no; now, with all the folio updates to the Linux kernel's memory handling, I'm not sure. I think you'd have to take care to make sure the data gets into the page cache as huge pages. If not, then when you tried to madvise() (or whatever) the buffer to use huge pages, it would likely just ignore you. In theory it could aggregate the small pages into huge pages, but that would be more latency-bound work, and it's not clear how that impacts the page cache.

But the arm64 systems with 16K or 64K native pages would have fewer faults.
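
A minimal sketch of the madvise() route being described, assuming a file-backed mapping; whether the page cache actually ends up using huge pages depends on kernel version and filesystem support, as noted above. The path is a placeholder.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/path/to/data", O_RDONLY);   /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Hint that this range should use (transparent) huge pages.
           On many kernels/filesystems this is a no-op for file-backed
           memory -- the "it would likely just ignore you" case. */
        if (madvise(p, st.st_size, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        /* ... scan through p here ... */
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }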

replies(1): >>45133578 #
4. inetknght No.45133572
> Would huge pages help with the mmap case?

Yes. Tens or hundreds of gigabytes' worth of 4K page table entries take a while for the OS to navigate.
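
(For a rough sense of scale, not the commenter's numbers: 100GB of 4K pages is about 26 million pages, so roughly 26 million PTEs at 8 bytes each, on the order of 200MB of last-level page tables, plus one TLB entry per 4K page touched. With 2MB huge pages the same 100GB needs about 51,200 entries, and with 1GB pages just 100.)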

5. inetknght No.45133578{3}
> I'd have to look into that. Off the top of my head I don't know how you'd make that happen.

Pass these flags to your mmap call: (MAP_HUGETLB | MAP_HUGE_1GB)
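
A minimal sketch of the flags being suggested, shown with an anonymous mapping for simplicity; 1GB huge pages generally have to be reserved ahead of time (e.g. hugepagesz=1G hugepages=N on the kernel command line), and the fallback #define is only there for older headers.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB (30 << 26)   /* log2(1GB) shifted by MAP_HUGE_SHIFT (26) */
    #endif

    int main(void) {
        size_t len = 1UL << 30;       /* one 1GB huge page */

        /* Anonymous mapping backed by reserved 1GB huge pages. Assumes the
           pages were reserved at boot; otherwise this fails (ENOMEM/EINVAL).
           A file-backed variant needs hugetlbfs or kernel THP support. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);
        if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB|MAP_HUGE_1GB)"); return 1; }

        /* ... use p ... */
        munmap(p, len);
        return 0;
    }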

replies(1): >>45133626 #
6. nchmy No.45133597
I just saw this post, so I'm starting with Part 1. Could you replace the charts with ones on some sort of log scale? It makes it look like nothing happened until 2010, but I'd wager it's just an optical illusion...

And, even better, put all the lines on the same chart, or at least use the same y-axis scale (perhaps make them all relative to their base value on the left), so that we can see the relative rates of growth?

replies(1): >>45133734 #
7. jared_hulbert No.45133626{4}
Would this actually create huge page page cache entries?
replies(1): >>45133675 #
8. john-h-k No.45133666
You mention that modern server CPUs have the capability to “read direct to L3, skipping memory”. Can you elaborate on this?
replies(1): >>45133767 #
9. inetknght No.45133675{5}
It's right in the documentation for mmap() [0]! And, from my experience, using it with an 800GB file provided a significant speed-up, so I do believe the documentation is correct ;)

And you can poke around in the Linux kernel's source code to determine how it works. I had a related issue that I ended up digging into to find the answer: what happens if you use mremap() to expand a mapping and it fails; is the old mapping still valid or not? Answer: it's still valid. I found the Linux kernel's C code actually fairly easy to read, compared to a lot (!) of other C libraries I've tried to understand.

[0]: https://www.man7.org/linux/man-pages/man2/mmap.2.html
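
A minimal sketch of the mremap() behaviour described above: grow the mapping in place (no MREMAP_MAYMOVE) and, if that fails, keep using the original mapping. The function name and sizes are illustrative.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    /* Grow a mapping in place. Without MREMAP_MAYMOVE the call only succeeds
       if the region can expand at its current address; on failure the original
       mapping is untouched and still usable. */
    void *grow_mapping(void *old_addr, size_t old_size, size_t new_size) {
        void *p = mremap(old_addr, old_size, new_size, 0);
        if (p == MAP_FAILED) {
            perror("mremap");
            return old_addr;   /* keep using the old, smaller mapping */
        }
        return p;              /* same address, now new_size bytes long */
    }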

10. jared_hulbert No.45133734
I tried a log scale before. It failed to express the exponential hockey-stick growth unless you really spend time with the charts and know what a log scale is. I'll work on incorporating a log scale due to popular demand. The charts do show that progress has been nicely exponential over time.

When I put the lines on the same chart it made the y axis impossible to understand; the units are so different. Maybe I'll revisit that.

Yeah, around 2000-2010 the doubling is noticeable. Interestingly, that's also when a lot of factors started to stagnate.

replies(1): >>45134716 #
11. comradesmith No.45133764
Thanks for the article. What about using file reads from a mounted ramdisk?
replies(1): >>45134658 #
12. jared_hulbert No.45133767
https://www.intel.com/content/www/us/en/io/data-direct-i-o-t...

AMD has something similar.

The PCIe bus and memory bus both originate from the processor or IO die of the "CPU". When you use an NVMe drive you are really just sending it a bunch of structured DMA requests. Normally you are telling the drive to DMA to an address that maps to memory; with DDIO you can direct it to cache instead and bypass sending it out on the DRAM bus.

In theory, anyway. As for the specifics of exactly what is supported, I can't vouch for that.

replies(1): >>45134306 #
13. josephg No.45134306{3}
I’d be fascinated to see a comparison with SPDK. That bypasses the kernel’s NVMe / SSD driver and controls the whole device from user space - which is supposed to avoid a lot of copies and overhead.

You might be able to set up SPDK to send data directly into the cpu cache? It’s one of those things I’ve wanted to play with for years but honestly I don’t know enough about it.

https://spdk.io/

replies(1): >>45134505 #
14. jared_hulbert No.45134505{4}
SPDK and I go way back. I'm confident it'd be about the same, possibly ~200-300MB/s more; I was pretty close to the rated throughput of the drives. io_uring has really closed the gap that used to exist between in-kernel and userspace solutions.

With the Intel connection they might have explicit support for DDIO. Good idea.
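
For reference, a minimal sketch of a single O_DIRECT read submitted through io_uring via liburing; the path, queue depth, and I/O size are placeholders, and this is not the article's actual benchmark code.

    /* Build with: gcc read_uring.c -luring */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        const size_t len = 1 << 20;                    /* 1 MiB read */
        void *buf;
        if (posix_memalign(&buf, 4096, len)) return 1; /* O_DIRECT wants aligned buffers */

        int fd = open("/path/to/data", O_RDONLY | O_DIRECT);  /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        io_uring_queue_init(64, &ring, 0);             /* queue depth 64 */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, len, 0);      /* read len bytes at offset 0 */
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        free(buf);
        return 0;
    }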

replies(1): >>45139208 #
15. jared_hulbert No.45134658
Hmm. tmpfs was slower. hugetlbfs wasn't working for me.
16. nchmy No.45134716{3}
The hockey stick growth is the entire problem - it's an optical illusion resulting from the fact that going from 100 to 200 is the same rate as 200 to 400. And 800, 1600. You understand exponents.

A log axis solves this, turning meaningless hockey sticks into a generally straightish line that you can actually parse. If it still deviates from straight, then you really know there are true changes in the trendline.

Lines on same chart can all be divided by their initial value, anchoring them all at 1. Sometimes they're still a mess, but it's always worth a try.

You're enormously knowledgeable and the posts were fascinating. But this is stats 101. Not doing this sort of thing, especially explicitly in favour of showing a hockey stick, undermines the fantastic analysis.

17. whizzter No.45135337
Like people mention, hugetlb etc. could be an improvement, but the core issue holding it down probably has to do with mmap, 4K pages, and paging behaviour: mmap will cause a fault for each "small" 4K page not in memory, triggering a jump into the kernel and then whatever machinery fills in the page cache (and brings the data up from disk, with the associated latency).

This is in contrast with the io_uring worker method, where you keep the thread busy by submitting requests and letting the kernel do the work without expensive crossings.

The 2GB fully in-mem run shows the CPU's real performance. The dip at 50GB is interesting; perhaps when going over 50% of memory the Linux kernel evicts pages or does something similar that hurts performance. Maybe plot a graph of performance vs test size to see if there is an obvious cliff.

replies(2): >>45141276 #>>45142133 #
18. benlwalker No.45139208{5}
SPDK will be able to fully saturate the PCIe bandwidth from a single CPU core here (no secret 6 threads inside the kernel). The drives are your bottleneck so it won't go faster, but it can use a lot less CPU.

But with SPDK you'll be talking to the disk, not to files. If you changed io_uring to read from the disk directly with O_DIRECT, you wouldn't have those extra 6 threads either. SPDK would still be considerably more CPU efficient but not 6x.

DDIO is a pure hardware feature. Software doesn't need to do anything to support it.

Source: SPDK co-creator

19. pianom4n No.45141276
The in-memory solution creates a second copy of the data, so 50GB doesn't fit in memory anymore. The kernel is forced to drop and then reload part of the cached file.
20. jared_hulbert No.45142133
When I run the 50GB in-mem setup I still have 40GB+ of free memory. I drop the page cache before the run with "sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'", so there wouldn't really be anything to evict from the page cache, and swap isn't changing.

I think I'm crossing the NUMA boundary, which means some percentage of the accesses have higher latency.
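
One way to test the NUMA hypothesis (a sketch, assuming libnuma is installed and that node 0 is the right node; not part of the original benchmark): pin the thread and its allocations to a single node and see whether the 50GB dip goes away.

    /* Build with: gcc pin_node.c -lnuma */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        numa_run_on_node(0);    /* keep this thread on node 0's CPUs */
        numa_set_preferred(0);  /* prefer node 0 for new allocations */

        /* ... run the in-memory scan here and compare against the
           unpinned 50GB numbers ... */
        return 0;
    }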