Most active commenters

jared_hulbert(19)
inetknght(14)
mrlongroots(5)
nine_k(4)
bawolff(3)
wahern(3)
wmf(3)
johncolanduoni(3)

Popular/hot comments

>>45133330 #
>>45134451 #
>>45134728 #
>>45135399 #
>>45136283 #
>>45133595 #
>>45134790 #
>>45138707 #
>>45139067 #

io_uring is faster than mmap

(www.bitflux.ai)

1. jared_hulbert ◴[04 Sep 25 23:18 UTC] No.45133330[source]▶

>>45132710 (OP) #

Cool. Original author here. AMA.

replies(5): >>45133433 #>>45133597 #>>45133666 #>>45133764 #>>45135337 #

2. Jap2-0 ◴[04 Sep 25 23:33 UTC] No.45133433[source]▶

>>45133330 #

Would huge pages help with the mmap case?

replies(2): >>45133546 #>>45133572 #

3. jared_hulbert ◴[04 Sep 25 23:49 UTC] No.45133546{3}[source]▶

>>45133433 #

Oh man... I'd have look into that. Off the top of my head I don't know how you'd make that happen. Way back when I'd have said no. Now with all the folio updates to the Linux kernel memory handling I'm not sure. I think you'd have to take care to make sure the data gets into to page cache as huge pages. If not then when you tried to madvise() or whatever the buffer to use huge pages it would likely just ignore you. In theory it could aggregate the small pages into huge pages but that would be more latency bound work and it's not clear how that impacts the page cache.

But the arm64 systems with 16K or 64K native pages would have fewer faults.

replies(1): >>45133578 #

4. inetknght ◴[04 Sep 25 23:52 UTC] No.45133560[source]▶

>>45132710 (OP) #

Nice write-up with good information, but not the best. Comments below.

Are you using linux? I assume so since stating use of mmap() and mention using EPYC hardware (which counts out macOS). I suppose you could use any other *nix though.

> We'll use a 50GB dataset for most benchmarking here, because when I started this I thought the test system only had 64GB and it stuck.*

So the OS will (or could) prefetch the file into memory. OK.

> Our expectation is that the second run will be faster because the data is already in memory and as everyone knows, memory is fast.*

Indeed.

> We're gonna make it very obvious to the compiler that it's safe to use vector instructions which could process our integers up to 8x faster.

There are even-wider vector instructions by the way. But, you mention another page down:

> NOTE: These are 128-bit vector instructions, but I expected 256-bit. I dug deeper here and found claims that Gen1 EPYC had unoptimized 256-bit instructions. I forced the compiler to use 256-bit instructions and found it was actually slower. Looks like the compiler was smart enough to know that here.

Yup, indeed :)

Also note that AVX2 and/or AVX512 instructions are notorious for causing thermal throttling on certain (older by now?) CPUs.

> Consider how the default mmap() mechanism works, it is a background IO pipeline to transparently fetch the data from disk. When you read the empty buffer from userspace it triggers a fault, the kernel handles the fault by reading the data from the filesystem, which then queues up IO from disk. Unfortunately these legacy mechanisms just aren't set up for serious high performance IO. Note that at 610MB/s it's faster than what a disk SATA can do. On the other hand, it only managed 10% of our disk's potential. Clearly we're going to have to do something else.

In the worst case, that's true. But you can also get the kernel to prefetch the data.

See several of the flags, but if you're doing sequential reading you can use MAP_POPULATE [0] which tells the OS to start prefetching pages.

You also mention 4K page table entries. Page table entries can get to be very expensive in CPU to look up. I had that happen at a previous employer with an 800GB file; most of the CPU was walking page tables. I fixed it by using (MAP_HUGETLB | MAP_HUGE_1GB) [0] which drastically reduces the number of page tables needed to memory map huge files.

Importantly: when the OS realizes that you're accessing the same file a lot, it will just keep that file in memory cache. If you're only mapping it with PROT_READ and PROT_SHARED, then it won't even need to duplicate the physical memory to a new page: it can just re-use existing physical memory with a new process-specific page table entry. This often ends up caching the file on first-access.

I had done some DNA calculations with fairly trivial 4-bit-wide data, each bit representing one of DNA basepairs (ACGT). The calculation was pure bitwise operations: or, and, shift, etc. When I reached the memory bus throughput limit, I decided I was done optimizing. The system had 1.5TB of RAM, so I'd cache the file just by reading it upon boot. Initially caching the file would take 10-15 minutes, but then the calculations would run across the whole 800GB file in about 30 seconds. There were about 2000-4000 DNA samples to calculate three or four times a day. Before all of this was optimized, the daily inputs would take close to 10-16 hours to run. By the time I was done, the server was mostly idle.

[0]: https://www.man7.org/linux/man-pages/man2/mmap.2.html

replies(1): >>45134043 #

5. inetknght ◴[04 Sep 25 23:53 UTC] No.45133572{3}[source]▶

>>45133433 #

> Would huge pages help with the mmap case?

Yes. Tens- or hundreds- of gigabytes of 4K page table entries take a while for the OS to navigate.

6. inetknght ◴[04 Sep 25 23:54 UTC] No.45133578{4}[source]▶

>>45133546 #

> I'd have look into that. Off the top of my head I don't know how you'd make that happen.

Pass these flags to your mmap call: (MAP_HUGETLB | MAP_HUGE_1GB)

replies(1): >>45133626 #

7. hsn915 ◴[04 Sep 25 23:56 UTC] No.45133595[source]▶

>>45132710 (OP) #

Shouldn't this be "io_uring is faster than mmap"?

I guess that would not get much engagement though!

That said, cool write up and experiment.

replies(3): >>45133612 #>>45134589 #>>45136289 #

8. nchmy ◴[04 Sep 25 23:56 UTC] No.45133597[source]▶

>>45133330 #

I just saw this post so am starting with Part 1. Could you replace the charts with ones on some sort of log scale? It makes it look like nothing happened til 2010, but I'd wager its just an optical illusion...

And, even better, put all the lines on the same chart, or at least with the same y axis scale (perhaps make them all relative to their base on the left), so that we can the relative rate of growth?

replies(1): >>45133734 #

9. jared_hulbert ◴[05 Sep 25 00:00 UTC] No.45133612[source]▶

>>45133595 #

Lol. Thanks.

replies(1): >>45135343 #

10. jared_hulbert ◴[05 Sep 25 00:02 UTC] No.45133626{5}[source]▶

>>45133578 #

Would this actually create huge page page cache entries?

replies(1): >>45133675 #

11. titanomachy ◴[05 Sep 25 00:05 UTC] No.45133648[source]▶

>>45132710 (OP) #

Very interesting article, thanks for publishing these tests!

Is the manual loop unrolling really necessary to get vectorized machine code? I would have guessed that the highest optimization levels in LLVM would be able to figure it out from the basic code. That's a very uneducated guess, though.

Also, curious if you tried using the MAP_POPULATE option with mmap. Could that improve the bandwidth of the naive in-memory solution?

> humanity doesn't have the silicon fabs or the power plants to support this for every moron vibe coder out there making an app.

lol. I bet if someone took the time to make a high-quality well-documented fast-IO library based on your io_uring solution, it would get use.

replies(1): >>45134258 #

12. john-h-k ◴[05 Sep 25 00:10 UTC] No.45133666[source]▶

>>45133330 #

You mention modern server CPUs have capability to “read direct to L3, skipping memory”. Can you elaborate on this?

replies(1): >>45133767 #

13. inetknght ◴[05 Sep 25 00:11 UTC] No.45133675{6}[source]▶

>>45133626 #

It's right in the documentation for mmap() [0]! And, from my experience, using it with an 800GB file provided a significant speed-up, so I do believe the documentation is correct ;)

And, you can poke around in the linux kernel's source code to determine how it works. I had a related issue that I ended up digging around to find the answer to: what happens if you use mremap() to expand the mapping and it fails; is the old mapping still valid or not? Answer: it's still valid. I found that it was actually fairly easy to read linux kernel C code, compared to a lot (!) of other C libraries I've tried to understand.

[0]: https://www.man7.org/linux/man-pages/man2/mmap.2.html

14. jared_hulbert ◴[05 Sep 25 00:21 UTC] No.45133734{3}[source]▶

>>45133597 #

I tried with the log scale before. They failed to express the exponential hockey stick growth unless you really spend the time with the charts and know what log scale is. I'll work on incorporating log scale due to popular demand. They do show the progress has been nice and exponential over time.

When I put the lines on the same chart it made the y axis impossible to understand. The units are so different. Maybe I'll revisit that.

Yeah around 2000-2010 the doubling is noticeable. Interestingly it's also when alot of factors started to stagnate.

replies(1): >>45134716 #

15. bawolff ◴[05 Sep 25 00:26 UTC] No.45133765[source]▶

>>45132710 (OP) #

Shouldn't you also compare to mmap with huge page option? My understanding is its presicely meant for this circumstance. I don't think its a fair comparison without it.

Respectfully, the title feels a little clickbaity to me. Both methods are still ultimately reading out of memory, they are just using different i/o methods.

replies(2): >>45134007 #>>45138806 #

16. comradesmith ◴[05 Sep 25 00:26 UTC] No.45133764[source]▶

>>45133330 #

Thanks for the article. What about using file reads from a mounted ramdisk?

replies(1): >>45134658 #

17. jared_hulbert ◴[05 Sep 25 00:27 UTC] No.45133767{3}[source]▶

>>45133666 #

https://www.intel.com/content/www/us/en/io/data-direct-i-o-t...

AMD has something similar.

The PCIe bus and memory bus both originate from the processor or IO die of the "CPU" when you use an NVMe drive you are really just sending it a bunch of structured DMA requests. Normally you are telling the drive to DMA to an address that maps to the memory, so you can direct it cache and bypass sending it out on the DRAM bus.

In theory... the specifics of what is supported exactly? I can't vouch for that.

replies(1): >>45134306 #

18. userbinator ◴[05 Sep 25 00:56 UTC] No.45133940[source]▶

>>45132710 (OP) #

...for sufficiently solid values of "disk" ;-)

replies(1): >>45134075 #

19. jared_hulbert ◴[05 Sep 25 01:06 UTC] No.45134007[source]▶

>>45133765 #

The original blog post title is intentionally clickbaity. You know, to bait people into clicking. Also I do want to challenge people to really think here.

Seeing if the cached file data can be accessed quickly is the point of the experiment. I can't get mmap() to open a file with huge pages.

void* buffer = mmap(NULL, size_bytes, PROT_READ, (MAP_HUGETLB | MAP_HUGE_1GB), fd, 0); doesn't work.

You can can see my code here https://github.com/bitflux-ai/blog_notes. Any ideas?

replies(2): >>45134269 #>>45134410 #

20. jared_hulbert ◴[05 Sep 25 01:10 UTC] No.45134043[source]▶

>>45133560 #

int fd = open(filename, O_RDONLY); void* buffer = mmap(NULL, size_bytes, PROT_READ, (MAP_HUGETLB | MAP_HUGE_1GB), fd, 0);

This doesn't work with a file on my ext4 volume. What am I missing?

replies(2): >>45134429 #>>45140871 #

21. jared_hulbert ◴[05 Sep 25 01:14 UTC] No.45134075[source]▶

>>45133940 #

I worked on SSDs for years. Too many people are suffering from insufficiently solid values of "disk" IMHO.

22. jared_hulbert ◴[05 Sep 25 01:46 UTC] No.45134258[source]▶

>>45133648 #

YES! gcc and clang don't like to optimize this. But they do if you hardcode the size_bytes to an aligned value. It kind of makes sense, what if a user passes size_bytes as 3? With enough effort the compilers could handle this, but it's a lot to ask.

I just ran MAP_POPULATE the results are interesting.

It speeds up the counting loop. Same speed or higher as the my read() to a malloced buffer tests.

HOWEVER... It takes a longer time overall to do the population of the buffer. The end result is it's 2.5 seconds slower to run the full test when compared to the original. I did not guess that one correctly.

time ./count_10_unrolled ./mnt/datafile.bin 53687091200 unrolled loop found 167802249 10s processed at 5.39 GB/s ./count_10_unrolled ./mnt/datafile.bin 53687091200 5.58s user 6.39s system 99% cpu 11.972 total time ./count_10_populate ./mnt/datafile.bin 53687091200 unrolled loop found 167802249 10s processed at 8.99 GB/s ./count_10_populate ./mnt/datafile.bin 53687091200 5.56s user 8.99s system 99% cpu 14.551 total

replies(2): >>45134454 #>>45135200 #

23. mastax ◴[05 Sep 25 01:48 UTC] No.45134269{3}[source]▶

>>45134007 #

MAP_HUGETLB can't be used for mmaping files on disk, it can only be used with MAP_ANONYMOUS, with a memfd, or with a file on a hugetlbfs pseudo-filesystem (which is also in memory).

replies(2): >>45134451 #>>45135606 #

24. josephg ◴[05 Sep 25 01:58 UTC] No.45134306{4}[source]▶

>>45133767 #

I’d be fascinated to see a comparison with SPDK. That bypasses the kernel’s NVMe / SSD driver and controls the whole device from user space - which is supposed to avoid a lot of copies and overhead.

You might be able to set up SPDK to send data directly into the cpu cache? It’s one of those things I’ve wanted to play with for years but honestly I don’t know enough about it.

https://spdk.io/

replies(1): >>45134505 #

25. jandrewrogers ◴[05 Sep 25 02:23 UTC] No.45134410{3}[source]▶

>>45134007 #

Read the man pages, there are restrictions on using the huge page option with mmap() that mean it won’t do what you might intuit it will in many cases. Getting reliable huge page mappings is a bit fussy on Linux. It is easier to control in a direct I/O context.

26. inetknght ◴[05 Sep 25 02:25 UTC] No.45134429{3}[source]▶

>>45134043 #

What issue are you having? Are you receiving an error? This is the kind of question that StackOverflow or perhaps an LLM might be able to help you with. I highly suggest reading the documentation for mmap to understand what issues could happen and/or what a given specific error code might indicate; see the NOTES section:

> Huge page (Huge TLB) mappings

> For mappings that employ huge pages, the requirements for the arguments of mmap() and munmap() differ somewhat from the requirements for mappings that use the native system page size.

> For mmap(), offset must be a multiple of the underlying huge page size. The system automatically aligns length to be a multiple of the underlying huge page size.

Ensure that the file is at least the page size, and preferably sized to align with a page boundary. Then also ensure that the length parameter (size_bytes in your example) is also aligned to a boundary.

There are also other important things to understand for these flags, which are described in the documentation, such as information available from /sys/kernel/mm/hugepages

https://www.man7.org/linux/man-pages/man2/mmap.2.html

replies(1): >>45139116 #

27. inetknght ◴[05 Sep 25 02:28 UTC] No.45134451{4}[source]▶

>>45134269 #

> MAP_HUGETLB can't be used for mmaping files on disk

False. I've successfully used it to memory-map networked files.

replies(4): >>45134599 #>>45134638 #>>45135603 #>>45140875 #

28. mischief6 ◴[05 Sep 25 02:28 UTC] No.45134454{3}[source]▶

>>45134258 #

it could be interesting to see what ispc does with similar code.

29. jared_hulbert ◴[05 Sep 25 02:39 UTC] No.45134505{5}[source]▶

>>45134306 #

spdk and I go way back. I'm confident it'd be about the same, possibly ~200-300MB/s more, I was pretty close to the rated throughput of the drives. Io_uring has really closed the gap that used to exist between the in kernel and userspace solutions.

With the Intel connection they might have explicit support for DDIO. Good idea.

replies(1): >>45139208 #

30. dang ◴[05 Sep 25 02:58 UTC] No.45134589[source]▶

>>45133595 #

Let's use that. Since HN's guidelines say ""Please use the original title, unless it is misleading or linkbait", that "unless" clause seems to kick in here, so I've changed the title above. Thanks!

If anyone can suggest a better title (i.e. more accurate and neutral) we can change it again.

31. loloquwowndueo ◴[05 Sep 25 02:59 UTC] No.45134599{5}[source]▶

>>45134451 #

Share your code?

replies(2): >>45134637 #>>45140890 #

32. inetknght ◴[05 Sep 25 03:07 UTC] No.45134637{6}[source]▶

>>45134599 #

I don't work there any more (it was a decade ago) and I'm pretty busy right now with a new job coming up (offered today).

Do you have kernel documentation that says that hugetlb doesn't work for files? I don't see that stated anywhere.

replies(1): >>45136551 #

33. minitech ◴[05 Sep 25 03:07 UTC] No.45134638{5}[source]▶

>>45134451 #

That doesn’t sound like the intended meaning of “on disk”.

replies(1): >>45134653 #

34. inetknght ◴[05 Sep 25 03:11 UTC] No.45134653{6}[source]▶

>>45134638 #

Kernel doesn't really care about "on disk", it cares about "on filesystem".

The "on disk" distinction is a simplification.

replies(1): >>45134845 #

35. jared_hulbert ◴[05 Sep 25 03:13 UTC] No.45134658{3}[source]▶

>>45133764 #

Hmm. tmpfs was slower. hugetlbfs wasn't working for me.

36. juancn ◴[05 Sep 25 03:14 UTC] No.45134666[source]▶

>>45132710 (OP) #

    Because PCIe bandwidth is higher than memory bandwidth

This doesn't sound right, a PCIe 5.0 x16 slot offers up to 64 GB/s. That's fully saturated, a fairly old Xeon server can sustain >100 GB/s memory reads per numa node without much trouble.

Some newer HBM enabled, like a Xeon Max 9480 can go over 1.6TBs for HBM (up to 64GB) and DDR5 can reach > 300 GB/s.

Even saturating all PCIe lanes (196 on a dual socket Xeon 6), you could at most theoretically get ~784GB/s, which coincidentally is the max memory bandwidth of such CPUs (12 Channels x 8,800 MT/s = 105,600 MT/s total bandwidth or roughly ~784GB/s).

I mean, solid state IO is getting really close, but it's not so fast on non-sequential access patterns.

I agree that many workloads could be shifted to SSDs but it's still quite nuanced.

replies(1): >>45134692 #

37. jared_hulbert ◴[05 Sep 25 03:18 UTC] No.45134692[source]▶

>>45134666 #

Not by a ton but if you add up the DDR5 channel bandwidth and the PCIe lanes most systems the PCIe bandwidth is higher. Yes. HBM and L3 cache will be higher than the PCIe.

38. nchmy ◴[05 Sep 25 03:24 UTC] No.45134716{4}[source]▶

>>45133734 #

The hockey stick growth is the entire problem - it's an optical illusion resulting from the fact that going from 100 to 200 is the same rate as 200 to 400. And 800, 1600. You understand exponents.

Log axis solves this, and turns meaningless hockey sticks into generally a straightish line that you can actually parse. If it still deviates from straight, then you really know there's true changes in the trendline.

Lines on same chart can all be divided by their initial value, anchoring them all at 1. Sometimes they're still a mess, but it's always worth a try.

You're enormously knowledgeable and the posts were fascinating. But this is stats 101. Not doing this sort of thing, especially explicitly in favour of showing a hockey stick, undermines the fantastic analysis.

39. modeless ◴[05 Sep 25 03:25 UTC] No.45134728[source]▶

>>45132710 (OP) #

Wait, PCIe bandwidth is higher than memory bandwidth now? That's bonkers, when did that happen? I haven't been keeping up.

Just looked at the i9-14900k and I guess it's true, but only if you add all the PCIe lanes together. I'm sure there are other chips where it's even more true. Crazy!

replies(4): >>45134749 #>>45134790 #>>45135433 #>>45136511 #

40. adgjlsfhk1 ◴[05 Sep 25 03:30 UTC] No.45134749[source]▶

>>45134728 #

on server chips it's kind of ridiculous. 5th gen Epyc has 128 lanes of PCIEx5 for over 1TB/s of pcie bandwith (compared to 600GB/s RAM bandwidth from 12 channel ddr5 at 6400)

replies(1): >>45134783 #

41. andersa ◴[05 Sep 25 03:40 UTC] No.45134783{3}[source]▶

>>45134749 #

Your math is a bit off. 128 lanes gen5 is 8 times x16, which has a combined theoretical bandwidth of 512GB/s, and more like 440GB/s in practice after protocol overhead.

Unless we are considering both read and write bandwidth, but that seems strange to compare to memory read bandwidth.

replies(2): >>45134856 #>>45135199 #

42. DiabloD3 ◴[05 Sep 25 03:41 UTC] No.45134790[source]▶

>>45134728 #

"No."

DDR5-8000 is 64GB/s per channel. Desktop CPUs have two channels. PCI-E 5.0 in x16 is 64GB/s. Desktops have one x16.

replies(3): >>45134821 #>>45135236 #>>45136770 #

43. themafia ◴[05 Sep 25 03:43 UTC] No.45134800[source]▶

>>45132710 (OP) #

no madvise() call with MADV_SEQUENTIAL?

44. modeless ◴[05 Sep 25 03:48 UTC] No.45134821{3}[source]▶

>>45134790 #

Hmm, Intel specs the max memory bandwidth as 89.6 GB/s. DDR5-8000 would be out of spec. But I guess it's pretty common to run higher specced memory, while you can't overclock PCIe (AFAIK?). Even so I guess my math was wrong, it doesn't quite add up to more than memory bandwidth. But it's pretty darn close!

replies(1): >>45135087 #

45. pclmulqdq ◴[05 Sep 25 03:54 UTC] No.45134845{7}[source]▶

>>45134653 #

The kernel absolutely does care about the "on disk" distinction because it determines what driver to use.

replies(1): >>45135217 #

46. pclmulqdq ◴[05 Sep 25 03:58 UTC] No.45134856{4}[source]▶

>>45134783 #

People like to add read and write bandwidth for some silly reason. Your units are off, too, though: gen 5 is 32 GT/s, meaning 64 GB/s (or 512 gigabits per second) each direction on an x16 link.

replies(1): >>45134869 #

47. andersa ◴[05 Sep 25 04:01 UTC] No.45134869{5}[source]▶

>>45134856 #

I meant for all 128 lanes being used, not each x16. Then you get 512GB/s.

48. lowbloodsugar ◴[05 Sep 25 04:11 UTC] No.45134914[source]▶

>>45132710 (OP) #

Someone who’s read it in more detail, it looks like the uring code is optimized for async, while the mmap code doesn’t do any prefetching so just chokes when the OS has to do work?

replies(1): >>45135160 #

49. DiabloD3 ◴[05 Sep 25 04:52 UTC] No.45135087{4}[source]▶

>>45134821 #

There is a difference between recommended and max achievable.

Zen 5 can hit that (and that's what I run), and Arrow Lake can also.

The recommended from AMD on Zen 4 and 5 is 6000 (or 48x2), for Arrow Lake is 6400 (or 51.2x2); both of them continue increase in performance up to 8000, both of them have extreme trouble going past 8000 and getting a stable machine.

50. wahern ◴[05 Sep 25 05:08 UTC] No.45135160[source]▶

>>45134914 #

My first thought is that what's different here isn't async, per se, but parallelism. io_uring uses a kernel thread pool to service I/O requests, so you actually end up with multiple threads running in parallel handling bookkeeping work. AFAIU, SSD controllers also can service requests in parallel, even if the request stream is serialized. These two sources of parallelism is why the I/O results come back out-of-order.

Generic readahead, which is what the mmap case is relying on, benefits from at least one async thread running in parallel, but I suspect for any particular file you effectively get at most one thread running in parallel to fill the page cache.

What may also be important is the VM management. The splice and vmsplice syscalls came about because someone requested that Linux adopt a FreeBSD optimization--for sufficiently sized write calls (i.e. page size or larger), the OS would mark the page(s) CoW and zero-copy the data to disk or the network. But Linus measured that the cost of fiddling with VM page attributes on each call was too costly and erased most of the zero-copy benefit. So another thing to take note of is that the io_uring case doesn't induce any page faults at all or require any costly VM fiddling (the shared io_uring buffers are installed upfront), whereas in the mmap case there are many page faults and fixups, possibly as many as one for every 4K page. The io_uring case may even result in additional data copies, but with less cost than the VM fiddling, which is even greater now than 20 years ago.

51. wmf ◴[05 Sep 25 05:18 UTC] No.45135199{4}[source]▶

>>45134783 #

PCIe is full duplex while DDR5 is half duplex so in theory PCIe is higher. It's rare to max out PCIe in both directions though.

replies(1): >>45135223 #

52. titanomachy ◴[05 Sep 25 05:18 UTC] No.45135200{3}[source]▶

>>45134258 #

Hmm, I expected some slowdown from POPULATE, but I thought it would still come out ahead. Interesting!

53. ddtaylor ◴[05 Sep 25 05:23 UTC] No.45135217{8}[source]▶

>>45134845 #

The interface is handled by the kernel.

54. mrcode007 ◴[05 Sep 25 05:24 UTC] No.45135223{5}[source]▶

>>45135199 #

happens frequently in fact when training neural nets on modern hw

55. pseudosavant ◴[05 Sep 25 05:28 UTC] No.45135236{3}[source]▶

>>45134790 #

One x16 slot. They'll use PCIe lanes in other slots (x4, x1, M2 SSDs) and also for devices off the chipset (network, USB, etc). The current top AMD/Intel CPUs can do ~100GB/sec over 28 lanes of mostly PCIe 5.

56. pixelpoet ◴[05 Sep 25 05:29 UTC] No.45135241[source]▶

>>45132710 (OP) #

> A few notes for the "um actually" haters commenting on Hacker News

Stay classy; any criticism is of course "hating", right?

The fact that your title is clickbaity and your results suspect should encourage you to get the most accurate picture, not shoot the messenger.

replies(2): >>45136013 #>>45136646 #

57. whizzter ◴[05 Sep 25 05:47 UTC] No.45135337[source]▶

>>45133330 #

Like people mention, hugetlb,etc could be an improvement, but the core issue holding it it down probably has to do with mmap, 4k pages and paging behaviours, mmap will cause faults for each "small" 4k page not in memory, causing a kernel jump and then whatever machinery to fill in the page-cache (and bring up data from disk with the associated latency).

This in contrast with the io_uring worker method where you keep the thread busy by submitting requests and letting the kernel do the work without expensive crossings.

The 2g fully in-mem shows the CPU's real perf, the dip to 50gb is interesting, perhaps when going over 50% memory the Linux kernel evicts pages or something similar that is hurting perf, maybe plot a graph of perf vs test-size to see if there is an obvious cliff.

replies(2): >>45141276 #>>45142133 #

58. MaxikCZ ◴[05 Sep 25 05:48 UTC] No.45135343{3}[source]▶

>>45133612 #

Its not even about clickbait for me, but I really dont want to go parse an article to figure out what is meant by "Memory is slow, Disk is fast". You want "clickbait" to make people click and think, we want descriptive tittles to know what the article is about before we read it. That used to be original purpose of tittles, we like it that way.

Its like as if youd label your food product "you wont believe this", and forced customers to figure what it is from ingredients list.

replies(2): >>45136268 #>>45141749 #

59. ayende ◴[05 Sep 25 05:57 UTC] No.45135399[source]▶

>>45132710 (OP) #

This is wrong, because your mmap code is being stalled for page faults (including soft page faults that you have when the data is in memory, but not mapped to your process).

The io_uring code looks like it is doing all the fetch work in the background (with 6 threads), then just handing the completed buffers to the counter.

Do the same with 6 threads that would first read the first byte on each page and then hand that page section to the counter, you'll find similar performance.

And you can use both madvice / huge pages to control the mmap behavior

replies(4): >>45135629 #>>45138707 #>>45140052 #>>45147766 #

60. AnthonyMouse ◴[05 Sep 25 06:06 UTC] No.45135433[source]▶

>>45134728 #

> Wait, PCIe bandwidth is higher than memory bandwidth now?

Hmm.

Somebody make me a PCIe card with RDIMM slots on it.

replies(1): >>45135538 #

61. thulle ◴[05 Sep 25 06:26 UTC] No.45135538{3}[source]▶

>>45135433 #

https://www.servethehome.com/inventec-96-dimm-cxl-expansion-...

https://www.servethehome.com/micron-cz120-cxl-memory-module-...

62. squirrellous ◴[05 Sep 25 06:38 UTC] No.45135603{5}[source]▶

>>45134451 #

This is quite interesting since I, too, was under the impression that mmap cannot be used on disk-backed files with huge pages. I tried and failed to find any official kernel documentation around this, but I clearly remember trying to do this at work (on a regular ECS machine with Ubuntu) and getting errors.

Based on this SO discussion [1], it is possibly a limitation with popular filesystems like ext4?

If anyone knows more about this, I'd love to know what exactly are the requirements for using hugepages this way.

[1] https://stackoverflow.com/questions/44060678/huge-pages-for-...

replies(2): >>45136878 #>>45140880 #

63. mananaysiempre ◴[05 Sep 25 06:38 UTC] No.45135606{4}[source]▶

>>45134269 #

It looks like there is in theory support for that[1]? But the patches for ext4[2] did not go through.

[1] https://lwn.net/Articles/686690/

[2] https://lwn.net/Articles/718102/

64. lucketone ◴[05 Sep 25 06:45 UTC] No.45135629[source]▶

>>45135399 #

It would seem you summarised whole post.

That’s the point: “mmap” is slow because it is serial.

replies(1): >>45136283 #

65. unwind ◴[05 Sep 25 07:45 UTC] No.45136013[source]▶

>>45135241 #

I can bite (softly) on part of that, since there is C code in the post. :)

This:

    size_t count = 0;
    /// ... code to actually count elided ...
    printf("Found %ld 10s\n", count);

is wrong, since `count` has type `size_t` you should print it using `%zu` which is the dedicated purpose-built formatting code for `size_t` values. Also passing an unsigned value to `%d` which is for (signed) `int` is wrong, too.

The (C17 draft) standard says "If any argument is not the correct type for the corresponding conversion specification, the behavior is undefined" so this is not intended as pointless language-lawyering, it's just that it can be important to get silly details like this right in C.

66. ◴[05 Sep 25 07:56 UTC] No.45136094[source]▶

>>45132710 (OP) #

67. avallach ◴[05 Sep 25 07:58 UTC] No.45136107[source]▶

>>45132710 (OP) #

Maybe I'm misunderstanding, but after reading it sounds to me not like "io_uring is faster than mmap" but "raid0 with 8 SSDs has more throughput than 3 channel DRAM".

replies(1): >>45136255 #

68. nine_k ◴[05 Sep 25 08:23 UTC] No.45136255[source]▶

>>45136107 #

The title has been edited incorrectly. The original page title is "Memory is slow, Disk is fast", and it states exactly what you say: an NVMe RAID can offer more bandwidth than RAM.

replies(1): >>45138212 #

69. robertlagrant ◴[05 Sep 25 08:26 UTC] No.45136268{4}[source]▶

>>45135343 #

> Its like as if youd label your food product "you wont believe this", and forced customers to figure what it is from ingredients list.

Indeed[0].

[0] https://en.wikipedia.org/wiki/I_Can't_Believe_It's_Not_Butte...!

replies(1): >>45143616 #

70. arghwhat ◴[05 Sep 25 08:29 UTC] No.45136283{3}[source]▶

>>45135629 #

mmap isn't "serial", the code that was using the mapping was "serial". The kernel will happily fill different portions of the mapping in parallel if you have multiple threads fault on different pages.

(That doesn't undermine that io_uring and disk access can be fast, but it's comparing a lazy implementation using approach A with a quite optimized one using approach B, which does not make sense.)

replies(4): >>45136633 #>>45136749 #>>45136761 #>>45136970 #

71. nine_k ◴[05 Sep 25 08:29 UTC] No.45136289[source]▶

>>45133595 #

No. "io_uring faster than mmap" is sort of a truism: sequential page faults are slower than carefully orchestrated async I/O. The point of the article is that reading directly from a PCIe device, such as an NVMe flash, can actually be faster than caching things in RAM first.

replies(1): >>45140950 #

72. nextaccountic ◴[05 Sep 25 08:42 UTC] No.45136365[source]▶

>>45132710 (OP) #

The real difference is that with io_uring and O_DIRECT you manage the cache yourself (and can't share with other processes, and the OS can't reclaim the cache automatically if under memory pressure), and with mmap this is managed by the OS.

If Linux had an API to say "manage this buffer you handled me from io_uring as if it were a VFS page cache (and as such it can be shared with other processes, like mmap), if you want it back just call this callback (so I can cleanup my references to it) and you are good to go", then io_uring could really replace mmap.

What Linux has currently is PSI, which lets the OS reclaim memory when needed but doesn't help with the buffer sharing thing

replies(1): >>45140261 #

73. rwmj ◴[05 Sep 25 09:04 UTC] No.45136511[source]▶

>>45134728 #

That's the promise (or requirement?) of CXL - have your memory managed centrally, servers access it over PCIe. https://en.wikipedia.org/wiki/Compute_Express_Link I wonder how many are actually using CXL. I haven't heard of any customers deploying it so far.

74. Sesse__ ◴[05 Sep 25 09:12 UTC] No.45136551{7}[source]▶

>>45134637 #

It's filesystem-dependent. In particular, tmpfs will work. To the best of my knowledge, no “normal” filesystems (e.g., ext4, xfs) will.

replies(1): >>45145898 #

75. nialv7 ◴[05 Sep 25 09:23 UTC] No.45136610[source]▶

>>45132710 (OP) #

Things to try: MADV_SEQUENTIAL, hugetlb, background prefetching thread.

76. amelius ◴[05 Sep 25 09:26 UTC] No.45136633{4}[source]▶

>>45136283 #

OK, so we need a comparison between a multi threaded mmap approach and io_uring. Which would be faster?

replies(1): >>45136885 #

77. menaerus ◴[05 Sep 25 09:28 UTC] No.45136646[source]▶

>>45135241 #

People should just ignore such low-quality material and stop feeding the troll. Information found in both the first and second part of "Memory is slow, disk is fast" series is wrong on so many levels that it isn't worth correcting or commenting. It is obviously written with the help of the AI without actually fact-checking at all and all under the impression that the author is worthwhile which he isn't.

Just look at this bs:

> Early x86 processors took a few clocks to execute most instructions, modern processors have been able parallelize to where they can actually execute 2 instructions every clock.

replies(1): >>45136744 #

78. gpderetta ◴[05 Sep 25 09:43 UTC] No.45136744{3}[source]▶

>>45136646 #

wait, P5 is not modern any more ? :D

replies(1): >>45147885 #

79. immibis ◴[05 Sep 25 09:44 UTC] No.45136749{4}[source]▶

>>45136283 #

How do you do embarrassingly async memory access with mmap?

replies(1): >>45137619 #

80. kragen ◴[05 Sep 25 09:44 UTC] No.45136753[source]▶

>>45132710 (OP) #

This is pretty great. I only learned to use perf_events to see annotated disassembly a few weeks ago, although I don't know how to interpret what I see there yet.

I suspect the slowness identified with mmap() here is somewhat fixable, for example by mapping already-in-RAM pages somewhat more eagerly. So it wouldn't be surprising to me (though see above for how much I'm not an expert) if next year mmap were faster than io_uring again.

replies(1): >>45141994 #

81. ◴[05 Sep 25 09:45 UTC] No.45136761{4}[source]▶

>>45136283 #

82. immibis ◴[05 Sep 25 09:47 UTC] No.45136770{3}[source]▶

>>45134790 #

But my Threadripper has 4 channels of DDR5, and the equivalent of 4.25 x16 PCIe 5.

You know what adds up to an even bigger number though? Using both.

83. bawolff ◴[05 Sep 25 10:06 UTC] No.45136878{6}[source]▶

>>45135603 #

Trying to google this i found https://lwn.net/Articles/718102/ which suggests that there was discussion about it back in 2017. But i can't find anything else about it except a patchset that i guess wasnt merged (?). So maybe it was just a proposal that never made it in.

Honestly i never knew any of this i thought huge pages just worked for all of mmap.

84. nabla9 ◴[05 Sep 25 10:08 UTC] No.45136885{5}[source]▶

>>45136633 #

If the memory access pattern is the same, there are no significant differences.

replies(1): >>45145423 #

85. rafaelmn ◴[05 Sep 25 10:21 UTC] No.45136970{4}[source]▶

>>45136283 #

OK this is not my level of stack for over a decade now, but writing a multithreaded code that will generate the same pagefaults on a shared mmap buffer, as opposed to something that kernel io scheduler will do on your behalf, and presumably try to schedule optimally for your machine/workload - does not sound comparable.

Thats like arguing python is not slower than C++ because you could technically write a specialized AOT compiler for your python code that would generate equivalent assembly so in the end it is the same ?

86. kat529770 ◴[05 Sep 25 11:39 UTC] No.45137420[source]▶

>>45132710 (OP) #

Performance claims aside, the real win with io_uring is how much control it gives you over async I/O without the syscall overhead. mmap’s great for simplicity, but once you hit high-concurrency or multi-buffer use cases, io_uring starts flexing. Anyone benchmarked it with real-world workloads (e.g., DB-backed APIs or log ingestion)?

87. mellosouls ◴[05 Sep 25 11:49 UTC] No.45137468[source]▶

>>45132710 (OP) #

Actual, and more accessible title (did it change?):

Memory is slow, Disk is fast - Part 2

88. worldsavior ◴[05 Sep 25 12:03 UTC] No.45137573[source]▶

>>45132710 (OP) #

Probably off-topic, but could someone tell me what is the current status of io_uring from a security standpoint? It's still disabled on Android and some Linux distros, but it's being said it's safe now.

replies(2): >>45138799 #>>45140562 #

89. lordgilman ◴[05 Sep 25 12:11 UTC] No.45137619{5}[source]▶

>>45136749 #

You dereference a pointer.

replies(1): >>45138631 #

90. kentonv ◴[05 Sep 25 13:17 UTC] No.45138212{3}[source]▶

>>45136255 #

No, the title edit is fair, where the original title is misleading.

Obviously, no matter how you read from disk, it has to go through RAM. Disk bandwidth cannot exceed memory bandwidth.*

But what the article actually tests is a program that uses mmap() to read from page cache, vs. a program that uses io_uring to read directly from disk (with O_DIRECT). You'd think the mmap() program would win, because the data in page cache is already in memory, whereas the io_uring program is explicitly skipping cache and pulling from disk.

However, the io_uring program uses 6 threads to pull from disk, which then feed into one thread that sequentially processes the data. Whereas the program using mmap() uses a single thread for everything. And even though the mmap() is pulling from page cache, that single thread still has to get interrupted by page faults as it reads, because the kernel does not proactively map the pages from cache even if they are available (unless, you know, you tell it to, with madvise() etc., but the test did not). So the mmap() test has one thread that has to keep switching between kernel and userspace and, surprise, that is not as fast as a thread which just stays in userspace while 6 other threads feed it data.

To be fair, the article says all this, if you read it. Other than the title being cheeky it's not hiding anything.

* OK, the article does mention that there exists CPUs which can do I/O directly into L3 cache which could theoretically beat memory bandwidth, but this is not actually something that is tested in the article.

91. daniele_dll ◴[05 Sep 25 13:54 UTC] No.45138624[source]▶

>>45132710 (OP) #

It's always nice to see someone discovering that mmap is not built for performance but for easyness of data-on-disk management with reasonable speed.

92. icedchai ◴[05 Sep 25 13:55 UTC] No.45138631{6}[source]▶

>>45137619 #

From the application perspective, it's not truly async. On a deference, your app may be blocked indefinitely as data is paged into memory. In the early 2000's I worked on systems that made heavy use of mmap. In constrained ("dev") environments with slow disks, you could be blocked for several seconds...

replies(1): >>45138968 #

93. mrlongroots ◴[05 Sep 25 14:02 UTC] No.45138707[source]▶

>>45135399 #

Yes, it doesn't take a benchmark to find out that storage can not be faster than memory.

Even if you had a million SSDs and somehow were able to connect them to a single machine somehow, you would not outperform memory, because the data needs to be read into memory first, and can only then be processed by the CPU.

Basic `perf stat` and minor/major faults should be a first-line diagnostic.

replies(3): >>45139067 #>>45143065 #>>45152315 #

94. p_l ◴[05 Sep 25 14:08 UTC] No.45138799[source]▶

>>45137573 #

It's more that io_uring has not enough integration with LSMs and the like, so for example on Android it's disabled because Android heavily uses SElinux which would be bypassed by io_uring

95. mrlongroots ◴[05 Sep 25 14:09 UTC] No.45138806[source]▶

>>45133765 #

You don't need hugepages for basic 5GB/s sequential scans. I don't know the exact circumstances that would cause TLB pressure, but this is not it.

You can maybe reduce the number of page faults, but you can do that by walking the mapped address space once before the actual benchmark too.

96. jlokier ◴[05 Sep 25 14:23 UTC] No.45138968{7}[source]▶

>>45138631 #

This branch of the discussion is is about dereferencing on multiple threads concurrently. That doesn't block the application, each mmap'd dereference only blocks its own thread (same as doing read()).

In my own measurements with NVMe RAID, doing this works very well on Linux for storage I/O.

I was getting similar performance to io_uring with O_DIRECT, and faster performance when the data is likely to be in the page cache on multiple runs, because the multi-threaded mmap method shares the kernel page cache without copying data.

To measure this, replace the read() calls in the libuv thread pool function with single-byte dereferences, mmap a file, and call a lot of libuv async reads. That will make libuv do the dereferences in its thread pool and return to the main application thread having fetched the relevant pages. Make sure libuv is configured to use enough threads, as it doesn't use enough by default.

replies(1): >>45143826 #

97. alphazard ◴[05 Sep 25 14:32 UTC] No.45139067{3}[source]▶

>>45138707 #

> storage can not be faster than memory

This is an oversimplification. It depends what you mean by memory. It may be true when using NVMe on modern architectures in a consumer use case, but it's not true about computer architecture in general.

External devices can have their memory mapped to virtual memory addresses. There are some network cards that do this for example. The CPU can load from these virtual addresses directly into registers, without needing to make a copy to the general purpose fast-but-volatile memory. In theory a storage device could also be implemented in this way.

replies(3): >>45140329 #>>45143170 #>>45147924 #

98. vlovich123 ◴[05 Sep 25 14:37 UTC] No.45139116{4}[source]▶

>>45134429 #

The most common problem is the system not having any support for hugetable allocated. Don’t have the specific things that need to be configured handy.

99. benlwalker ◴[05 Sep 25 14:45 UTC] No.45139208{6}[source]▶

>>45134505 #

SPDK will be able to fully saturate the PCIe bandwidth from a single CPU core here (no secret 6 threads inside the kernel). The drives are your bottleneck so it won't go faster, but it can use a lot less CPU.

But with SPDK you'll be talking to the disk, not to files. If you changed io_uring to read from the disk directly with O_DIRECT, you wouldn't have those extra 6 threads either. SPDK would still be considerably more CPU efficient but not 6x.

DDIO is a pure hardware feature. Software doesn't need to do anything to support it.

Source: SPDK co-creator

100. arunc ◴[05 Sep 25 15:59 UTC] No.45140052[source]▶

>>45135399 #

Indeed. Use with mmap with MAP_POPULATE which will pre populate.

replies(1): >>45143586 #

101. touisteur ◴[05 Sep 25 16:16 UTC] No.45140261[source]▶

>>45136365 #

Yes Linus has been ranting for decades against O_DIRECT saying similar things (aka better hints on pages and cache usage).

The notorious archive of Linus rants on [0] starts with "The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances". It gets better afterwards, though I'm not clear whether his articulated vision is implemented yet.

[0] https://yarchive.net/comp/linux/o_direct.html

replies(1): >>45141128 #

102. Thaxll ◴[05 Sep 25 16:22 UTC] No.45140329{4}[source]▶

>>45139067 #

I mean even with their memory mapped, the physical access to the memory ( NVME ) will always be slower than the physical access to the RAM, right?

Which external device has memory access as fast or faster than generic RAM?

replies(1): >>45140586 #

103. yencabulator ◴[05 Sep 25 16:38 UTC] No.45140562[source]▶

>>45137573 #

It's a complex shared-memory protocol implemented in C. It'll take a while to settle down fully, and it's still getting new features. If you're targeted by state-level actors, contain or avoid; if not, upgrade regularly.

It's all a question of risk management; for example, Google has historically used container-based sandboxes for their own code (even before Linux containers were a thing), and there an io_uring vulnerability could expose them to attacks by any swdev employee. And for real performance where needed the big boys are bypassing the kernel networking and block I/O stacks anyway (load balancers, ML, ...).

I think the real question to ask is why are you running hostile code outside a dedicated VM? Lots of places will happily give you root inside a VM, and in that context io_uring attacks are irrelevant. That trust boundary is probably just as complex (KVM, virtio, very similar ringbuffers as io_uring really), but the trusted side these days is often Rust and more trustworthy.

For "non-hostile code", frankly other attacks are typically simpler. That's likely the stuff your devs run on their workstations all the time. It likely has direct access to the family jewels and networking at the same time, without needing to use any exploit.

The real fix is to slowly push the industry off of C/C++, and figure out how to use formal methods to reason about shared-memory protocols better. For example, if your "received buffer" abstraction only lets you read every byte exactly once, you can't be vulnerable to TOCTOU. That'd be pretty easy to do safely but the whole reason a shared-memory protocols was used in the first place was performance, and that trade-off is a lot less trivial.

104. usefulcat ◴[05 Sep 25 16:40 UTC] No.45140586{5}[source]▶

>>45140329 #

I have heard that some Intel NICs can put received data directly into L3 cache. That would definitely make it faster to access than if it were in main RAM.

If a NIC can do that over PCI, probably other PCI devices could do the same, at least in theory.

replies(1): >>45145204 #

105. yencabulator ◴[05 Sep 25 16:49 UTC] No.45140699[source]▶

>>45132710 (OP) #

> Memory is slow, Disk is fast - Part 2.

Bullshit clickbait title. More like "naive algorithm is slower than prefetching". Hidden at the end of the article:

  >  Memory is slow - when you use it oldschool.
  >  Disk is fast - when you are clever with it.

The author spent a lot of time benchmarking a thing completely unrelated to the premise of the article. And the only conclusion to be drawn from the benchmark is utterly unsurprising.

---

Linux mmap behavior has two features that can hurt, but this article does not deliver that sermon well. Here's what to worry about with mmap:

- for reads, cache misses are unpredictable, and stall more expensive resources than an outstanding io_uring request - for writes, the atomicity story is very hard to get right, and unpredictable writeback delay stalls more expensive resources than an outstanding io_uring request (very few real world production systems with durability write through mmap; you can use conventional write APIs with read-side mmap)

106. inetknght ◴[05 Sep 25 17:05 UTC] No.45140871{3}[source]▶

>>45134043 #

My bad, don't use `MAP_HUGETLB`, just use `MAP_HUGE_1GB`.

See a quick example I whipped up here: https://github.com/inetknght/mmap-hugetlb

107. inetknght ◴[05 Sep 25 17:05 UTC] No.45140875{5}[source]▶

>>45134451 #

My bad, don't use `MAP_HUGETLB`, just use `MAP_HUGE_1GB`.

See a quick example I whipped up here: https://github.com/inetknght/mmap-hugetlb

replies(2): >>45141621 #>>45142964 #

108. inetknght ◴[05 Sep 25 17:05 UTC] No.45140880{6}[source]▶

>>45135603 #

My bad, don't use `MAP_HUGETLB`, just use `MAP_HUGE_1GB`.

See a quick example I whipped up here: https://github.com/inetknght/mmap-hugetlb

replies(1): >>45147460 #

109. johnisgood ◴[05 Sep 25 17:05 UTC] No.45140882[source]▶

>>45132710 (OP) #

splice / vmsplice with pipes would probably be the fastest for the use case of file -> pipe -> user-space -> CPU processing. Problem is, you eventually have to access user-space memory to count integers, which breaks the zero-copy advantage for this workload. It probably won't beat the already vectorized O_DIRECT method.

What about direct-access hardware with DMA to CPU caches (NUMA-aware / cache-pinned)? PCIe NVMe controllers can use DMA to user-space buffers (O_DIRECT or SPDK) and if buffers are pinned to CPU local caches, then you can avoid main memory latency. It does require SPDK (Storage Performance Development Kit) or similar user-space NVMe drivers, however. IMO this is likely faster than io_uring + O_DIRECT, especially on large datasets, the only problem is that it requires specialized user-space NVMe stacks and careful buffer alignment.

You could also aggressively unroll loops and use prefetch instructions to hide memory latency, i.e. SIMD / AVX512 vectorization with prefetching. The blog post used AVX2 (128/256-bit) but on a modern CPU, AVX512 can process 512 bits (16 integers) per instruction and manual prefetching of data can reduce L1/L2 cache misses. I think this could beat AVX counting from the blog.

As for MAP_HUGETLB, 4KB page faults add overhead with a normal mmap. You want to map large pages (2 MB / 1 GB) which means fewer page table walks and reduced TLB misses reduced, meaning on a 50 GB dataset, using hugepages can reduce kernel overhead and speed up mmap-based counting, but it is still likely slightly slower than O_DIRECT + vectorized counting, so disregard.

TL;DR:

I think the faster practical CPU method is: O_DIRECT + aligned buffer + AVX512 + unrolled counting + NUMA-aware allocation.

Theoretically fastest (requires specialized setup): SPDK / DMA -> CPU caches + AVX512, bypassing main memory.

Mmap or page-cache approaches are neat but always slower for large sequential workloads.

Just my two cents. Thoughts?

110. inetknght ◴[05 Sep 25 17:06 UTC] No.45140890{6}[source]▶

>>45134599 #

My bad, don't use `MAP_HUGETLB`, just use `MAP_HUGE_1GB`.

See a quick example I whipped up here: https://github.com/inetknght/mmap-hugetlb

111. wmf ◴[05 Sep 25 17:11 UTC] No.45140950{3}[source]▶

>>45136289 #

reading directly from a PCIe device, such as an NVMe flash, can actually be faster than caching things in RAM first.

That's not true though, because the PCIe device DMAs into RAM anyway.

replies(1): >>45145587 #

112. jandrewrogers ◴[05 Sep 25 17:26 UTC] No.45141128{3}[source]▶

>>45140261 #

I know people like to post this rant but in this case Linus simply doesn't understand the problem domain. O_DIRECT is commonly used in contexts where the fundamental mechanisms of the kernel cache are inappropriate. It can't be fixed with hints.

As a database example, there are major classes of optimization that require perfect visibility into the state of the entire page cache with virtually no overhead and strict control over every change of state that occurs. O_DIRECT allows you to achieve this. The optimizations are predicated on the impossibility of an external process modifying state. It requires perfect control of the schedule which is invalidated if the kernel borrows part of the page cache. Whether or not the kernel asks nicely doesn't matter, it breaks a design invariant.

The Linus rant is from a long time ago. Given the existence of things like io_uring which explicitly enables this type of behavior almost to the point of encouraging it, Linus may understand the use cases better now.

replies(1): >>45141608 #

113. pianom4n ◴[05 Sep 25 17:38 UTC] No.45141276{3}[source]▶

>>45135337 #

The in-memory solution creates a 2nd copy of the data so 50GB doesn't fit in memory anymore. The kernel is forced to drop and then reload part of the cached file.

114. touisteur ◴[05 Sep 25 18:04 UTC] No.45141608{4}[source]▶

>>45141128 #

I discovered his rant(s) about this recently and indeed thought it was interesting in the light of io_uring. If there's a similar compendium of Linus rants against io_uring I'm interested.

115. jared_hulbert ◴[05 Sep 25 18:05 UTC] No.45141621{6}[source]▶

>>45140875 #

Adding MAP_HUGE_1GB and not MAP_HUGETLB does compile and run for me. Not convinced that its' actually doing anything. Performance is the same.

replies(1): >>45141792 #

116. jared_hulbert ◴[05 Sep 25 18:14 UTC] No.45141749{4}[source]▶

>>45135343 #

I get that. But I do actually show a scenario where accessing data from memory using a very standard mechanism IS slower than a newer but equally standard way of accessing data from an NVMe drive.

"Accessing memory is slower in some circumstances than direct disk access"

117. inetknght ◴[05 Sep 25 18:18 UTC] No.45141792{7}[source]▶

>>45141621 #

Well now that it works, feel free to start poking around at it for a follow-up blog post :)

118. jared_hulbert ◴[05 Sep 25 18:33 UTC] No.45141994[source]▶

>>45136753 #

The io_uring solution avoids this whole effort of mapping. It doesn't have to map the already-in-RAM pages at all. It reuses a small set of buffers. So there is a lot of random cache-miss prone work that mmap() has to do that the io_uring solution avoids. If mmap() does this in the background it would cache up with io_uring. I'd then have to get a couple more drives to get io_uring to catch up. With enough drives I'd bet they'd be closer than you think. I still think I could get the io_uring to be faster than the mmap() even if the count never faulted, mostly because the io_uring has a smaller TLB footprint and can fit in L3 cache. But it'd be tough.

replies(1): >>45143282 #

119. jared_hulbert ◴[05 Sep 25 18:45 UTC] No.45142133{3}[source]▶

>>45135337 #

When I run the 50GB in-mem setup I still have 40GB+ of free memory, I drop the page cache before I run "sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'" there wouldn't really be anything to evict from page cache and swap isn't changing.

I think I'm crossing the numa boundary which means some percentage of the accesses are higher latency.

120. bawolff ◴[05 Sep 25 20:01 UTC] No.45142964{6}[source]▶

>>45140875 #

The mmap man page kind of implies that would be a no-op, but i haven't tested myself.

121. johncolanduoni ◴[05 Sep 25 20:10 UTC] No.45143065{3}[source]▶

>>45138707 #

This was a comparison of two methods of moving data from the VFS to application memory. Depending on cache status this would run the whole gambit of mapping existing memory pages, kernel to userspace memory copies, and actual disk access.

Also, while we’re being annoyingly technical, a lot of server CPUs can DMA straight to the L3 cache so your proof of impossibility is not correct.

replies(1): >>45147872 #

122. johncolanduoni ◴[05 Sep 25 20:19 UTC] No.45143170{4}[source]▶

>>45139067 #

On a modern desktop/server CPU, the RAM memory and PCIe device mapped memory do not share a bus. The equivalence is a fiction maintained by the MMU. Some chips (e.g. Apple Silicon) have unified memory such that RAM is accessible from the CPU and devices (GPU) on a shared bus, but this is a little different.

Also, direct access of device memory is quite slow. High throughput usecases like storage or network have relied entirely on DMA to system RAM from the device for decades.

123. kragen ◴[05 Sep 25 20:33 UTC] No.45143282{3}[source]▶

>>45141994 #

I agree that io_uring is a fundamentally more efficient approach, but I think the performance limits you're currently measuring with mmap() aren't the fundamental ones imposed by the mmap() API, and I think that's what you're saying too?

124. jared_hulbert ◴[05 Sep 25 21:03 UTC] No.45143586{3}[source]▶

>>45140052 #

Someone else suggested this, results are even worse by 2.5s.

125. MaxikCZ ◴[05 Sep 25 21:06 UTC] No.45143616{5}[source]▶

>>45136268 #

Oh god..

126. ozgrakkurt ◴[05 Sep 25 21:27 UTC] No.45143826{8}[source]▶

>>45138968 #

Out of topic but are you able to get performance benefit out of using RAID with NVMe disks?

127. wahern ◴[06 Sep 25 00:10 UTC] No.45145204{6}[source]▶

>>45140586 #

For the curious,

> When a 100G NIC is fully utilized with 64B packets and 20B Ethernet overhead, a new packet arrives every 6.72 nanoseconds on average. If any component on the packet path takes longer than this time to process the individual packet, a packet loss occurs. For a core running at 3GHz, 6.72 nanoseconds only accounts for 20 clock cycles, while the DRAM latency is 5-10 times higher, on average. This is the main bottleneck of the traditional DMA approach.

> The Intel® DDIO technology in Intel® Xeon® processors eliminates this bottleneck. Intel® DDIO technology allows PCIe devices to perform read and write operations directly to and from the L3 cache, or the last level cache (LLC).

https://www.intel.com/content/www/us/en/docs/vtune-profiler/...

replies(1): >>45147879 #

128. jared_hulbert ◴[06 Sep 25 00:39 UTC] No.45145423{6}[source]▶

>>45136885 #

Just ran a version with 6 prefetching threads. I get 5.81GB/s. Same as the io_uring with 2 drives, but still a lot slower than the in memory solution.

129. nine_k ◴[06 Sep 25 01:02 UTC] No.45145587{4}[source]▶

>>45140950 #

No, it can DMA straight into L3 cache, as mentioned in the article.

See https://www.intel.com/content/www/us/en/io/data-direct-i-o-t...

replies(1): >>45146256 #

130. inetknght ◴[06 Sep 25 01:57 UTC] No.45145898{8}[source]▶

>>45136551 #

It works fine on my ext4 fs...

131. wmf ◴[06 Sep 25 03:01 UTC] No.45146256{5}[source]▶

>>45145587 #

Your server doesn't have DDIO turned on though.

replies(1): >>45146577 #

132. nine_k ◴[06 Sep 25 04:19 UTC] No.45146577{6}[source]▶

>>45146256 #

Not mine, but fair!

133. squirrellous ◴[06 Sep 25 07:58 UTC] No.45147460{7}[source]▶

>>45140880 #

Cool! Thanks for the example. The aforementioned work thing requires MAP_SHARED as well which IIRC is the reason it would fail when used together with files and huge pages, but private mappings work as you show.

134. guenthert ◴[06 Sep 25 09:11 UTC] No.45147766[source]▶

>>45135399 #

Well, yes, but isn't one motivation of io_uring to make user space programming simpler and (hence) less error prone? I mean, i/o error handling on mmap isn't exactly trivial.

135. mrlongroots ◴[06 Sep 25 09:35 UTC] No.45147872{4}[source]▶

>>45143065 #

> This was a comparison of two methods of moving data from the VFS to application memory. Depending on cache status this would run the whole gambit of mapping existing memory pages, kernel to userspace memory copies, and actual disk access.

Yes, I think maybe a reasonable statement is that a benchmark is supposed to isolate a meaningful effect. This benchmark was not set up correctly to isolate a meaningful effect IMO.

> Also, while we’re being annoyingly technical, a lot of server CPUs can DMA straight to the L3 cache so your proof of impossibility is not correct.

Interesting, didn't know that, thanks!

I think this does not invalidate the point though. You can temporarily stream directly to the L3 cache with DDIO, but as it fills out the cache will get flushed back to the main memory anyway and you will ultimately be memory-bound. I don't think there is some way to do some non-temporal magic here that circumvents main memory entirely.

replies(1): >>45155285 #

136. mrlongroots ◴[06 Sep 25 09:36 UTC] No.45147879{7}[source]▶

>>45145204 #

From a response elsewhere in this thread (my current understanding, could be wrong):

> You can temporarily stream directly to the L3 cache with DDIO, but as it fills out the cache will get flushed back to the main memory anyway and you will ultimately be memory-bound. I don't think there is some way to do some non-temporal magic here that circumvents main memory entirely.

replies(1): >>45148562 #

137. menaerus ◴[06 Sep 25 09:37 UTC] No.45147885{4}[source]▶

>>45136744 #

Writing software would have been a much easier place if we had CPUs like that today :D

138. mrlongroots ◴[06 Sep 25 09:47 UTC] No.45147924{4}[source]▶

>>45139067 #

> The CPU can load from these virtual addresses directly into registers

This requires that device to bring meaningful amounts of its own memory. GPUs do that with VRAM. A storage device does not come with its own RAM, but interesting point!

139. wahern ◴[06 Sep 25 12:08 UTC] No.45148562{8}[source]▶

>>45147879 #

Ah. I guess it's only useful for an optimized router/firewall or network storage appliance, perhaps with a bespoke stack carefully tuned to quickly process and then hand the data back to the controller (via DDIO) before it flushes.

EDIT: Here's an interesting writeup about trying to make use of it with FreeBSD+netmap+ipfw: https://adrianchadd.blogspot.com/2015/04/intel-ddio-llc-cach... So it can work as advertised, it's just very constraining, requiring careful tuning if not outright rearchitecting your processing pipeline with the constraints in mind.

140. leopoldj ◴[06 Sep 25 13:09 UTC] No.45148978[source]▶

>>45132710 (OP) #

Is the io_uring code faster than mmap because multiple worker threads? If we use mmap from multiple threads will we get almost the same result?

141. hinkley ◴[06 Sep 25 19:47 UTC] No.45152315{3}[source]▶

>>45138707 #

I’m pretty sure that as of PCI-E 2 this is not true.

It’s only true if you need to process the data before passing it on. You can do direct DMA transfers between devices.

In which case one needs to remember that memory isn’t on the CPU. It has to beg for data just about as much as any peripheral. It uses registers and L1, which are behind two other layers of cache and an MMU.

142. johncolanduoni ◴[07 Sep 25 04:01 UTC] No.45155285{5}[source]▶

>>45147872 #

I agree, not a well designed benchmark. The mmap side did not use any of the mechanisms for paging in data before faulting each page, which are table stakes for this usecase.

The point of DDIO is that the data is frequently processed fast enough that it can get pulled from L3 to L1 and be finished with before it needs to be flushed to main memory. For a system with a coherent L3 or appropriate NUMA this isn’t really non-temporal, it’s more like the L3 cache is shared with the device.

↑