    283 points ghuntley | 12 comments
    ayende ◴[] No.45135399[source]
    This is wrong, because your mmap code is being stalled by page faults (including soft page faults, which happen when the data is in memory but not yet mapped into your process).

    The io_uring code looks like it is doing all the fetch work in the background (with 6 threads), then just handing the completed buffers to the counter.

    Do the same with 6 threads that first read the first byte of each page and then hand that page section to the counter, and you'll find similar performance.

    And you can use both madvise and huge pages to control the mmap behavior.
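
    (A minimal sketch of this suggestion in C, assuming the file is already mmap'd: six workers each touch one byte per page of their slice before the serial counting pass, with an optional madvise readahead hint. The thread count, names, and chunking here are illustrative, not the article's code.)

        #include <fcntl.h>
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        #define NUM_THREADS 6

        struct slice { const unsigned char *base; size_t len; size_t page; };

        /* Touch one byte per page so the page faults are taken on this thread. */
        static void *prefault_worker(void *arg) {
            struct slice *s = arg;
            volatile unsigned char sink = 0;
            for (size_t off = 0; off < s->len; off += s->page)
                sink += s->base[off];
            (void)sink;
            return NULL;
        }

        int main(int argc, char **argv) {
            if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
            int fd = open(argv[1], O_RDONLY);
            struct stat st;
            if (fd < 0 || fstat(fd, &st) != 0) { perror(argv[1]); return 1; }

            unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (map == MAP_FAILED) { perror("mmap"); return 1; }

            /* Optional readahead hint; huge pages could be requested similarly where supported. */
            madvise(map, st.st_size, MADV_WILLNEED);

            size_t page = (size_t)sysconf(_SC_PAGESIZE);
            size_t chunk = ((size_t)st.st_size / NUM_THREADS + page) / page * page;

            pthread_t tid[NUM_THREADS];
            struct slice s[NUM_THREADS];
            for (int i = 0; i < NUM_THREADS; i++) {
                size_t start = (size_t)i * chunk;
                if (start > (size_t)st.st_size) start = (size_t)st.st_size;
                size_t len = (size_t)st.st_size - start;
                if (len > chunk) len = chunk;
                s[i] = (struct slice){ map + start, len, page };
                pthread_create(&tid[i], NULL, prefault_worker, &s[i]);
            }
            for (int i = 0; i < NUM_THREADS; i++)
                pthread_join(tid[i], NULL);

            /* The serial counting pass over `map` would run here, now mostly fault-free. */
            munmap(map, st.st_size);
            close(fd);
            return 0;
        }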

    replies(4): >>45135629 #>>45138707 #>>45140052 #>>45147766 #
    1. lucketone ◴[] No.45135629[source]
    It would seem you summarised the whole post.

    That’s the point: “mmap” is slow because it is serial.

    replies(1): >>45136283 #
    2. arghwhat ◴[] No.45136283[source]
    mmap isn't "serial", the code that was using the mapping was "serial". The kernel will happily fill different portions of the mapping in parallel if you have multiple threads fault on different pages.

    (That doesn't undermine that io_uring and disk access can be fast, but it's comparing a lazy implementation using approach A with a quite optimized one using approach B, which does not make sense.)

    replies(4): >>45136633 #>>45136749 #>>45136761 #>>45136970 #
    3. amelius ◴[] No.45136633[source]
    OK, so we need a comparison between a multi threaded mmap approach and io_uring. Which would be faster?
    replies(1): >>45136885 #
    4. immibis ◴[] No.45136749[source]
    How do you do embarrassingly async memory access with mmap?
    replies(1): >>45137619 #
    5. ◴[] No.45136761[source]
    6. nabla9 ◴[] No.45136885{3}[source]
    If the memory access pattern is the same, there are no significant differences.
    replies(1): >>45145423 #
    7. rafaelmn ◴[] No.45136970[source]
    OK, this hasn't been my level of the stack for over a decade, but writing multithreaded code that generates the same page faults on a shared mmap buffer, as opposed to something the kernel I/O scheduler does on your behalf (and presumably tries to schedule optimally for your machine/workload), does not sound comparable.

    That's like arguing Python is not slower than C++ because you could technically write a specialized AOT compiler for your Python code that would generate equivalent assembly, so in the end it is the same?

    8. lordgilman ◴[] No.45137619{3}[source]
    You dereference a pointer.
    replies(1): >>45138631 #
    9. icedchai ◴[] No.45138631{4}[source]
    From the application perspective, it's not truly async. On a dereference, your app may be blocked indefinitely while data is paged into memory. In the early 2000s I worked on systems that made heavy use of mmap. In constrained ("dev") environments with slow disks, you could be blocked for several seconds...
    replies(1): >>45138968 #
    10. jlokier ◴[] No.45138968{5}[source]
    This branch of the discussion is about dereferencing on multiple threads concurrently. That doesn't block the application; each mmap'd dereference only blocks its own thread (the same as doing read()).

    In my own measurements with NVMe RAID, doing this works very well on Linux for storage I/O.

    I was getting similar performance to io_uring with O_DIRECT, and faster performance when the data is likely to be in the page cache on multiple runs, because the multi-threaded mmap method shares the kernel page cache without copying data.

    To measure this, replace the read() calls in the libuv thread pool function with single-byte dereferences, mmap a file, and call a lot of libuv async reads. That will make libuv do the dereferences in its thread pool and return to the main application thread having fetched the relevant pages. Make sure libuv is configured to use enough threads, as it doesn't use enough by default.
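
    (A rough sketch of this measurement idea, using libuv's public uv_queue_work API to fault pages in on the thread pool rather than patching the read() calls inside libuv itself; CHUNK_PAGES, touch_pages, and the chunking are illustrative assumptions. Run with e.g. UV_THREADPOOL_SIZE=16 so the pool is large enough.)

        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>
        #include <uv.h>

        #define CHUNK_PAGES 1024        /* pages faulted per work item (arbitrary) */

        struct chunk { const unsigned char *base; size_t len; size_t page; };

        /* Runs on the libuv thread pool: touch one byte per page of the chunk. */
        static void touch_pages(uv_work_t *req) {
            struct chunk *c = req->data;
            volatile unsigned char sink = 0;
            for (size_t off = 0; off < c->len; off += c->page)
                sink += c->base[off];
            (void)sink;
        }

        /* Runs back on the loop thread once the chunk's pages are resident. */
        static void chunk_done(uv_work_t *req, int status) {
            (void)status;
            free(req->data);
            free(req);
        }

        int main(int argc, char **argv) {
            if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
            int fd = open(argv[1], O_RDONLY);
            struct stat st;
            if (fd < 0 || fstat(fd, &st) != 0) { perror(argv[1]); return 1; }

            unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (map == MAP_FAILED) { perror("mmap"); return 1; }

            size_t page = (size_t)sysconf(_SC_PAGESIZE);
            uv_loop_t *loop = uv_default_loop();

            for (size_t off = 0; off < (size_t)st.st_size; off += CHUNK_PAGES * page) {
                struct chunk *c = malloc(sizeof *c);
                size_t len = (size_t)st.st_size - off;
                c->base = map + off;
                c->len = len < CHUNK_PAGES * page ? len : CHUNK_PAGES * page;
                c->page = page;
                uv_work_t *req = malloc(sizeof *req);
                req->data = c;
                uv_queue_work(loop, req, touch_pages, chunk_done);
            }

            /* Returns when every queued chunk has been faulted in by the pool. */
            uv_run(loop, UV_RUN_DEFAULT);

            /* Process `map` here; its pages are now resident. */
            munmap(map, st.st_size);
            close(fd);
            return 0;
        }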

    replies(1): >>45143826 #
    11. ozgrakkurt ◴[] No.45143826{6}[source]
    Off topic, but are you able to get a performance benefit out of using RAID with NVMe disks?
    12. jared_hulbert ◴[] No.45145423{4}[source]
    Just ran a version with 6 prefetching threads. I get 5.81 GB/s. Same as io_uring with 2 drives, but still a lot slower than the in-memory solution.