62 points hiAndrewQuinn | 2 comments
ctur ◴[] No.44392592[source]
This is an unnecessary optimization, particularly for the article's use case (small files that are read immediately after being written). Just use /tmp. The Linux buffer cache is more than fast enough for casual usage and, indeed, most heavy usage too. Defaulting to /dev/shm makes it far too easy to clog up memory with forgotten files, and those files keep taking memory away from the rest of the system until they are deleted or the machine reboots.
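
A quick way to see whether forgotten files are piling up in there (the one-day threshold below is arbitrary, purely for illustration):

    $ # How much RAM the tmpfs mount behind /dev/shm is currently holding.
    $ df -h /dev/shm
    $ # List anything that has been sitting there for more than a day.
    $ find /dev/shm -type f -mtime +1 -exec ls -lh {} +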

For the author's purposes, any benefit is just placebo.

There absolutely are times when /dev/shm is what you want, but it requires understanding the nuances and tradeoffs (e.g. when you are already thinking carefully about the memory management involved, including swap).
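
If you do go that route, one knob worth knowing about is the tmpfs size= mount option, which puts a ceiling on how much RAM the mount can consume (the 2G figure is just an example; tmpfs allocates pages lazily, so this is a cap, not a reservation):

    $ # Cap /dev/shm at 2 GiB, then confirm the new limit.
    $ sudo mount -o remount,size=2G /dev/shm
    $ df -h /dev/shm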

Don't use -funroll-loops either.

replies(5): >>44392628 #>>44392716 #>>44392880 #>>44393209 #>>44393478 #
1. hiAndrewQuinn ◴[] No.44392716[source]
It's true that with small files my primary interest is simply avoiding unnecessary wear on my disk. However, I also often work with large files, usually for local data processing.

"This optimization [of putting files directly into RAM instead of trusting the buffers] is unnecessary" was an interesting claim, so I decided to put it to the test with `time`.

    $ # Drop any disk caches first.
    $ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    $ 
    $ # Read a 3.5 GB JSON Lines file from disk.
    $ time wc -l /home/andrew/Downloads/kaikki.org-dictionary-Finnish.jsonl 
    255111 /home/andrew/Downloads/kaikki.org-dictionary-Finnish.jsonl

    real 0m2.249s
    user 0m0.048s
    sys 0m0.809s

    $ # Now with caching.
    $ time wc -l /dev/shm/kaikki.org-dictionary-Finnish.jsonl 
    255111 /dev/shm/kaikki.org-dictionary-Finnish.jsonl
    
    real 0m0.528s
    user 0m0.028s
    sys 0m0.500s

    $ 
    $ # Drop caches again, just to be certain.
    $ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    $ 
    $ # Read that same 3.5 GB JSON Lines file from /dev/shm.
    $ time wc -l /dev/shm/kaikki.org-dictionary-Finnish.jsonl 
    255111 /dev/shm/kaikki.org-dictionary-Finnish.jsonl

    real 0m0.453s
    user 0m0.049s
    sys 0m0.404s
Compared to the first read there is indeed a large speedup, from 2.2s down to under 0.5s. After the file had been loaded into the cache from disk by the first `wc --lines`, however, the difference dropped to /dev/shm being about 20% faster. Still significant, but not game-changingly so.
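
For reference, the warm-cache disk read mentioned above can be reproduced along these lines (same paths as in the transcript; exact timings will of course vary):

    $ # Warm the page cache with one read from disk, then time a second pass.
    $ wc -l /home/andrew/Downloads/kaikki.org-dictionary-Finnish.jsonl > /dev/null
    $ time wc -l /home/andrew/Downloads/kaikki.org-dictionary-Finnish.jsonl
    $ # Compare against the copy in /dev/shm.
    $ time wc -l /dev/shm/kaikki.org-dictionary-Finnish.jsonl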

I'll probably come back to this and run more tests with some of the more complex `jq` queries I use, to see whether the gap stays around that 20% mark or gets bigger or smaller.
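
The comparison I have in mind is roughly the following, with the `select` filter and field name standing in for whatever query I actually end up running:

    $ # Time the same illustrative jq pass over the on-disk and tmpfs copies.
    $ time jq -c 'select(.pos == "noun")' \
        /home/andrew/Downloads/kaikki.org-dictionary-Finnish.jsonl > /dev/null
    $ time jq -c 'select(.pos == "noun")' \
        /dev/shm/kaikki.org-dictionary-Finnish.jsonl > /dev/null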

replies(1): >>44393176 #
2. AdieuToLogic ◴[] No.44393176[source]
A couple of things to consider when benchmarking RAM file I/O versus disk-based file system I/O (a quick way to check both is sketched after the list).

1 - Programs such as wc (or jq) do sequential reads, which benefit from file systems optimistically prefetching contents in order to reduce read delays.

2 - Check to see if file access time tracking is enabled for the disk-based file system (see mount(8)). This may explain some of the 20% difference.
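
On a given machine, the checks look roughly like this (the device name and path are placeholders; adjust them for your setup):

    $ # 1: current readahead for the disk, in 512-byte sectors.
    $ sudo blockdev --getra /dev/sda
    $ # 2: mount options for the filesystem holding the file; look for
    $ #    atime / relatime / noatime.
    $ findmnt -T /path/to/your/file -no OPTIONS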