
173 points | daviducolo | 1 comment
burntsushi | No.43334661
This wouldn't build for me, so I had to apply the patch suggested by a sibling comment.

Once I got it building, my first benchmark attempt shows it as being slower:

    $ curl -LO 'https://burntsushi.net/stuff/subtitles2016-sample.en.gz'
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  265M  100  265M    0     0  48.6M      0  0:00:05  0:00:05 --:--:-- 49.9M
    $ gzip -d subtitles2016-sample.en.gz
    $ hyperfine --ignore-failure "rg -c 'ZQZQZQZQ' subtitles2016-sample.en" "krep -c 'ZQZQZQZQ' subtitles2016-sample.en"
    Benchmark 1: rg -c 'ZQZQZQZQ' subtitles2016-sample.en
      Time (mean ± σ):      80.7 ms ±   1.6 ms    [User: 57.7 ms, System: 22.7 ms]
      Range (min … max):    75.3 ms …  83.3 ms    35 runs

      Warning: Ignoring non-zero exit code.

    Benchmark 2: krep -c 'ZQZQZQZQ' subtitles2016-sample.en
      Time (mean ± σ):     122.8 ms ±   1.4 ms    [User: 372.6 ms, System: 24.4 ms]
      Range (min … max):   120.2 ms … 125.5 ms    24 runs

    Summary
      rg -c 'ZQZQZQZQ' subtitles2016-sample.en ran
        1.52 ± 0.03 times faster than krep -c 'ZQZQZQZQ' subtitles2016-sample.en
That's a benchmark with no matches, which is essentially the best case for throughput. Now I want to try a benchmark with a high match frequency:

    $ hyperfine "rg -c 'the' subtitles2016-sample.en" "krep -c 'the' subtitles2016-sample.en"
    Benchmark 1: rg -c 'the' subtitles2016-sample.en
      Time (mean ± σ):     411.8 ms ±   3.6 ms    [User: 389.7 ms, System: 21.1 ms]
      Range (min … max):   404.8 ms … 415.7 ms    10 runs

    Benchmark 2: krep -c 'the' subtitles2016-sample.en
      Time (mean ± σ):     121.2 ms ±   1.9 ms    [User: 364.6 ms, System: 24.9 ms]
      Range (min … max):   113.2 ms … 123.0 ms    24 runs

    Summary
      krep -c 'the' subtitles2016-sample.en ran
        3.40 ± 0.06 times faster than rg -c 'the' subtitles2016-sample.en
Which is very nice. So I decided to poke at it:

    $ krep -c the subtitles2016-sample.en
    Found 29794426 matches
    $ rg -c the subtitles2016-sample.en
    6123710
    $ grep -c the subtitles2016-sample.en
    6123710
The counts are way off here. At first I thought maybe it was counting every occurrence of `the` instead of every matching line, but when I ask ripgrep to do that, it gets a different answer:

    $ rg -oc the subtitles2016-sample.en
    7739791
    $ rg -o the subtitles2016-sample.en | wc -l
    7739791
    $ grep -o the subtitles2016-sample.en | wc -l
    7739791
So I'm not sure what's going on here, but it looks like `krep` might not be giving accurate results.
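To make the distinction concrete, here is a tiny Python illustration (my own toy example, unrelated to either tool's implementation) of counting matching lines versus counting individual occurrences:

```python
# Counting matching lines (like `grep -c`) vs. counting every
# occurrence (like `grep -o ... | wc -l`). These differ whenever
# a single line contains the pattern more than once.
text = "the cat and the dog\nno match on this line\nthe end\n"

matching_lines = sum(1 for line in text.splitlines() if "the" in line)
occurrences = text.count("the")  # non-overlapping occurrences

print(matching_lines)  # 2
print(occurrences)     # 3
```

Neither way of counting produces `krep`'s number above, which is why the result looks inaccurate rather than just differently defined.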

Pushing it a bit more, it seems like it just kind of falls over?

    $ time rg -c 'You read Sherlock Holmes to deduce that\?' subtitles2016-sample.en
    10

    real    0.076
    user    0.049
    sys     0.026
    maxmem  923 MB
    faults  0
    $ time krep -c 'You read Sherlock Holmes to deduce that?' subtitles2016-sample.en
    Found 0 matches

    real    0.935
    user    3.597
    sys     0.029
    maxmem  918 MB
    faults  0
I ran the above benchmarks in `/dev/shm` on Linux with an i9-12900K.

In terms of the approach here, ripgrep is already using a pretty sophisticated substring search algorithm: https://github.com/BurntSushi/memchr?tab=readme-ov-file#algo...
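The core idea from that write-up can be sketched in a few lines of Python (a toy illustration, not memchr's actual code): pick a needle byte that is predicted to be rare, scan for just that byte with a fast routine, and verify the full needle only at candidate positions. `BACKGROUND_FREQ` here is a made-up frequency table for the sake of the example.

```python
# Toy sketch of a "rare byte" prefilter, in the spirit of (but far
# simpler than) memchr's substring search. BACKGROUND_FREQ is a
# hypothetical table of how common each byte value is in typical text.
BACKGROUND_FREQ = [1] * 256
for b in b"etaoin shrdlu":  # mark common English bytes as frequent
    BACKGROUND_FREQ[b] = 100

def rare_byte_find(haystack: bytes, needle: bytes) -> int:
    # Choose the needle offset whose byte is predicted to be rarest.
    rare_i = min(range(len(needle)), key=lambda i: BACKGROUND_FREQ[needle[i]])
    rare = needle[rare_i]
    pos = haystack.find(rare, rare_i)  # stands in for a SIMD memchr
    while pos != -1:
        start = pos - rare_i
        if haystack[start:start + len(needle)] == needle:
            return start
        pos = haystack.find(rare, pos + 1)
    return -1
```

The payoff is that the inner scan runs over a single byte (which real implementations vectorize), and full comparisons only happen at the rare byte's candidate positions.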

And it uses memory maps, but only sometimes, when it predicts they will be fast; it does use them in the single-file case on Linux.

ripgrep also uses parallelism, but at the inter-file level. It sounds like `krep` also uses parallelism, but will use multiple threads when searching a single file. I've considered doing the same in ripgrep, but I haven't done enough experiments (or seen enough from someone else) to be convinced that it's the right way to go in general. It might well edge out single-threaded search in some cases, though.
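As a rough illustration of what single-file parallelism involves (my own toy sketch, not how `krep` or ripgrep work): split the haystack into chunks, have each worker count matches that start inside its chunk, and extend each window so a match straddling a boundary isn't lost. That boundary handling is exactly the kind of detail that's easy to get wrong and could produce miscounts.

```python
from concurrent.futures import ThreadPoolExecutor

def count_in_chunk(text: str, pattern: str, start: int, end: int) -> int:
    # Look slightly past `end` so a match straddling the boundary is
    # visible, but only count matches that *start* inside [start, end)
    # so no match is counted by two chunks. Note this counts every
    # starting position, including overlapping matches.
    window = text[start:min(end + len(pattern) - 1, len(text))]
    n, i = 0, window.find(pattern)
    while i != -1 and start + i < end:
        n += 1
        i = window.find(pattern, i + 1)
    return n

def parallel_count(text: str, pattern: str, workers: int = 4) -> int:
    size = max(1, len(text) // workers)
    bounds = [(i, min(i + size, len(text))) for i in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return sum(ex.map(lambda b: count_in_chunk(text, pattern, *b), bounds))
```

A sanity check for any implementation like this is that the parallel count must match the single-threaded count for every pattern and worker count.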

EDIT: Looking at the timings in https://dev.to/daviducolo/introducing-krep-building-a-high-p..., I see, for example, ripgrep taking >40 seconds to search for the literal pattern `error` in a 5GB file. Even if you're reading from disk (and the OP is using an SSD), that does not seem right at all. Even for an exceptionally common word like `the` in this haystack, ripgrep can chew through a 13GB file in 5 seconds on my machine:

    $ time rg -c the full.txt
    83499915

    real    5.404
    user    5.092
    sys     0.302
    maxmem  12511 MB
    faults  0
Even if I force reading from disk, we get nowhere near 40 seconds:

    $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    $ time rg -c the full.txt
    83499915

    real    10.577
    user    5.191
    sys     2.105
    maxmem  12511 MB
    faults  42
I'm not saying the benchmark results are definitely wrong, but something looks off here that I can't easily explain. OP, can you please share a way to fully reproduce your benchmark? (Like I did above for `subtitles2016-sample.en`.)
replies(2): >>43334854, >>43338802
paulirish | No.43338802
(In case it's not obvious, parent poster is author of ripgrep.)