
173 points daviducolo | 23 comments
1. daviducolo ◴[] No.43333994[source]
You can read my blog post about the project at https://dev.to/daviducolo/introducing-krep-building-a-high-p...
replies(4): >>43335277 #>>43335647 #>>43337748 #>>43339241 #
2. geocar ◴[] No.43335277[source]
Hi David.

    $ (for x in `seq 1 100000`; do echo 'I am a Test Vector HeLlO World '"$x"; done) > /dev/shm/krep_tmp
Best of three runs shown:

    $ time ./krep -i hello /dev/shm/krep_tmp
    Found 43721 matches
    Search completed in 0.0017 seconds (2017.44 MB/s)
    Search details:
      - File size: 3.52 MB
      - Pattern length: 5 characters
      - Using AVX2 acceleration
      - Case-insensitive search
    real        0m0,005s
    user        0m0,001s
    sys         0m0,004s
    $ time ./krep HeLlO /dev/shm/krep_tmp
    Found 82355 matches
    Search completed in 0.0014 seconds (1259.72 MB/s)
    Search details:
      - File size: 1.71 MB
      - Pattern length: 5 characters
      - Using AVX2 acceleration
      - Case-sensitive search
    real        0m0,004s
    user        0m0,003s
    sys         0m0,004s
    $ time ./krep -i "HeLlO World" /dev/shm/krep_tmp
    Found 99958 matches
    Search completed in 0.0021 seconds (1700.54 MB/s)
    Search details:
      - File size: 3.52 MB
      - Pattern length: 11 characters
      - Using AVX2 acceleration
      - Case-insensitive search
    real        0m0,005s
    user        0m0,002s
    sys         0m0,004s
    $ time ./krep "I am a Test Vector HeLlO World" /dev/shm/krep_tmp
    Found 3964 matches
    Search completed in 0.0149 seconds (235.83 MB/s)
    Search details:
      - File size: 3.52 MB
      - Pattern length: 30 characters
      - Using AVX2 acceleration
      - Case-sensitive search
    real        0m0,016s
    user        0m0,015s
    sys         0m0,001s
    $ time ./krep -i "I am a Test Vector hello World" /dev/shm/krep_tmp
    Found 3964 matches
    Search completed in 0.0178 seconds (197.70 MB/s)
    Search details:
      - File size: 3.52 MB
      - Pattern length: 30 characters
      - Using AVX2 acceleration
      - Case-insensitive search
    real        0m0,021s
    user        0m0,017s
    sys         0m0,004s
Benchmark with fgrep (the first run was good enough):

    $ time fgrep -ci hello /dev/shm/krep_tmp
    100000
    real        0m0,003s
    user        0m0,003s
    sys         0m0,000s
    $ time fgrep -ci "I am a Test Vector hello World" /dev/shm/krep_tmp
    100000
    real        0m0,010s
    user        0m0,009s
    sys         0m0,000s
    $ time fgrep -c "I am a Test Vector HeLlO World" /dev/shm/krep_tmp
    100000
    real        0m0,005s
    user        0m0,004s
    sys         0m0,001s
CPU model name: Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz. There's 40 GB of RAM free and 10 cores doing nothing; the shell is pinned with cpuset. Tested on commit 95ed1853b561396c8a8bcbbdd115ed6273848e3f (HEAD -> main, origin/main, origin/HEAD); gcc is 13.3.0-6ubuntu2~24.04.

tl;dr: krep produces obviously wrong results slower than fgrep.

replies(2): >>43335306 #>>43340892 #
3. burntsushi ◴[] No.43335306[source]
Consider using a bigger haystack. Your timings are so short that you're mostly just measuring the overhead of running a process.

This is relevant to krep because it spawns threads to search files (I guess for files over 1MB?).

This does not mean your benchmark is worthless. It just means you can't straightforwardly generalize from it.

replies(3): >>43335857 #>>43336268 #>>43340461 #
4. gwbas1c ◴[] No.43335647[source]
I'm curious: why does krep run faster on large files when multithreaded?

Naively, isn't IO the bottleneck?

I.e., I'd think that loading a file would be slow enough that krep would be IO-bound.

Do you have a typical ratio of IO time to search time on a modern disk and CPU?

What about a producer-consumer model where one thread reads files and creates an in-memory queue of file contents; and a different thread handles the actual searching without pauses for IO?

Edit: If you're truly CPU-bound, another variation of producer-consumer is to have a single thread read files into queues, and then multiple threads searching through files. Each thread would search through a single file at a time. This eliminates the shared memory issue that you allude to with overlap.

replies(1): >>43337888 #
5. globnomulous ◴[] No.43335857{3}[source]
That's a good point. Still, the readme flatly states that krep "is designed with performance as a primary goal," so the lede's generalization that it is "blazingly fast" isn't correct, despite the later, more deeply buried caveat that "Performance may vary based on hardware, file characteristics, and search pattern" (which describes all software). And the comment you answered doesn't say just that krep is "slower" than fgrep; it says krep "produces obviously wrong results" slower.

Edit: and the fact that krep lacks regular-expression support means it's not a replacement for grep or meaningfully comparable with it.

replies(1): >>43335973 #
6. burntsushi ◴[] No.43335973{4}[source]
I try my best to interpret pithy phrases describing a project as first order approximations, rather than literal statements of truth that perfectly generalize. Pithiness is important for communicating ideas quickly, but precision and pithiness are often in tension with one another. So I adjust my expectations accordingly.

Yes, I agree that the wrong results are bad. But that doesn't invalidate my point. I even went out of my way to clarify that the benchmark wasn't worthless. Benchmarking the small input case is absolutely worth it. You just can't tell much about its scaling properties when your measurement is basically "how fast does the process start and stop." Which, again, to be clear, IT MATTERS. It just probably doesn't matter as much as readers might think it matters when they see it.

So treat my comment as adding helpful context, for readers who aren't experts in benchmarking grep tools, from someone experienced in... benchmarking grep tools. :-) (And regexes in general. See: https://github.com/BurntSushi/rebar)

7. fanf2 ◴[] No.43336268{3}[source]
The incorrect results are far more important than the times!
replies(1): >>43337371 #
8. burntsushi ◴[] No.43337371{4}[source]
I agree.
9. MattPalmer1086 ◴[] No.43337748[source]
Interesting. You may be interested in a more modern search algorithm to replace Boyer-Moore. I recently presented HashChain, a very fast sublinear search algorithm, at the Symposium on Experimental Algorithms (SEA).

https://drops.dagstuhl.de/storage/00lipics/lipics-vol301-sea...

It's the fastest sublinear search algorithm around in almost all cases. I also have a guaranteed worst-case linear version (which is still sublinear and much faster than Boyer-Moore in the average case).

Sample code is available here:

https://github.com/nishihatapalmer/HashChain

If you're interested, let me know.
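For readers unfamiliar with this family of algorithms, the classic Boyer-Moore-Horspool skip loop below is the sublinear baseline that newer algorithms like HashChain improve on (as I understand it, HashChain generalizes the shift idea to hashes of q-grams). This sketch is illustrative and not taken from the HashChain repo:

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: returns the index of the first occurrence of
 * pat in text, or -1. "Sublinear" because the skip table lets it jump
 * past most of the text without examining every byte. */
long horspool(const unsigned char *text, size_t n,
              const unsigned char *pat, size_t m) {
    if (m == 0 || m > n)
        return -1;

    /* Shift table: distance to skip, keyed by the last byte of the
     * current window. Bytes not in the pattern allow a full-length skip. */
    size_t shift[UCHAR_MAX + 1];
    for (size_t i = 0; i <= UCHAR_MAX; i++)
        shift[i] = m;
    for (size_t i = 0; i + 1 < m; i++)
        shift[pat[i]] = m - 1 - i;      /* rightmost occurrence wins */

    size_t pos = 0;
    while (pos + m <= n) {
        if (memcmp(text + pos, pat, m) == 0)
            return (long)pos;
        pos += shift[text[pos + m - 1]];  /* skip by the last window byte */
    }
    return -1;
}
```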

replies(1): >>43342542 #
10. lainzhow ◴[] No.43337888[source]
I didn't read the source, but the description says it uses memory mapping. So my guess is that IO isn't much of an issue: prefetching can hide the latency if you can memory-map a large enough segment of the file.

If the statement about prefetching is true, though, I wonder how the prefetching avoids being bamboozled by multiple threads accessing the file.
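A minimal sketch of what that mmap-plus-prefetch path might look like. This is not krep's actual code: `count_in_file` is a hypothetical helper, `madvise(MADV_SEQUENTIAL)` is one way to request aggressive read-ahead, and `memmem` is a GNU/BSD extension standing in for a real search routine.

```c
#define _GNU_SOURCE   /* for memmem */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file and count occurrences of needle. Returns -1 on error. */
long count_in_file(const char *path, const char *needle) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0)    { close(fd); return 0;  }

    char *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;

    /* Hint the kernel to read ahead; per-thread chunks could instead
     * issue MADV_WILLNEED on their own ranges. */
    madvise(map, (size_t)st.st_size, MADV_SEQUENTIAL);

    long n = 0;
    size_t nlen = strlen(needle);
    for (char *p = map, *end = map + st.st_size;
         (p = memmem(p, (size_t)(end - p), needle, nlen)) != NULL; p++)
        n++;

    munmap(map, (size_t)st.st_size);
    return n;
}
```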

replies(1): >>43338578 #
11. gwbas1c ◴[] No.43338578{3}[source]
I forgot about memory mapping.

In that case it probably makes more sense to have a shared queue of files, with each thread handling a single file at a time. That would avoid the overlap issue.

12. rafram ◴[] No.43339241[source]
Was this entire blog post written by AI too? Honestly, this is an impressive feat if so, but the fake benchmarks are really not a good look.
13. gigatexal ◴[] No.43340507{4}[source]
Give the author some grace. Sheesh. You sound like a toxic senior dev.

What about opening an issue with the incorrect results?

replies(1): >>43340806 #
14. geocar ◴[] No.43340806{5}[source]
> What about opening an issue with the incorrect results

I don't work for you.

> You sound like a toxic senior dev

I'm probably much nicer to the people I work with, but that's because an enormous amount of money is involved.

You want to have that kind of a conversation, I'll consider letting you tell me what to do, but you need to understand the service I'm providing for free doesn't come with that.

People don't like being told what to do. They are probably going to be in a bad mood if you waste their time with a bold claim that is so obviously false. You are lucky to have someone tell you that the software doesn't work, because you obviously wouldn't know otherwise.

Best of luck.

replies(1): >>43359692 #
15. ◴[] No.43340892[source]
16. rurban ◴[] No.43342542[source]
A benchmark of all good string-search algos is here: https://rurban.github.io/smart/results/best20/

The ones the parent picked are not among the best at all. I added a ticket to have your HashChain algorithms tested as well.

Preliminary testing indeed shows excellent numbers for hc2-hc4; it comes close to musl.

replies(3): >>43343045 #>>43345885 #>>43347251 #
17. MattPalmer1086 ◴[] No.43343045{3}[source]
Amazing, thanks. The sample code is already written to integrate with smart (I have also contributed to the smart tool!).

You may want to add the linear version of HashChain too (in the same GitHub repo).

18. MattPalmer1086 ◴[] No.43345885{3}[source]
Looks like musl is using pointer arithmetic?

You can speed up most of the algorithms in smart by doing that; it's generally faster than array indexing. I didn't do it in the HashChain sample code simply because it would have been an unfair comparison with the others.
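For concreteness, the two loops below do the same byte count; the pointer-walking form avoids recomputing `base + index` on every iteration. (Modern optimizers often generate identical code for both, so in practice it helps mainly in tight search inner loops where the compiler doesn't manage the transformation itself.)

```c
#include <stddef.h>

/* Array-indexed byte count. */
long count_byte_indexed(const unsigned char *buf, size_t n, unsigned char c) {
    long count = 0;
    for (size_t i = 0; i < n; i++)
        if (buf[i] == c)
            count++;
    return count;
}

/* Pointer-walking equivalent: the loop variable IS the address. */
long count_byte_ptr(const unsigned char *buf, size_t n, unsigned char c) {
    long count = 0;
    for (const unsigned char *p = buf, *end = buf + n; p < end; p++)
        if (*p == c)
            count++;
    return count;
}
```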

19. MattPalmer1086 ◴[] No.43347251{3}[source]
The musl timings are actually incorrect: the benchmark resets the timer on each iteration of the main loop used to count matches, so you only record the time of the last match, not the total time. I've raised an issue on the smart GitHub repo.
replies(1): >>43347920 #
20. rurban ◴[] No.43347920{4}[source]
Oh, oh, thanks!
replies(1): >>43352002 #
21. MattPalmer1086 ◴[] No.43352002{5}[source]
I was surprised that an algorithm from 1991 was the fastest! My first rule of benchmarking is to be suspicious of surprises!
replies(1): >>43390773 #
22. gigatexal ◴[] No.43359692{6}[source]
Oh well it’s nice to know that if we work together you’ll be nice since you’re being paid to be.

But do you treat others outside of work like this? What if your partner makes a mistake? Your kid thinks they’ve done something awesome but have a huge flaw in whatever it is?

> They are probably going to be in a bad mood if you waste their time with a bold claim that is so obviously false.

Did you review this on company time? On your own time? How is your time being wasted giving feedback to another developer? You’re improving the community of developers. You don’t see the benefit of that?

I wonder how you tolerate failure or weakness in others. Not well, I suspect, but what do I care, right? I'm not in the business of helping anyone better themselves or anything ;-)

23. rurban ◴[] No.43390773{6}[source]
Yes, I updated the benchmarks of the best string search algos.