Naively, isn't IO the bottleneck?
I.e., I'd think that loading a file would be slow enough that krep would be IO-bound?
Do you have a typical ratio of IO time to search time on a modern disk and CPU?
What about a producer-consumer model where one thread reads files and pushes their contents onto an in-memory queue, and a different thread handles the actual searching without pausing for IO?
Edit: If you're truly CPU-bound, another variation of producer-consumer is to have a single thread read files into queues and multiple threads search through those files. Each thread would search through a single file at a time. Because no file is split across threads, this eliminates the shared-memory issue that you allude to with overlap.
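To make the idea concrete, here's a rough sketch of the single-reader, multiple-searcher layout in Python. It's only meant to show the structure, not how krep actually works: `search_files`, `num_searchers`, and `max_queued` are names I made up, and the "search" is just a plain substring count.

```python
import queue
import threading

_DONE = object()  # sentinel: tells a searcher that no more files are coming


def search_files(paths, pattern, num_searchers=4, max_queued=8):
    """Count non-overlapping occurrences of `pattern` (bytes) in each file.

    One reader thread does all the IO; `num_searchers` threads do the matching.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    # Bounded queue so the reader cannot get arbitrarily far ahead of the searchers.
    work = queue.Queue(maxsize=max_queued)
    counts = {}
    counts_lock = threading.Lock()

    def reader():
        try:
            for path in paths:
                with open(path, "rb") as f:
                    work.put((path, f.read()))
        finally:
            # Always release the searchers, even if a read failed.
            for _ in range(num_searchers):
                work.put(_DONE)

    def searcher():
        while True:
            item = work.get()
            if item is _DONE:
                return
            path, data = item
            # Each searcher owns a whole file, so there is no chunk overlap to handle.
            n = data.count(pattern)
            with counts_lock:
                counts[path] = n

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=searcher) for _ in range(num_searchers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Calling `search_files(["a.log", "b.log"], b"error")` would return a dict mapping each path to its match count. In CPython the GIL probably keeps the searchers from running truly in parallel, so treat this as a picture of the queueing structure rather than a benchmark; in C the same shape would be a reader thread plus a worker pool.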
If the statement about prefetching is true, though, I wonder how the prefetching isn't bamboozled by multiple threads accessing the file.