Understanding the BM25 full text search algorithm

1. hubraumhugo ◴[20 Nov 24 11:58 UTC] No.42193073[source]▶

>>42190650 (OP) #

Given the recent advances in vector-based semantic search, what's the SOTA search stack that people are using for hybrid keyword + semantic search these days?

replies(7): >>42193208 #>>42193787 #>>42193816 #>>42193909 #>>42193922 #>>42193932 #>>42194089 #

2. emschwartz ◴[20 Nov 24 12:15 UTC] No.42193208[source]▶

>>42193073 (TP) #

Most of the commercial and open source offerings for hybrid search seem to be using BM25 + vector similarity search based on embeddings. The results are combined using Reciprocal Rank Fusion (RRF).

The RRF paper is impressive in how incredibly simple it is (the paper is only 2 pages): https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf
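
For readers who have not seen it, the fusion step described above can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation: the function name `reciprocal_rank_fusion` and the document IDs are made up for the example, and `k=60` is the constant used in the RRF paper.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Documents missing from a list get nothing from it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by doc ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from a BM25 index and a vector index.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Only ranks are used, never the raw scores, which is why RRF works even when BM25 scores and cosine similarities are on incomparable scales.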

replies(2): >>42193625 #>>42195796 #

3. d4rkp4ttern ◴[20 Nov 24 13:32 UTC] No.42193787[source]▶

>>42193073 (TP) #

In the Langroid[1] LLM library we have a clean, extensible RAG implementation in the DocChatAgent[2] -- it uses several retrieval techniques, both lexical (bm25, fuzzy search) and semantic (embeddings), plus re-ranking (cross-encoder, reciprocal rank fusion) and further re-ranking for diversity and lost-in-the-middle mitigation:

[1] Langroid - a multi-agent LLM framework from CMU/UW-Madison researchers https://github.com/langroid/langroid

[2] DocChatAgent Implementation - https://github.com/langroid/langroid/blob/main/langroid/agen...

Start with the answer_from_docs method and follow the trail.

Incidentally, I see you're the founder of Kadoa -- Kadoa-snack is one of my favorite daily tools for finding LLM-related HN discussions!

4. noduerme ◴[20 Nov 24 13:36 UTC] No.42193816[source]▶

>>42193073 (TP) #

A generic search strategy is so different from something you want to target. The task should probably determine the tool.

So I don't know the answer, but I was recently handed about 3 million surveys with 10 free-form writing fields each, and tasked with surfacing the ones that might require action on the part of the company. I chose to use a couple of different small classifier models, manually strip out some common words based on obvious noise in the first 10k results, and then weight the model responses. It turned out to be almost flawless. I would NOT call this sort of thing "programming", it's more just tweaking the black-box output of various different tools until you have a set of results that looks good for your test cases. (And your client ;)

All done by stitching together small Hugging Face models running on a tiny server in nodejs, btw.

replies(2): >>42198653 #>>42220459 #

5. treprinum ◴[20 Nov 24 13:48 UTC] No.42193909[source]▶

>>42193073 (TP) #

text-embedding-3-large + SPLADE + RRF

6. khaki54 ◴[20 Nov 24 13:51 UTC] No.42193922[source]▶

>>42193073 (TP) #

We're doing something like BM25 with a semantic, ontology-enhanced query (naive example: a search for "truck" hits on "Ford F-150", even if "truck" never appears in the doc), then vector-based reranking. In testing, we always get the best result in the top 3.
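
A rough sketch of what ontology-based query expansion could look like, following the truck example above. Everything here is an assumption for illustration: the `ONTOLOGY` mapping, the `expand_query` helper, and the whitespace tokenizer are all hypothetical, and a real system would feed the expanded terms into a BM25 index rather than do a set intersection.

```python
# Hypothetical ontology: a broad term maps to narrower terms that imply it.
ONTOLOGY = {
    "truck": ["ford", "f-150"],
}

def expand_query(query, ontology):
    """Return the query terms plus any related terms from the ontology."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for related in ontology.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

def matching_terms(expanded_terms, document):
    """Naive lexical match: which expanded terms occur in the document."""
    doc_tokens = set(document.lower().split())
    return [t for t in expanded_terms if t in doc_tokens]

expanded = expand_query("truck", ONTOLOGY)
hits = matching_terms(expanded, "Ford F-150 for sale")
print(expanded, hits)
```

The document never contains "truck", but the expanded query still matches it lexically, so a keyword ranker can retrieve it before the vector-based reranking step.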

7. dmezzetti ◴[20 Nov 24 13:52 UTC] No.42193932[source]▶

>>42193073 (TP) #

Excellent article on BM25!

Author of txtai [1] here. txtai implements a performant BM25 index in Python [2] via the arrays package and storing the term frequency vectors in SQLite.

With txtai, the hybrid index approach [3] supports both convex combination when BM25 scores are normalized and reciprocal rank fusion (RRF) when they aren't [4].

[1] https://github.com/neuml/txtai

[2] https://neuml.hashnode.dev/building-an-efficient-sparse-keyw...

[3] https://neuml.hashnode.dev/benefits-of-hybrid-search

[4] https://github.com/neuml/txtai/blob/master/src/python/txtai/...
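
To make the convex-combination option concrete, here is a minimal sketch. It is not txtai's code: the helper names, the min-max normalization, and the `alpha` weight are assumptions chosen for illustration, and the dense scores are assumed to already lie in [0, 1].

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as a full match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def convex_combination(sparse_scores, dense_scores, alpha=0.5):
    """Blend normalized BM25 scores with dense scores.

    combined = alpha * dense + (1 - alpha) * normalized_sparse
    A document missing from one side contributes 0 for that side.
    """
    sparse_norm = min_max_normalize(sparse_scores)
    doc_ids = set(sparse_norm) | set(dense_scores)
    combined = {
        d: alpha * dense_scores.get(d, 0.0) + (1 - alpha) * sparse_norm.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical scores: raw BM25 (unbounded) and cosine similarity (0..1).
bm25 = {"a": 12.0, "b": 7.0, "c": 2.0}
dense = {"a": 0.2, "b": 0.9, "d": 0.6}

results = convex_combination(bm25, dense, alpha=0.5)
print(results)
```

Unlike RRF, this uses the score magnitudes, so it only makes sense once the BM25 scores have been brought onto a scale comparable to the dense scores.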

8. softwaredoug ◴[20 Nov 24 14:15 UTC] No.42194089[source]▶

>>42193073 (TP) #

My opinion is that people should not focus on one stack, but be prepared to use the best tool for each job: Elasticsearch for BM25-type things, Turbopuffer for simple and fast vector retrieval, even Redis to precompute results for certain queries or for certain extremely dynamic attributes that change frequently, like price. Combine all of these in a scatter/gather approach.

I say that because almost always you have a layer outside the search stack(s) that ideally can just be a straightforward inference service for reranking that looks most like other ML infra.

You also almost always route queries to different backends based on an understanding of the user's query, for example routing “lookup by ID” to a different system than “fuzzy semantic search”. These are very different data structures. And search almost always covers very broad/different use cases.

I think it’s an anti-pattern to just push all work to one system. Each system is ideal for different workloads. And their inference capabilities won’t ever keep pace with the general ML tooling that your ML engineers are used to. (I tried with Elasticsearch Learning to Rank and it's a hopeless task.)

(That said, Vespa is probably the best 'single stack' that tries to solve a broad range of use-cases.)
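
As a toy illustration of the query routing idea above, a router can be as small as a few rules placed in front of the backends. The ID pattern, the three-word threshold, and the backend names are all hypothetical; a production router would more likely use a trained query classifier.

```python
import re

# Hypothetical ID format, e.g. "SKU-12345". Adjust to your own identifiers.
ID_PATTERN = re.compile(r"^[A-Z]{2,5}-\d+$")

def route_query(query):
    """Pick a backend name for a query using simple heuristics."""
    q = query.strip()
    if ID_PATTERN.match(q):
        return "id_lookup"   # e.g. a key-value store
    if len(q.split()) <= 3:
        return "keyword"     # e.g. a BM25 index
    return "semantic"        # e.g. a vector index

for example in ["SKU-12345", "red shoes", "something comfortable to wear on a long hike"]:
    print(example, "->", route_query(example))
```

The point is only that the routing decision lives outside any single search system, so each backend can stay specialized for its workload.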

9. softwaredoug ◴[20 Nov 24 16:51 UTC] No.42195796[source]▶

>>42193208 #

A warning that RRF is often not enough, as it can just drag a good solution down towards the worse solution :)

https://softwaredoug.com/blog/2024/11/03/rrf-is-not-enough
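
A small worked example of that failure mode, using made-up rankings rather than anything taken from the linked post: one retriever ranks the relevant document first, the other misses it entirely, and the fused list pushes it down. The `rrf` helper and `k=60` are the usual textbook form, assumed here for illustration.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal rank fusion: score = sum of 1 / (k + rank) across lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical case: the first retriever is right, the second misses the answer.
good_ranking = ["relevant", "x", "y", "z"]
weak_ranking = ["x", "y", "z", "w"]

fused = rrf([good_ranking, weak_ranking])
print(fused)
print("relevant doc moved from position 1 to position", fused.index("relevant") + 1)
```

Documents that appear in both lists collect two contributions, so mediocre results that both retrievers agree on can outrank a result that only the better retriever found.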

replies(1): >>42196484 #

10. emschwartz ◴[20 Nov 24 17:54 UTC] No.42196484{3}[source]▶

>>42195796 #

Ah, that's great! Thanks for sharing that.

I had actually implemented full text search + vector search using RRF but I kept it disabled by default because it wasn't meaningfully improving my results. This seems like a good hypothesis as to why.

11. keeeba ◴[20 Nov 24 22:12 UTC] No.42198653[source]▶

>>42193816 #

Nice, I also find small classifiers work best for things like this. Out of interest, how many, if any, of the 3 million were labelled?

Did you end up labelling any/more, or distilling from a generative model?

12. BOOSTERHIDROGEN ◴[23 Nov 24 11:55 UTC] No.42220459[source]▶

>>42193816 #

A blog post would be great.