Understanding the BM25 full text search algorithm

https://github.com/jankovicsandras/plpgsql_bm25

Shameless plug:

https://github.com/jankovicsandras/bm25opt

replies(2): >>42192810 #>>42194312 #

2. jll29 ◴[20 Nov 24 07:29 UTC] No.42191546[source]▶

Nice write-up.

A few more details/background that are harder to find: "BM25" stands for "Best Matching 25", "best matching" because it is a formula for ranking and term weighting (the matching refers to the term in the query versus the document), and the number 25 simply indicates a running number (there were 24 earlier formula variants and some later ones, but #25 turned out to work best, so it was the one that was published).

It was conceived by Stephen Robertson and Karen Spärck Jones (the latter of IDF fame) and first implemented in the former's OKAPI information retrieval (research) system. The OKAPI system was benchmarked at the annual US NIST TREC (Text Retrieval Conference) for a number of years, the international "World Championship" of search engine methods (although the event is not about winning, but about comparing notes and learning from each other, a highly recommended annual event held every November in Gaithersburg, Maryland, attended by global academic and industry teams that conduct research on improving search - see trec.nist.gov).

Besides the "bag of words" Vector Space Model (sparse vectors of terms) and the Probabilistic Models (to which BM25 belongs), there is a surprising and still growing number of other theoretical frameworks for how to rank a set of documents, given a query ("Divergence from Randomness", "Statistical Language Modeling", "Learning to Rank", "Quantum Information Retrieval", "Neural Ranking" etc.). Conferences like ICTIR and SIGIR still occasionally publish entirely new paradigms for search. Note that the "Statistical Language Modeling" paradigm is not about Large Language Models that are in vogue now (that's covered under the "Neural Retrieval" umbrella), and that searching for "Quantum IR" is not going to get you to a tutorial about Quantum Information Retrieval but to methods of infrared spectroscopy or a company with the same name that produces cement; such are the intricacies of search technology, even in the 21st century.

If you want to play with BM25 and compare it with some of the alternatives, I recommend the research platform Terrier, an open-source search engine developed at the University of Glasgow (today, perhaps the epicenter of search research).

BM25 is over a quarter century old, but has proven to be a hard baseline to beat (it is still often used as a reference point for comparing new methods against), and a more recent variant, BM24F, can deal with multiple fields and hypertext (e.g. title, body of documents, hyperlinks).

The recommended paper to read is: Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management 36(6): 779–808, and its successor, Part 2. (Sadly they are not open access.)

replies(2): >>42191965 #>>42196169 #

3. sidcool ◴[20 Nov 24 08:32 UTC] No.42191844[source]▶

Good article. I am genuinely interested to learn about how to think of problems in such a mathematical form. And how to test it. Any resources?

4. marcyb5st ◴[20 Nov 24 08:51 UTC] No.42191965[source]▶

>>42191546 #

Thanks for sharing!

Do you have more information about BM24F? Googling (also Google Scholar) didn't yield anything related. Thanks in advance!

replies(1): >>42192073 #

5. bradleyjkemp ◴[20 Nov 24 09:12 UTC] No.42192073{3}[source]▶

>>42191965 #

A typo I think, should be BM25F. From Wikipedia:

> BM25F (or the BM25 model with Extension to Multiple Weighted Fields) is a modification of BM25 in which the document is considered to be composed from several fields (such as headlines, main text, anchor text) https://en.wikipedia.org/wiki/Okapi_BM25

Some papers are linked in the references

replies(1): >>42193216 #

6. RA_Fisher ◴[20 Nov 24 10:47 UTC] No.42192651[source]▶

BM25 is an ancient algo developed in the 1970s. It’s basically a crappy statistical model and statisticians can do far better today. Search is strictly dominated by learning (that yes, can use search as an input). Not many folks realize that yet, and / or are incentivized to keep the old tech going as long as possible, but market pressures will change that.

replies(4): >>42192735 #>>42192805 #>>42192828 #>>42194229 #

7. netdur ◴[20 Nov 24 11:03 UTC] No.42192735[source]▶

While BM25 did emerge from earlier work in the 1970s and 1980s (specifically building on the probabilistic ranking principle), I'm curious about your perspective on a few things:

What specific modern statistical approaches are you seeing as superior replacements for BM25 in practical applications? I'm particularly interested in how they handle edge cases like rare terms and document length normalization that BM25 was explicitly designed to address.

While I agree learning-based approaches have shown impressive results, could you elaborate on what you mean by search being "strictly dominated" by learning methods? Are you referring to specific benchmarks or real-world applications?

replies(1): >>42193439 #
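For readers following along, the two edge cases raised in this comment (rare terms and document length normalization) correspond to two separate pieces of the BM25 formula. Below is a minimal sketch, assuming whitespace tokenization, the commonly used defaults k1=1.2 and b=0.75, and the non-negative IDF variant used by Lucene. The function name `bm25_scores` and the toy documents are illustrative and do not come from the article.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms (low df) receive a larger IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Toy corpus (illustrative only).
docs = [
    "the cat sat on the mat".split(),
    "the quick brown fox".split(),
    "the cat chased the quick cat around the very large and very old house".split(),
]
scores = bm25_scores(["cat", "fox"], docs)
```

In this toy corpus the second document should score highest: "fox" occurs in only one document, so its IDF outweighs the more common "cat", and the document is also shorter than average.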

8. simplecto ◴[20 Nov 24 11:16 UTC] No.42192805[source]▶

https://www.youtube.com/watch?v=ENFW1uHsrLM

Those are some really spicy opinions. It would seem that many search experts might not agree.

David Tippet (formerly OpenSearch and now at GitHub)

A great podcast with David Tippet and Nicolay Gerold entitled:

"BM25 is the workhorse of search; vectors are its visionary cousin"

replies(2): >>42192855 #>>42193450 #

9. mark_l_watson ◴[20 Nov 24 11:16 UTC] No.42192810[source]▶

>>42191251 #

Thanks, yesterday I was thinking of adding BM25 to a little side project, so a well timed plug!

Do you know of any pure Python wrapper projects for managing large numbers of text and PDF documents? I thought of using Solr or ElasticSearch but that seems too heavyweight for what I am doing. I am considering using SQLite with pysqlite3 and PyPDF2 since SQLite uses BM25. Sorry to be off topic, but I imagine many people are looking at tools for building hybrid BM25 / vector store / LLM applications.

replies(1): >>42199238 #
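As a rough illustration of the SQLite route mentioned in this comment: SQLite's FTS5 extension exposes a built-in `bm25()` ranking function. The sketch below uses Python's standard `sqlite3` module rather than pysqlite3, and it assumes the underlying SQLite build has FTS5 compiled in (most recent builds do, but this is not guaranteed). The table name, column names and rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table; this fails if SQLite was built without FTS5.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Ranking", "BM25 is a ranking function used by search engines"),
        ("Cooking", "A recipe for tomato soup"),
        ("Vectors", "Embeddings enable semantic search beyond keywords"),
    ],
)
# bm25() returns numerically lower values for better matches,
# so an ascending sort puts the best match first.
rows = conn.execute(
    "SELECT title, bm25(docs) AS score FROM docs "
    "WHERE docs MATCH ? ORDER BY score",
    ("search",),
).fetchall()
titles = [title for title, _ in rows]
```

Text extracted from PDFs (for example with PyPDF2, as suggested above) could be inserted into the `body` column in the same way.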

10. mrbungie ◴[20 Nov 24 11:20 UTC] No.42192828[source]▶

Are those the same market pressures that made Google discard or repurpose a lot of working old search tech for new shiny ML-based search tech? The same tech that makes you add "+reddit" in every search so you can evade the adversarial SEO war?

PS: Ancient != bad. I don't know what weird technologist take worries about the age of an invention/discovery of a technique rather than its usefulness.

replies(1): >>42193425 #

11. dumb1224 ◴[20 Nov 24 11:25 UTC] No.42192855{3}[source]▶

>>42192805 #

Agreed. In the 2000s it was all about BM25 in the NLP community. In my experience, I hardly saw any paper that did not mention it.

replies(2): >>42193496 #>>42193948 #

12. hubraumhugo ◴[20 Nov 24 11:58 UTC] No.42193073[source]▶

Given the recent advances in vector-based semantic search, what's the SOTA search stack that people are using for hybrid keyword + semantic search these days?

replies(7): >>42193208 #>>42193787 #>>42193816 #>>42193909 #>>42193922 #>>42193932 #>>42194089 #

13. emschwartz ◴[20 Nov 24 12:15 UTC] No.42193208[source]▶

Most of the commercial and open source offerings for hybrid search seem to be using BM25 + vector similarity search based on embeddings. The results are combined using Reciprocal Rank Fusion (RRF).

The RRF paper is impressive in how incredibly simple it is (the paper is only 2 pages): https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf

replies(2): >>42193625 #>>42195796 #
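To show how little there is to RRF, here is a minimal sketch. Each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in, with k = 60 as used in the linked paper. The function name and the two toy rankings are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking."""
    fused = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Toy result lists (illustrative only).
bm25_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
# fused == ["d1", "d3", "d2", "d4"]
```

Because only ranks are used, the BM25 scores and the vector similarities never need to be put on a common scale, which is a large part of the method's appeal.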

14. marcyb5st ◴[20 Nov 24 12:16 UTC] No.42193216{4}[source]▶

>>42192073 #

Thanks, really appreciate it!

15. RA_Fisher ◴[20 Nov 24 12:42 UTC] No.42193425{3}[source]▶

>>42192828 #

Google’s come a long way since PageRank + terms. Ancient doesn’t mean bad, but usually it means outdated and that’s the case here. Search algos are subsumed by learning models, our species can do better now.

replies(1): >>42193690 #

16. RA_Fisher ◴[20 Nov 24 12:43 UTC] No.42193439{3}[source]▶

>>42192735 #

BM25 can be used as a starting point for a statistical learning model and more readily built on. A key advantage is that one gains a systematic way to reduce edge cases, instead of handling a couple, bc they’re so large as to be noticeable.

17. RA_Fisher ◴[20 Nov 24 12:46 UTC] No.42193450{3}[source]▶

>>42192805 #

I’m sure Search experts would disagree, because it’d be their technology they’d be admitting is inferior to another. BM25 is the workhorse, no doubt— but it’s also not the best anymore. Vectors are a step toward learning models, but only a small mid-range step vs. an explicit model.

Search is a useful approach for computing learning models, but there’s a difference between the computational means and the model. For example, MIPS is a very useful search algo for computing learning models (but first the learning model has to be formulated).

replies(3): >>42193880 #>>42194290 #>>42197352 #

18. RA_Fisher ◴[20 Nov 24 12:52 UTC] No.42193496{4}[source]▶

>>42192855 #

For sure, it’s very popular, just not the best anymore (and actually far from it).

19. mbreese ◴[20 Nov 24 13:18 UTC] No.42193690{4}[source]▶

>>42193425 #

So, I’m not entirely sure if I follow you here… How would one use a language model to find a document out of a corpus of existing documents? As opposed to finding an answer to a question, trained on documents, which I can see. I mean answering a query like “find the report containing X”?

I see search as encompassing at least two separate, but related, domains: information gathering/seeking (answering a question) and information retrieval (find the best matching document). I’m curious how LLMs can help with the latter.

replies(1): >>42194869 #

20. d4rkp4ttern ◴[20 Nov 24 13:32 UTC] No.42193787[source]▶

In the Langroid[1] LLM library we have a clean, extensible RAG implementation in the DocChatAgent[2] -- it uses several retrieval techniques, including lexical (bm25, fuzzy search) and semantic (embeddings), and re-ranking (using cross-encoder, reciprocal-rank-fusion) and also re-ranking for diversity and lost-in-the-middle mitigation:

[1] Langroid - a multi-agent LLM framework from CMU/UW-Madison researchers https://github.com/langroid/langroid

[2] DocChatAgent Implementation - https://github.com/langroid/langroid/blob/main/langroid/agen...

Start with the answer_from_docs method and follow the trail.

Incidentally I see you're the founder of Kadoa -- Kadoa-snack is one of my favorite daily tools to find LLM-related HN discussions!

21. noduerme ◴[20 Nov 24 13:36 UTC] No.42193816[source]▶

A generic search strategy is so different from something you want to target. The task should probably determine the tool.

So I don't know the answer, but I was recently handed about 3 million surveys with 10 free-form writing fields each, and tasked with surfacing the ones that might require action on the part of the company. I chose to use a couple of different small classifier models, manually strip out some common words based on obvious noise in the first 10k results, and then weight the model responses. It turned out to be almost flawless. I would NOT call this sort of thing "programming", it's more just tweaking the black-box output of various different tools until you have a set of results that looks good for your test cases. (And your client ;)

All stitching together small Hugging Face models running on a tiny server in nodejs, btw.

replies(1): >>42198653 #

22. simplecto ◴[20 Nov 24 13:44 UTC] No.42193880{4}[source]▶

>>42193450 #

It seems that the current mode (eg fashion) is a hybrid approach, with vector results on one side, BM25 on the other, and then a re-rank algo to smooth things out.

I'm out of my depth here but genuinely interested and curious to see over the horizon.

replies(2): >>42193942 #>>42196684 #

23. treprinum ◴[20 Nov 24 13:48 UTC] No.42193909[source]▶

text-embedding-3-large + SPLADE + RRF

24. khaki54 ◴[20 Nov 24 13:51 UTC] No.42193922[source]▶

We're doing something like BM25 with a semantic ontology enhanced query (naive example: search for truck hits on Ford F-150, even if truck never appears in the doc) then vector based reranking. In testing, we always get the best result in the top 3.
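The query-expansion step in this comment can be sketched roughly as follows. This is a toy illustration only: the ontology dictionary, the function names and the substring matching are all assumptions made for the example, and the BM25 scoring and vector reranking stages are left out.

```python
# Hypothetical toy ontology mapping a concept to more specific terms.
ONTOLOGY = {
    "truck": ["ford f-150", "pickup"],
    "car": ["sedan"],
}


def expand_query(query, ontology=ONTOLOGY):
    """Return the query term plus any narrower terms from the ontology."""
    term = query.lower()
    return [term] + ontology.get(term, [])


def matches(doc, terms):
    """True if any of the expanded terms occurs in the document text."""
    text = doc.lower()
    return any(t in text for t in terms)


# Toy documents (illustrative only).
docs = ["2019 Ford F-150 for sale", "Used sedan, low mileage"]
hits = [d for d in docs if matches(d, expand_query("truck"))]
# hits == ["2019 Ford F-150 for sale"], although "truck" never appears in it
```

In a fuller pipeline the expanded terms would be passed to a BM25 index instead of a substring check, and the candidates would then be reranked by vector similarity.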

25. ◴[20 Nov 24 13:52 UTC] No.42193930[source]▶

26. dmezzetti ◴[20 Nov 24 13:52 UTC] No.42193932[source]▶

Excellent article on BM25!

Author of txtai [1] here. txtai implements a performant BM25 index in Python [2] via the arrays package and storing the term frequency vectors in SQLite.

With txtai, the hybrid index approach [3] supports both convex combination when BM25 scores are normalized and reciprocal rank fusion (RRF) when they aren't [4].

[1] https://github.com/neuml/txtai

[2] https://neuml.hashnode.dev/building-an-efficient-sparse-keyw...

[3] https://neuml.hashnode.dev/benefits-of-hybrid-search

[4] https://github.com/neuml/txtai/blob/master/src/python/txtai/...
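For readers unfamiliar with the convex combination mentioned in this comment, here is a minimal generic sketch. It is not txtai's implementation or API: the min-max normalization, the function names, the `alpha` weight and the toy scores are all assumptions made for illustration.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def convex_combination(sparse, dense, alpha=0.5):
    """Blend normalized scores: alpha * dense + (1 - alpha) * sparse."""
    sparse_n = min_max_normalize(sparse)
    dense_n = min_max_normalize(dense)
    doc_ids = set(sparse_n) | set(dense_n)
    return {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in doc_ids
    }


# Toy scores (illustrative only): raw BM25 scores and cosine similarities.
bm25 = {"d1": 12.0, "d2": 7.0, "d3": 2.0}
cosine = {"d1": 0.2, "d2": 0.9, "d3": 0.6}
hybrid = convex_combination(bm25, cosine, alpha=0.5)
best = max(hybrid, key=hybrid.get)
```

The normalization step matters because raw BM25 scores are unbounded while cosine similarities are not; without a common scale, a rank-based method such as RRF is the safer choice.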

27. authorfly ◴[20 Nov 24 13:53 UTC] No.42193942{5}[source]▶

>>42193880 #

Out of interest how come you use the word "mode" here?

replies(1): >>42194037 #

28. authorfly ◴[20 Nov 24 13:55 UTC] No.42193948{4}[source]▶

>>42192855 #

And dependency chaining. But yes, lots of BM25.

The 2000s and even 2010s were a wonderful and fairly theoretical time for linguistics and NLP. A time when NLP seemed to harbor real anonymized general information to make the right decisions with, without impinging on privacy.

Oh to go back.

29. DavidPP ◴[20 Nov 24 14:02 UTC] No.42194005[source]▶

We use https://typesense.org/ for regular search, but it now has support for doing hybrid search, curious if anyone has tried it yet?

replies(1): >>42195777 #

30. simplecto ◴[20 Nov 24 14:07 UTC] No.42194037{6}[source]▶

>>42193942 #

because the space moves fast, and from my learning this is the current thing. Like fashion -- it changes from season to season

31. softwaredoug ◴[20 Nov 24 14:15 UTC] No.42194089[source]▶

My opinion is people need to not focus on one stack. But be prepared to use tools best for each job. Elasticsearch for BM25 type things. Turbopuffer for simple and fast vector retrieval. Even redis to precompute results for certain queries. Or certain extremely dynamic attributes that change frequently like price. Combine all these in a scatter/gather approach.

I say that because almost always you have a layer outside the search stack(s) that ideally can just be a straightforward inference service for reranking that looks most like other ML infra.

You also almost always route queries to different backends based on an understanding of the user’s query. Routing “lookup by ID” to a different system than “fuzzy semantic search”. These are very different data structures. And search almost always covers very broad/different use cases.

I think it’s an anti-pattern to just push all work to one system. Each system is ideal for different workloads. And their inference capabilities won’t ever keep pace with the general ML tooling that your ML engineers are used to. (I tried with Elasticsearch Learning to Rank and it’s a hopeless task.)

(That said, Vespa is probably the best 'single stack' that tries to solve a broad range of use-cases.)
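The query-routing idea in this comment might look something like the sketch below. The backend names, the ID pattern and the word-count threshold are all hypothetical; a production router would more likely use a trained query classifier than hand-written rules.

```python
import re

# Hypothetical pattern for identifiers such as "SKU-10432".
ID_PATTERN = re.compile(r"[A-Za-z]{2,5}-\d+")


def route_query(query):
    """Pick a (hypothetical) backend name based on the shape of the query."""
    q = query.strip()
    if ID_PATTERN.fullmatch(q):
        return "id_lookup"      # exact key lookup, e.g. a key-value store
    if len(q.split()) <= 3:
        return "keyword_bm25"   # short keyword queries
    return "vector_search"      # longer natural-language queries


queries = [
    "SKU-10432",
    "red running shoes",
    "something comfortable for long walks in the rain",
]
routes = [route_query(q) for q in queries]
# routes == ["id_lookup", "keyword_bm25", "vector_search"]
```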

32. softwaredoug ◴[20 Nov 24 14:35 UTC] No.42194229[source]▶

I think there are also incentives to "sell new things". That's always been the case in search which has had a bazillion trends and "AI related things" as long as I've worked in it. We have massively VC funded vector search companies with armies of tech evangelists pushing a specific point of view right now.

Meanwhile, the amount of manual curation and basic, boring hand-curated taxonomies that actually drive things like "semantic search" at places like Google is simply staggering. Just nobody talks about them much at conferences because they're not very sexy.

33. softwaredoug ◴[20 Nov 24 14:42 UTC] No.42194290{4}[source]▶

>>42193450 #

I don't know a lot of search practitioners who don't want to use the "new sexy" thing. Most of us do a fair amount of "resume driven development" so we can claim to be "AI Engineers" :)

replies(1): >>42195479 #

34. tselvaraj ◴[20 Nov 24 14:43 UTC] No.42194295[source]▶

https://github.com/softwaredoug/searcharray

Hybrid search solves the long-standing challenge of relevance with search results. We can use ranking fusion between keyword and vector to create a hybrid search that works in most scenarios.

35. softwaredoug ◴[20 Nov 24 14:44 UTC] No.42194312[source]▶

>>42191251 #

If we're shameless plugging passion projects, SearchArray is a pandas extension for fulltext (BM25) search for dorking around with things in Google Colab

I'll also plug Xing Han Lu's BM25S which is very popular with similar goals:

https://github.com/xhluca/bm25s

36. ordersofmag ◴[20 Nov 24 15:40 UTC] No.42194869{5}[source]▶

>>42193690 #

That's the 'vector search' people are talking about in this discussion. Use the LLM to generate an embedding vector that represents the 'meaning' of your query. Do the same for all the documents (or better with chunks of all the documents). Find the document vector that's closest to your query vector and you have a document that has a 'meaning' similar to your query. Obviously that's just a starting point. And lots of folks are doing hybrid where they combine bm25 search with some sort of vector search (e.g. run them in parallel and combine results, or do a bm25 and then use vector search to rerank the top results).
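A toy version of the approach described in this comment is sketched below. The three-dimensional vectors are made up by hand; a real system would obtain embeddings from a model and would normally use an approximate nearest-neighbor index rather than a full sort. The document names and the `bm25_top` list are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-made "embeddings" standing in for model output.
doc_vectors = {
    "report_q3": [0.9, 0.1, 0.0],
    "lunch_menu": [0.0, 0.2, 0.9],
    "report_q4": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]


def similarity(doc_id):
    return cosine(query_vector, doc_vectors[doc_id])


# Pure vector search: rank every document by similarity to the query.
ranked = sorted(doc_vectors, key=similarity, reverse=True)

# Hybrid variant: take the top BM25 hits and rerank them by vector similarity.
bm25_top = ["lunch_menu", "report_q4"]
reranked = sorted(bm25_top, key=similarity, reverse=True)
```

Here `ranked` puts `report_q3` first and `lunch_menu` last, and the rerank moves `report_q4` ahead of `lunch_menu` even though BM25 had placed it second.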

37. MPSimmons ◴[20 Nov 24 16:14 UTC] No.42195353[source]▶