Understanding the BM25 full text search algorithm

Nice write-up.

A few more details/background that are harder to find: "BM25" stands for "Best Matching 25", "best matching" becaue it is a formula for ranking and term weighting (the matching refers to the term in the query versus the document), and the number 25 simply indicates a running number (there were 24 earlier formula variants and some later ones, but #25 turned out to work best, so it was the one that was published).

It was conceived by Stephen Robertson and Karen Spärck Jones (the latter of IDF fame) and first implemented in the former's OKAPI information retrieval (research) system. The OKAPI system was benchmarked at the annual US NIST TREC (Text Retrieval Conference) for a number of years, the international "World Champtionship" of search engine methods (although the event is not about winning, but about compariing notes and learning from each other, a highly recommended annual event held every November in Gaithersburg, Maryland, attended by global academic and industry teams that conduct research on improving search - see trec.nist.gov).

Besides the "bag of words" Vector Space Model (sparse vectors of terms), the Probabilistic Modles (that BM25 belongs to), there are suprising and still growing number of other theoretical frameworks how to rank a set of documents, given a query ("Divergence from Randomness", "Statistical Language Modeling, "Learning to Rank", "Quantum Information Retrieval", "Neural Ranking" etc.). Conferences like ICTIR and SIGIR still publish occasionaly entirely new paradigms for search. Note that the "Statistical Language Modeling" paradigm is not about Large Language Models that are on vogue now (that's covered under the "Neural Retrieval" umbrella), and that "Quantum IR" is not going to get you to a tutorial about Quantum Information Retrieval but to methods of infrared spectroscopy or a company with the same name that produces cement; such are the intricacies of search technology, even in the 21st century.

If you want to play with BM25 and compare it with some of the alternatives, I recommend the research platform Terrier, and open-source search engine developed at the University of Glasgow (today, perhaps the epicenter of search research).

BM25 is over a quarter century old, but has proven to be a hard baseline to beat (it is still often used as a reference point for comparing new nethods against), and a more recent variant, BM24F, can deal with multiple fields and hypertext (e.g. title, body of documents, hyperlinks).

The recommended paper to read is: Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management 36(6): 779–808, and its successor, Part 2. (Sadly they are not open access.)