A few more details/background that are harder to find: "BM25" stands for "Best Matching 25", "best matching" becaue it is a formula for ranking and term weighting (the matching refers to the term in the query versus the document), and the number 25 simply indicates a running number (there were 24 earlier formula variants and some later ones, but #25 turned out to work best, so it was the one that was published).
It was conceived by Stephen Robertson and Karen Spärck Jones (the latter of IDF fame) and first implemented in the former's OKAPI information retrieval (research) system. The OKAPI system was benchmarked at the annual US NIST TREC (Text Retrieval Conference) for a number of years, the international "World Champtionship" of search engine methods (although the event is not about winning, but about compariing notes and learning from each other, a highly recommended annual event held every November in Gaithersburg, Maryland, attended by global academic and industry teams that conduct research on improving search - see trec.nist.gov).
Besides the "bag of words" Vector Space Model (sparse vectors of terms), the Probabilistic Modles (that BM25 belongs to), there are suprising and still growing number of other theoretical frameworks how to rank a set of documents, given a query ("Divergence from Randomness", "Statistical Language Modeling, "Learning to Rank", "Quantum Information Retrieval", "Neural Ranking" etc.). Conferences like ICTIR and SIGIR still publish occasionaly entirely new paradigms for search. Note that the "Statistical Language Modeling" paradigm is not about Large Language Models that are on vogue now (that's covered under the "Neural Retrieval" umbrella), and that "Quantum IR" is not going to get you to a tutorial about Quantum Information Retrieval but to methods of infrared spectroscopy or a company with the same name that produces cement; such are the intricacies of search technology, even in the 21st century.
If you want to play with BM25 and compare it with some of the alternatives, I recommend the research platform Terrier, and open-source search engine developed at the University of Glasgow (today, perhaps the epicenter of search research).
BM25 is over a quarter century old, but has proven to be a hard baseline to beat (it is still often used as a reference point for comparing new nethods against), and a more recent variant, BM24F, can deal with multiple fields and hypertext (e.g. title, body of documents, hyperlinks).
The recommended paper to read is: Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management 36(6): 779–808, and its successor, Part 2. (Sadly they are not open access.)
> BM25F (or the BM25 model with Extension to Multiple Weighted Fields) is a modification of BM25 in which the document is considered to be composed from several fields (such as headlines, main text, anchor text) https://en.wikipedia.org/wiki/Okapi_BM25
Some papers are linked in the references
What specific modern statistical approaches are you seeing as superior replacements for BM25 in practical applications? I'm particularly interested in how they handle edge cases like rare terms and document length normalization that BM25 was explicitly designed to address.
While I agree learning-based approaches have shown impressive results, could you elaborate on what you mean by search being "strictly dominated" by learning methods? Are you referring to specific benchmarks or real-world applications?
David Tippet (formerly opensearch and now at Github)
A great podcast with David Tippet and Nicolay Gerold entitled:
"BM25 is the workhorse of search; vectors are its visionary cousin"
Do you know of any pure Python wrapper projects for managing large numbers of text and PDF documents? I thought of using Solr or ElasticSearch but that seems too heavy weight for what I am doing. I am considering using SQLite with pysqlite3 and PyPDF2 since SQLite uses BM25. Sorry to be off topic, but I imagine many people are looking at tools for building hybrid BM25 / vector store / LLM applications.
PS: Ancient != bad. I don't know what weird technologist take worries about the age of an invention/discovery of a technique rather than its usefulness.
The RRF paper is impressive in how incredibly simple it is (the paper is only 2 pages): https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf
Search is a useful approach for computing learning models, but there’s a difference between the computational means and the model. For example, MIPS is a very useful search algo for computing learning models (but first the learning model has to be formulated).
I see search as encompassing at least two separate, but related, domains: information gathering/seeking (answering a question) and information retrieval (find the best matching document). I’m curious how LLMs can help with the later.
[1] Langroid - a multi-agent LLM framework from CMU/UW-Madison researchers https://github.com/langroid/langroid
[2] DocChatAgent Implementation - https://github.com/langroid/langroid/blob/main/langroid/agen...
Start with the answer_from_docs method and follow the trail.
Incidentally I see you're the founder of Kadoa -- Kadoa-snack is one of favorite daily tools to find LLM-related HN discussions!
So I don't know the answer, but I was recently handed about 3 million surveys with 10 free-form writing fields each, and tasked with surfacing the ones that might require action on the part of the company. I chose to use a couple of different small classifier models, manually strip out some common words based on obvious noise in the first 10k results, and then weight the model responses. It turned out to be almost flawless. I would NOT call this sort of thing "programming", it's more just tweaking the black-box output of various different tools until you have a set of results that looks good for your test cases. (And your client ;)
All stitching together small Hugging Face models running on a tiny server in nodejs, btw.
I'm out of my depth here but genuinely interested and curious to see over the horizon.
Author of txtai [1] here. txtai implements a performant BM25 index in Python [2] via the arrays package and storing the term frequency vectors in SQLite.
With txtai, the hybrid index approach [3] supports both convex combination when BM25 scores are normalized and reciprocal rank fusion (RRF) when they aren't [4].
[1] https://github.com/neuml/txtai
[2] https://neuml.hashnode.dev/building-an-efficient-sparse-keyw...
[3] https://neuml.hashnode.dev/benefits-of-hybrid-search
[4] https://github.com/neuml/txtai/blob/master/src/python/txtai/...
The 2000s and even 2010s was a wonderful and fairly theoretical time for linguistics and NLP. A time when NLP seemed to harbor real anonymized general information to make the right decisions with, without impinging on privacy.
Oh to go back.
I say that because almost always you have a layer outside the search stack(s) that ideally can just be a straightforward inference service for reranking that looks most like other ML infra.
You also almost always route queries to different backends based on an understanding of the users query. Routing “lookup by ID” to a different system than “fuzzy semantic search”. These are very different data structures. And search almost always covers very broad/different use cases.
I think it’s an anti pattern to just push all work to one system. Each system is ideal for different workloads. And their inference capabilities won’t ever keep pace with the general ML tooling that your ML engineers are used to. (I tried with Elasticsearch Learning to Rank and its a hopeless task.)
(That said, Vespa is probably the best 'single stack' that tries to solve a broad range of use-cases.)
Meanwhile, the amount of manual curation, basic, boring hand-curated taxonomies that actually drive things like "semantic search" at places like Google are simply staggering. Just nobody talks about them much at conferences because they're not very sexy.
https://github.com/softwaredoug/searcharray
I'll also plug Xing Han Lu's BM25S which is very popular with similar goals:
Do you know of a search dataset with very large document length differences? MSMarco for example is pretty consistent in length.
Overall I'm really happy to see Typesense mentioned here.
A lot of the smaller scale RAG projects, etc you see around would be well served by Typesense but it seems to be relatively unknown for whatever reasons. It's probably one of the easiest solutions to deploy, has reasonable defaults, good docs, easy clustering, etc while still be very capable, performant, and powerful if you need to dig in further.
Link for more details: https://trec.nist.gov/
I had actually implemented full text search + vector search using RRF but I kept it disabled by default because it wasn't meaningfully improving my results. This seems like a good hypothesis as to why.
You can decide if you agree that most people are sufficiently statistically literate in that group of people. But some humility around statistics is probably far up there in what I personally interview for.
There are two components of search that are really important to understand why BM25 (will likely) not go away for a long time. The first is precision and the second is recall. Precision is the measure of how many relevant results were returned in light of all the results returned. A completely precise search would return only the relevant results and no irrelevant results.
Recall on the other hand measures how many of all the relevant results were returned. For example, if our search only returns 5 results but we know that there were 10 relevant search results that should have been returned we would say the recall is 50%.
These two components are always at odds with each other. Vector search excels at increasing recall. It is able to find documents that are semantically similar. The problem with this is semantically similar documents might not actually be what the user is looking for. This is because vectors are only a representation of user intent.
Heres an example: A user looks up "AWS Config". Vector search would read this and may rate it as similar to ["amazon web services configuration", "cloud configuration", "infrastructure as a service setup"]. In this case the user was looking for a file called, "AWS.config". Vector search is inherently imprecise. It is getting better but it's not replacing BM25 as a scoring mechanism any time soon.
You don't have to believe me though. Weaviate, Vespa, Qdrant all support BM25 search for a reason. Here is an in depth blog that dives more into hybrid search: https://opensearch.org/blog/hybrid-search/
As an aside, vector search is also much more expensive than BM25. It's very hard to scale and get precise results.
You can store your text and PDFs in SQLite (or their filenames) and use the FTS5 infrastructure to do tokenization, query execution, and ranking. You can write your own tokenizer in Python, as well as ranking functions. A pure Python tokenizer for HTML is included, as well as a pure Python implementation of BM25.
You can chain tokenizers so it is just a few lines of code to call pypdf's extract_text method, and then have the bundled UnicodeWords tokenizer properly extract tokens/words, and Simplify to do case folding and accent stripping if desired.
There is a lot more useful functionality, all done from Python. You can see code in action in the example/tour at https://rogerbinns.github.io/apsw/example-fts.html