Understanding the BM25 full text search algorithm

1. RA_Fisher ◴[20 Nov 24 10:47 UTC] No.42192651[source]▶

>>42190650 (OP) #

BM25 is an ancient algo developed in the 1970s. It’s basically a crappy statistical model and statisticians can do far better today. Search is strictly dominated by learning (that yes, can use search as an input). Not many folks realize that yet, and / or are incentivized to keep the old tech going as long as possible, but market pressures will change that.

replies(4): >>42192735 #>>42192805 #>>42192828 #>>42194229 #

2. netdur ◴[20 Nov 24 11:03 UTC] No.42192735[source]▶

>>42192651 (TP) #

While BM25 did emerge from earlier work in the 1970s and 1980s (specifically building on the probabilistic ranking principle), I'm curious about your perspective on a few things:

What specific modern statistical approaches are you seeing as superior replacements for BM25 in practical applications? I'm particularly interested in how they handle edge cases like rare terms and document length normalization that BM25 was explicitly designed to address.
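For readers unfamiliar with the two edge cases named above, here is a minimal sketch of BM25 scoring. It assumes pre-tokenized documents, the commonly used defaults k1 = 1.2 and b = 0.75, and the always-positive IDF variant used by Lucene; the function name `bm25_scores` is hypothetical and not taken from any library.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms: a low document frequency gives a larger IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: b controls how strongly documents longer
            # than average are penalized; k1 caps term-frequency saturation.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [
    ["aws", "config", "file"],
    ["cloud", "config"],
    ["aws", "aws", "cloud", "setup", "guide", "notes"],
]
scores = bm25_scores(["aws", "config"], docs)
```

With this toy corpus the first document, which contains both query terms, scores higher than the second, which contains only one of them.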

While I agree learning-based approaches have shown impressive results, could you elaborate on what you mean by search being "strictly dominated" by learning methods? Are you referring to specific benchmarks or real-world applications?

replies(1): >>42193439 #

3. simplecto ◴[20 Nov 24 11:16 UTC] No.42192805[source]▶

>>42192651 (TP) #

Those are some really spicy opinions. It would seem that many search experts might not agree.

David Tippet (formerly at OpenSearch, now at GitHub)

A great podcast with David Tippet and Nicolay Gerold entitled:

"BM25 is the workhorse of search; vectors are its visionary cousin"

https://www.youtube.com/watch?v=ENFW1uHsrLM

replies(2): >>42192855 #>>42193450 #

4. mrbungie ◴[20 Nov 24 11:20 UTC] No.42192828[source]▶

>>42192651 (TP) #

Are those the same market pressures that made Google discard or repurpose a lot of working old search tech for new shiny ML-based search tech? The same tech that makes you add "+reddit" in every search so you can evade the adversarial SEO war?

PS: Ancient != bad. I don't know what weird technologist take worries about the age of an invention/discovery of a technique rather than its usefulness.

replies(1): >>42193425 #

5. dumb1224 ◴[20 Nov 24 11:25 UTC] No.42192855[source]▶

>>42192805 #

Agreed. In the 2000s it was all about BM25 in the NLP community. In my experience, hardly any paper failed to mention it.

replies(2): >>42193496 #>>42193948 #

6. RA_Fisher ◴[20 Nov 24 12:42 UTC] No.42193425[source]▶

>>42192828 #

Google’s come a long way since PageRank + terms. Ancient doesn’t mean bad, but usually it means outdated and that’s the case here. Search algos are subsumed by learning models, our species can do better now.

replies(1): >>42193690 #

7. RA_Fisher ◴[20 Nov 24 12:43 UTC] No.42193439[source]▶

>>42192735 #

BM25 can be used as a starting point for a statistical learning model and more readily built on. A key advantage is that one gains a systematic way to reduce edge cases, instead of handling a couple, bc they’re so large as to be noticeable.

8. RA_Fisher ◴[20 Nov 24 12:46 UTC] No.42193450[source]▶

>>42192805 #

I’m sure Search experts would disagree, because it’d be their technology they’d be admitting is inferior to another. BM25 is the workhorse, no doubt— but it’s also not the best anymore. Vectors are a step toward learning models, but only a small mid-range step vs. an explicit model.

Search is a useful approach for computing learning models, but there’s a difference between the computational means and the model. For example, MIPS is a very useful search algo for computing learning models (but first the learning model has to be formulated).

replies(3): >>42193880 #>>42194290 #>>42197352 #

9. RA_Fisher ◴[20 Nov 24 12:52 UTC] No.42193496{3}[source]▶

>>42192855 #

For sure, it’s very popular, just not the best anymore (and actually far from it).

10. mbreese ◴[20 Nov 24 13:18 UTC] No.42193690{3}[source]▶

>>42193425 #

So, I’m not entirely sure if I follow you here… How would one use a language model to find a document out of a corpus of existing documents? As opposed to finding an answer to a question, trained on documents, which I can see. I mean answering a query like “find the report containing X”?

I see search as encompassing at least two separate, but related, domains: information gathering/seeking (answering a question) and information retrieval (finding the best matching document). I’m curious how LLMs can help with the latter.

replies(1): >>42194869 #

11. simplecto ◴[20 Nov 24 13:44 UTC] No.42193880{3}[source]▶

>>42193450 #

It seems that the current mode (e.g. fashion) is a hybrid approach, with vector results on one side, BM25 on the other, and then a re-rank algo to smooth things out.
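One possible way to merge the two result lists is reciprocal rank fusion (RRF). This is a sketch of that specific technique, chosen here as an assumption; the comment does not say which re-rank method is meant. The constant k = 60 is the value commonly used with RRF, and the document ids are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents ranked highly
            # in several lists accumulate the largest totals.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["d1", "d2", "d3"]      # hypothetical BM25 results
vector_ranking = ["d2", "d4", "d1"]    # hypothetical vector results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

RRF only needs ranks, not raw scores, which sidesteps the problem that BM25 scores and vector similarities are on different scales.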

I'm out of my depth here but genuinely interested and curious to see over the horizon.

replies(2): >>42193942 #>>42196684 #

12. authorfly ◴[20 Nov 24 13:53 UTC] No.42193942{4}[source]▶

>>42193880 #

Out of interest how come you use the word "mode" here?

replies(1): >>42194037 #

13. authorfly ◴[20 Nov 24 13:55 UTC] No.42193948{3}[source]▶

>>42192855 #

And dependency chaining. But yes, lots of BM25.

The 2000s and even 2010s were a wonderful and fairly theoretical time for linguistics and NLP. A time when NLP seemed to harbor real anonymized general information to make the right decisions with, without impinging on privacy.

Oh to go back.

14. simplecto ◴[20 Nov 24 14:07 UTC] No.42194037{5}[source]▶

>>42193942 #

Because the space moves fast, and from what I've learned this is the current thing. Like fashion, it changes from season to season.

replies(1): >>42213940 #

15. softwaredoug ◴[20 Nov 24 14:35 UTC] No.42194229[source]▶

>>42192651 (TP) #

I think there are also incentives to "sell new things". That's always been the case in search which has had a bazillion trends and "AI related things" as long as I've worked in it. We have massively VC funded vector search companies with armies of tech evangelists pushing a specific point of view right now.

Meanwhile, the amount of manual curation, basic, boring hand-curated taxonomies that actually drive things like "semantic search" at places like Google are simply staggering. Just nobody talks about them much at conferences because they're not very sexy.

16. softwaredoug ◴[20 Nov 24 14:42 UTC] No.42194290{3}[source]▶

>>42193450 #

I don't know a lot of search practitioners who don't want to use the "new sexy" thing. Most of us do a fair amount of "resume driven development" so can claim to be "AI Engineers" :)

replies(1): >>42195479 #

17. ordersofmag ◴[20 Nov 24 15:40 UTC] No.42194869{4}[source]▶

>>42193690 #

That's the 'vector search' people are talking about in this discussion. Use the LLM to generate an embedding vector that represents the 'meaning' of your query. Do the same for all the documents (or, better, for chunks of the documents). Find the document vector that's closest to your query vector and you have a document that has a 'meaning' similar to your query. Obviously that's just a starting point. And lots of folks are doing hybrid where they combine BM25 search with some sort of vector search (e.g. run them in parallel and combine results, or do a BM25 search and then use vector search to rerank the top results).
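A minimal sketch of the "find the closest document vector" step, assuming the embeddings already exist. The two-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions), cosine similarity is one common choice of closeness, and the brute-force scan stands in for the approximate nearest-neighbor index a real system would likely use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query_vec, doc_vecs, k=1):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings keyed by document id.
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
top = nearest([1.0, 0.05], doc_vecs, k=2)
```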

18. RA_Fisher ◴[20 Nov 24 16:24 UTC] No.42195479{4}[source]▶

>>42194290 #

I don’t think it’s realistic to think that software engineers can pick up advanced statistical modeling on the job, unless they’re pairing with statisticians. There’s just too much background involved.

replies(2): >>42196352 #>>42197148 #

19. binarymax ◴[20 Nov 24 17:40 UTC] No.42196352{5}[source]▶

>>42195479 #

Your overall condescending attitude in this thread is really disgusting.

replies(1): >>42196528 #

20. RA_Fisher ◴[20 Nov 24 18:00 UTC] No.42196528{6}[source]▶

>>42196352 #

Statisticians are famously disliked, especially by engineers (there are open-minded folks, of course! maybe they’d taken some econometrics or statistics, are exceptionally humble, etc). There are some interesting motives and incentives around that. Sometimes I think in part it’s because many people would prefer their existing beliefs be upheld as opposed to challenged, even if they’re not well-supported (and likely to lead to bad decisions and outcomes). Sticking with outdated technology is one example.

21. RA_Fisher ◴[20 Nov 24 18:18 UTC] No.42196684{4}[source]▶

>>42193880 #

Best is to use one statistical model and encode the underlying aspects of the context that relate to goal outcomes.

22. softwaredoug ◴[20 Nov 24 19:14 UTC] No.42197148{5}[source]▶

>>42195479 #

The "search practitioners" I'm referring to are pretty uniformly ML Engineers. They also work on feeds, recommendations, and adjacent Information Retrieval spaces. Both to generate L0 retrieval candidates and to do higher layers of reranking with learning to rank and other systems to whatever the system's goal is...

You can decide if you agree that most people are sufficiently statistically literate in that group of people. But some humility around statistics is probably far up there in what I personally interview for.

replies(1): >>42197737 #

23. dtaivpp ◴[20 Nov 24 19:38 UTC] No.42197352{3}[source]▶

>>42193450 #

I have been summoned. Hey it's David from the podcast. As someone who builds search for users every day and shaped the user experience for vector search at OpenSearch I assure you no one is afraid of their technology becoming inferior.

There are two components of search that are really important to understand why BM25 will (likely) not go away for a long time. The first is precision and the second is recall. Precision is the measure of how many relevant results were returned in light of all the results returned. A completely precise search would return only the relevant results and no irrelevant results.

Recall on the other hand measures how many of all the relevant results were returned. For example, if our search only returns 5 results but we know that there were 10 relevant search results that should have been returned we would say the recall is 50%.
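The two definitions can be written down in a few lines. This sketch reuses the numbers from the example above and assumes that all 5 returned results are among the 10 relevant ones; the document ids are invented.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned document ids."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    # Precision: share of returned results that are relevant.
    precision = hits / len(returned) if returned else 0.0
    # Recall: share of relevant results that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = [f"doc{i}" for i in range(10)]   # 10 relevant documents
returned = relevant[:5]                     # search returns 5 of them
precision, recall = precision_recall(returned, relevant)
```

Here precision is 1.0 and recall is 0.5, matching the 50% in the example.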

These two components are always at odds with each other. Vector search excels at increasing recall. It is able to find documents that are semantically similar. The problem with this is semantically similar documents might not actually be what the user is looking for. This is because vectors are only a representation of user intent.

Here's an example: A user looks up "AWS Config". Vector search would read this and may rate it as similar to ["amazon web services configuration", "cloud configuration", "infrastructure as a service setup"]. In this case the user was looking for a file called "AWS.config". Vector search is inherently imprecise. It is getting better but it's not replacing BM25 as a scoring mechanism any time soon.

You don't have to believe me though. Weaviate, Vespa, and Qdrant all support BM25 search for a reason. Here is an in-depth blog post that dives more into hybrid search: https://opensearch.org/blog/hybrid-search/

As an aside, vector search is also much more expensive than BM25. It's very hard to scale and get precise results.

replies(1): >>42197528 #

24. RA_Fisher ◴[20 Nov 24 19:59 UTC] No.42197528{4}[source]▶

>>42197352 #

Hi David. Nice to meet you. Yes, precision and recall are always in tension. However, both can be made simultaneously better with a more informed model. Using your example, this would be a model that encodes the concept of files in the context of a user demand surrounding AWS.

replies(1): >>42197936 #

25. RA_Fisher ◴[20 Nov 24 20:21 UTC] No.42197737{6}[source]▶

>>42197148 #

For sure. There are ML folks with statistical learning backgrounds, but it tends to be relatively rare. Physics and CS are more common. They tend to view things like you mention, more procedurally (e.g. learning to rank, minimizing distances), with less statistical modeling. Humility around statistics is good, but statistical knowledge is still what's required to really level up these systems (I've built them as well).

26. iosjunkie ◴[20 Nov 24 20:45 UTC] No.42197936{5}[source]▶

>>42197528 #

"more informed model"

Can you be specific on what you recommend instead of BM25?

replies(1): >>42200270 #

27. RA_Fisher ◴[21 Nov 24 01:55 UTC] No.42200270{6}[source]▶

>>42197936 #

Sure, this chat describes what I have in mind, https://chatgpt.com/share/673e9290-a044-8005-995b-166efe653e...

28. authorfly ◴[22 Nov 24 14:09 UTC] No.42213940{6}[source]▶

>>42194037 #

Oh right, I just wondered if it was a loan word from German. I am hearing it more and more in English.