
283 points by rrampage | 4 comments
RA_Fisher | No.42192651
BM25 is an old algorithm whose probabilistic roots go back to the 1970s. It’s basically a crappy statistical model, and statisticians can do far better today. Search is strictly dominated by learning (which, yes, can use search as an input). Not many folks realize that yet, and/or are incentivized to keep the old tech going as long as possible, but market pressures will change that.
simplecto | No.42192805
Those are some really spicy opinions. It would seem that many search experts might not agree.

David Tippet (formerly OpenSearch, now at GitHub)

A great podcast with David Tippet and Nicolay Gerold entitled:

"BM25 is the workhorse of search; vectors are its visionary cousin"

https://www.youtube.com/watch?v=ENFW1uHsrLM

RA_Fisher | No.42193450
I’m sure search experts would disagree, because it’d be their own technology they’d be admitting is inferior. BM25 is the workhorse, no doubt, but it’s also not the best anymore. Vectors are a step toward learning models, but only a small, mid-range step compared to an explicit model.

Search is a useful approach for computing learning models, but there’s a difference between the computational means and the model. For example, MIPS (maximum inner-product search) is a very useful search algorithm for computing learning models, but the learning model has to be formulated first.
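For concreteness, MIPS here just means: given a query vector, find the stored vectors with the largest inner product. A minimal brute-force sketch in Python/NumPy, with purely synthetic vectors; real systems use approximate indexes rather than a full scan:

    import numpy as np

    # Brute-force maximum inner-product search (MIPS): score every stored
    # vector against the query and take the top k. The vectors here are
    # random placeholders standing in for document/item embeddings.
    rng = np.random.default_rng(0)
    items = rng.normal(size=(10_000, 64))
    query = rng.normal(size=64)

    scores = items @ query              # inner product with every item
    top_k = np.argsort(-scores)[:5]     # indices of the 5 highest-scoring items
    print(top_k, scores[top_k])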

1. dtaivpp | No.42197352
I have been summoned. Hey, it's David from the podcast. As someone who builds search for users every day and shaped the user experience for vector search at OpenSearch, I assure you no one is afraid of their technology becoming inferior.

There are two components of search that are really important for understanding why BM25 will likely not go away for a long time: precision and recall. Precision measures how many of the results returned are actually relevant. A completely precise search would return only relevant results and no irrelevant ones.

Recall, on the other hand, measures how many of all the relevant results were returned. For example, if there are 10 relevant documents but our search returns only 5 of them, recall is 50%.
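In code, those two measures for a single query are just set arithmetic (a minimal sketch; the document IDs are made up):

    # Precision and recall for one query, given sets of document IDs.
    def precision_recall(returned, relevant):
        returned, relevant = set(returned), set(relevant)
        hits = returned & relevant
        precision = len(hits) / len(returned) if returned else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # 5 results returned, all of them relevant, but 10 relevant docs exist:
    # precision is 1.0, recall is 0.5.
    print(precision_recall(returned=[1, 2, 3, 4, 5],
                           relevant=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))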

These two measures are always at odds with each other. Vector search excels at increasing recall: it can find documents that are semantically similar to the query. The problem is that semantically similar documents might not actually be what the user is looking for, because vectors are only a representation of user intent.

Here's an example: a user looks up "AWS Config". Vector search might rate that as similar to ["amazon web services configuration", "cloud configuration", "infrastructure as a service setup"], when in this case the user was looking for a file called "AWS.config". Vector search is inherently imprecise. It is getting better, but it's not replacing BM25 as a scoring mechanism any time soon.
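The lexical side of that example is easy to make concrete. Here is a from-scratch toy of Okapi BM25 scoring (standard k1/b defaults, hypothetical tokenized documents, not any particular engine's implementation), which rewards documents containing the literal query tokens:

    import math
    from collections import Counter

    def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
        # Okapi BM25: per-term IDF weights with term-frequency saturation (k1)
        # and document-length normalization (b).
        N = len(docs_tokens)
        avgdl = sum(len(d) for d in docs_tokens) / N
        df = Counter()
        for d in docs_tokens:
            df.update(set(d))
        scores = []
        for d in docs_tokens:
            tf, dl, score = Counter(d), len(d), 0.0
            for t in query_tokens:
                if t not in tf:
                    continue
                idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
                score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
            scores.append(score)
        return scores

    docs = [["aws", "config", "file"],                  # the file the user actually wants
            ["cloud", "configuration", "guide"],
            ["infrastructure", "setup", "tutorial"]]
    print(bm25_scores(["aws", "config"], docs))         # only the first doc scores above zero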

You don't have to believe me though. Weaviate, Vespa, Qdrant all support BM25 search for a reason. Here is an in depth blog that dives more into hybrid search: https://opensearch.org/blog/hybrid-search/

As an aside, vector search is also much more expensive than BM25. It's very hard to scale while still getting precise results.

2. RA_Fisher | No.42197528
Hi David. Nice to meet you. Yes, precision and recall are always in tension. However, both can be made simultaneously better with a more informed model. Using your example, that would be a model that encodes the concept of files in the context of a user's demand surrounding AWS.
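One concrete way to read "a more informed model" (purely a sketch; the feature names and weights are invented for illustration, not fit to any data) is a learned scorer that combines lexical, semantic, and context signals such as filename matches, rather than relying on any single one:

    import math

    # Toy learned relevance scorer: a logistic model over a few hand-named
    # features. Weights and feature values are illustrative placeholders.
    def learned_score(features, weights, bias=-1.0):
        z = bias + sum(weights[name] * value for name, value in features.items())
        return 1.0 / (1.0 + math.exp(-z))       # estimated probability of relevance

    weights = {"bm25": 0.8, "cosine_sim": 0.5, "filename_exact_match": 2.0}

    # The literal AWS.config file vs. a merely semantically similar setup guide.
    print(learned_score({"bm25": 4.2, "cosine_sim": 0.31, "filename_exact_match": 1.0}, weights))
    print(learned_score({"bm25": 1.1, "cosine_sim": 0.74, "filename_exact_match": 0.0}, weights))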
3. iosjunkie | No.42197936
"more informed model"

Can you be specific about what you recommend instead of BM25?

4. RA_Fisher | No.42200270
Sure, this chat describes what I have in mind, https://chatgpt.com/share/673e9290-a044-8005-995b-166efe653e...