Understanding the BM25 full text search algorithm

(emschwartz.me)

RA_Fisher ◴[20 Nov 24 10:47 UTC] No.42192651[source]▶

>>42190650 (OP) #

BM25 is an ancient algo developed in the 1970s. It’s basically a crappy statistical model and statisticians can do far better today. Search is strictly dominated by learning (that yes, can use search as an input). Not many folks realize that yet, and / or are incentivized to keep the old tech going as long as possible, but market pressures will change that.

replies(4): >>42192735 #>>42192805 #>>42192828 #>>42194229 #

1. simplecto ◴[20 Nov 24 11:16 UTC] No.42192805[source]▶

>>42192651 #

Those are some really spicy opinions. It would seem that many search experts might not agree.

David Tippet (formerly at OpenSearch, now at GitHub)

A great podcast with David Tippet and Nicolay Gerold entitled:

"BM25 is the workhorse of search; vectors are its visionary cousin"

https://www.youtube.com/watch?v=ENFW1uHsrLM

replies(2): >>42192855 #>>42193450 #

2. dumb1224 ◴[20 Nov 24 11:25 UTC] No.42192855[source]▶

>>42192805 (TP) #

Agreed. In the 2000s it was all about BM25 in the NLP community. I hardly saw any paper that did not mention it.

replies(2): >>42193496 #>>42193948 #

3. RA_Fisher ◴[20 Nov 24 12:46 UTC] No.42193450[source]▶

>>42192805 (TP) #

I’m sure search experts would disagree, because it’d be their technology they’d be admitting is inferior to another. BM25 is the workhorse, no doubt, but it’s also not the best anymore. Vectors are a step toward learning models, but only a small mid-range step vs. an explicit model.

Search is a useful approach for computing learning models, but there’s a difference between the computational means and the model. For example, MIPS is a very useful search algo for computing learning models (but first the learning model has to be formulated).

replies(3): >>42193880 #>>42194290 #>>42197352 #

4. RA_Fisher ◴[20 Nov 24 12:52 UTC] No.42193496[source]▶

>>42192855 #

For sure, it’s very popular, just not the best anymore (and actually far from it).

5. simplecto ◴[20 Nov 24 13:44 UTC] No.42193880[source]▶

>>42193450 #

It seems that the current mode (e.g. fashion) is a hybrid approach, with vector results on one side, BM25 on the other, and then a re-rank algo to smooth things out.
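A minimal sketch of how two result lists can be merged in a hybrid setup. The comment does not name a specific re-rank method; reciprocal rank fusion (RRF) is used here only as one common option, and the document ids, the function name, and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one list.

    Each document scores 1 / (k + rank) per list it appears in (rank starts at 1).
    k is a smoothing constant; 60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a BM25 query and a vector query.
bm25_ranking = ["aws_config_file", "s3_howto", "iam_guide"]
vector_ranking = ["cloud_setup", "aws_config_file", "iaas_intro"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

A document that appears near the top of both lists ends up ahead of one that appears in only one of them, which is the "smoothing" effect described above.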

I'm out of my depth here but genuinely interested and curious to see over the horizon.

replies(2): >>42193942 #>>42196684 #

6. authorfly ◴[20 Nov 24 13:53 UTC] No.42193942{3}[source]▶

>>42193880 #

Out of interest how come you use the word "mode" here?

replies(1): >>42194037 #

7. authorfly ◴[20 Nov 24 13:55 UTC] No.42193948[source]▶

>>42192855 #

And dependency chaining. But yes, lots of BM25.

The 2000s and even 2010s were a wonderful and fairly theoretical time for linguistics and NLP. A time when NLP seemed to harbor real anonymized general information to make the right decisions with, without impinging on privacy.

Oh to go back.

8. simplecto ◴[20 Nov 24 14:07 UTC] No.42194037{4}[source]▶

>>42193942 #

because the space moves fast, and from my learning this is the current thing. Like fashion -- it changes from season to season

replies(1): >>42213940 #

9. softwaredoug ◴[20 Nov 24 14:42 UTC] No.42194290[source]▶

>>42193450 #

I don't know a lot of search practitioners who don't want to use the "new sexy" thing. Most of us do a fair amount of "resume driven development" so can claim to be "AI Engineers" :)

replies(1): >>42195479 #

10. RA_Fisher ◴[20 Nov 24 16:24 UTC] No.42195479{3}[source]▶

>>42194290 #

I don’t think it’s realistic to think that software engineers can pick up advanced statistical modeling on the job, unless they’re pairing with statisticians. There’s just too much background involved.

replies(2): >>42196352 #>>42197148 #

11. binarymax ◴[20 Nov 24 17:40 UTC] No.42196352{4}[source]▶

>>42195479 #

Your overall condescending attitude in this thread is really disgusting.

replies(1): >>42196528 #

12. RA_Fisher ◴[20 Nov 24 18:00 UTC] No.42196528{5}[source]▶

>>42196352 #

Statisticians are famously disliked, especially by engineers (there are open-minded folks, of course! maybe they’d taken some econometrics or statistics, are exceptionally humble, etc). There are some interesting motives and incentives around that. Sometimes I think in part it’s because many people would prefer their existing beliefs be upheld as opposed to challenged, even if they’re not well-supported (and likely to lead to bad decisions and outcomes). Sticking with outdated technology is one example.

13. RA_Fisher ◴[20 Nov 24 18:18 UTC] No.42196684{3}[source]▶

>>42193880 #

Best is to use one statistical model and encode the underlying aspects of the context that relate to goal outcomes.

14. softwaredoug ◴[20 Nov 24 19:14 UTC] No.42197148{4}[source]▶

>>42195479 #

The "search practitioners" I'm referring to are pretty uniformly ML Engineers. They also work on feeds, recommendations, and adjacent Information Retrieval spaces. Both to generate L0 retrieval candidates and to do higher layers of reranking with learning to rank and other systems to whatever the system's goal is...

You can decide if you agree that most people are sufficiently statistically literate in that group of people. But some humility around statistics is probably far up there in what I personally interview for.

replies(1): >>42197737 #

15. dtaivpp ◴[20 Nov 24 19:38 UTC] No.42197352[source]▶

>>42193450 #

I have been summoned. Hey it's David from the podcast. As someone who builds search for users every day and shaped the user experience for vector search at OpenSearch I assure you no one is afraid of their technology becoming inferior.

There are two components of search that are really important to understand why BM25 will likely not go away for a long time. The first is precision and the second is recall. Precision is the measure of how many relevant results were returned in light of all the results returned. A completely precise search would return only the relevant results and no irrelevant results.

Recall on the other hand measures how many of all the relevant results were returned. For example, if our search only returns 5 results but we know that there were 10 relevant search results that should have been returned we would say the recall is 50%.
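A small sketch of the two measures as defined above. The function name and document ids are made up for illustration, and the example assumes that all 5 returned results are relevant, which is what the 50% recall figure implies.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    returned: document ids the search returned
    relevant: document ids known to be relevant for the query
    """
    returned_set = set(returned)
    relevant_set = set(relevant)
    hits = len(returned_set & relevant_set)
    # Precision: share of returned results that are relevant.
    precision = hits / len(returned_set) if returned_set else 0.0
    # Recall: share of all relevant results that were returned.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# 10 relevant documents exist; the search returns 5 of them and nothing else.
relevant_docs = [f"doc{i}" for i in range(10)]
returned_docs = relevant_docs[:5]

precision, recall = precision_recall(returned_docs, relevant_docs)
print(precision, recall)
```

Here precision is 1.0 and recall is 0.5: nothing irrelevant came back, but half of the relevant documents were missed.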

These two components are always at odds with each other. Vector search excels at increasing recall. It is able to find documents that are semantically similar. The problem with this is semantically similar documents might not actually be what the user is looking for. This is because vectors are only a representation of user intent.

Here's an example: A user looks up "AWS Config". Vector search would read this and may rate it as similar to ["amazon web services configuration", "cloud configuration", "infrastructure as a service setup"]. In this case the user was looking for a file called "AWS.config". Vector search is inherently imprecise. It is getting better but it's not replacing BM25 as a scoring mechanism any time soon.
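To make the lexical side of that example concrete, here is a rough sketch of BM25 scoring over a three-document toy corpus. It is not any particular engine's implementation: the tokenizer (lowercase, split on non-alphanumerics), the IDF variant with `+ 1` inside the log, the parameters `k1=1.2` and `b=0.75`, and the document texts are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, split on anything that is not a letter or digit,
    # so "AWS.config" and "AWS Config" both become ["aws", "config"].
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in a non-empty list against the query with BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates via k1; b controls length normalization.
            denom = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "AWS.config file with credentials and region",
    "amazon web services configuration guide",
    "cloud configuration and infrastructure as a service setup",
]
scores = bm25_scores("AWS Config", docs)
print(scores)
```

Only the first document contains the literal terms "aws" and "config", so it is the only one with a non-zero score. The semantically related documents score zero, which is the precision behavior the comment attributes to BM25.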

You don't have to believe me though. Weaviate, Vespa, and Qdrant all support BM25 search for a reason. Here is an in-depth blog post that dives more into hybrid search: https://opensearch.org/blog/hybrid-search/

As an aside, vector search is also much more expensive than BM25. It's very hard to scale and get precise results.

replies(1): >>42197528 #

16. RA_Fisher ◴[20 Nov 24 19:59 UTC] No.42197528{3}[source]▶

>>42197352 #

Hi David. Nice to meet you. Yes, precision and recall are always in tension. However, both can be made simultaneously better with a more informed model. Using your example, this would be a model that encodes the concept of files in the context of a user demand surrounding AWS.

replies(1): >>42197936 #

17. RA_Fisher ◴[20 Nov 24 20:21 UTC] No.42197737{5}[source]▶

>>42197148 #

For sure. There are ML folks with statistical learning backgrounds, but it tends to be relatively rare. Physics and CS are more common. They tend to view things the way you mention, more procedurally (e.g. learning to rank, minimizing distances) and with less statistical modeling. Humility around statistics is good, but statistical knowledge is still what's required to really level up these systems (I've built them as well).

18. iosjunkie ◴[20 Nov 24 20:45 UTC] No.42197936{4}[source]▶

>>42197528 #

"more informed model"

Can you be specific on what you recommend instead of BM25?

replies(1): >>42200270 #

19. RA_Fisher ◴[21 Nov 24 01:55 UTC] No.42200270{5}[source]▶

>>42197936 #

Sure, this chat describes what I have in mind, https://chatgpt.com/share/673e9290-a044-8005-995b-166efe653e...

20. authorfly ◴[22 Nov 24 14:09 UTC] No.42213940{5}[source]▶

>>42194037 #

Oh right, I just wondered if it was a loan word from German. I am hearing it more and more in English.
