Understanding the BM25 full text search algorithm

(emschwartz.me)

305 points rrampage | 1 comments | 20 Nov 24 03:43 UTC


RA_Fisher ◴[20 Nov 24 10:47 UTC] No.42192651[source]▶

>>42190650 (OP) #

BM25 is an ancient algo developed in the 1970s. It’s basically a crappy statistical model and statisticians can do far better today. Search is strictly dominated by learning (that yes, can use search as an input). Not many folks realize that yet, and/or they are incentivized to keep the old tech going as long as possible, but market pressures will change that.
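For readers who want to see what is being called a statistical model here, the following is a minimal sketch of the commonly cited Okapi BM25 scoring function. It is an illustration, not any particular engine's implementation: the function name `bm25_scores`, the whitespace tokenization, the toy documents, the defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`.

    Assumes `docs` is a non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant; the "+ 1.0" keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation (k1) with document-length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, hypothetical and only for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "statistical models for ranking".split(),
]
scores = bm25_scores("cat mat".split(), docs)
```

In this toy corpus the first document matches both query terms, the second matches one, and the third matches none, so the scores come out in that order.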

replies(4): >>42192735 #>>42192805 #>>42192828 #>>42194229 #

simplecto ◴[20 Nov 24 11:16 UTC] No.42192805[source]▶

>>42192651 #

Those are some really spicy opinions. It would seem that many search experts might not agree.

For example, David Tippet (formerly at OpenSearch, now at GitHub).

A great podcast with David Tippet and Nicolay Gerold entitled:

"BM25 is the workhorse of search; vectors are its visionary cousin"

https://www.youtube.com/watch?v=ENFW1uHsrLM

replies(2): >>42192855 #>>42193450 #

RA_Fisher ◴[20 Nov 24 12:46 UTC] No.42193450[source]▶

>>42192805 #

I’m sure search experts would disagree, because they’d be admitting that their technology is inferior to another. BM25 is the workhorse, no doubt, but it’s also not the best anymore. Vectors are a step toward learning models, but only a small mid-range step vs. an explicit model.

Search is a useful approach for computing learning models, but there’s a difference between the computational means and the model. For example, MIPS is a very useful search algo for computing learning models (but first the learning model has to be formulated).
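As a rough illustration of the MIPS (maximum inner product search) step mentioned above, here is a minimal brute-force sketch. It assumes the learning model has already produced a query vector and item vectors; the function name `mips` and the toy vectors are hypothetical, and an exhaustive scan like this stands in for the approximate indexes that larger systems typically use.

```python
def mips(query, items, k=1):
    """Return the indices of the k items with the largest inner product with `query`."""

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Exhaustive scan: score every item, then sort by inner product, highest first.
    ranked = sorted(range(len(items)), key=lambda i: dot(query, items[i]), reverse=True)
    return ranked[:k]


# Toy vectors, hypothetical and only for illustration.
items = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
top = mips([0.2, 1.0], items, k=2)
```

The search only finds the best-scoring items; what the vectors mean, and whether the inner product is the right score, is decided by the model that produced them.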

replies(3): >>42193880 #>>42194290 #>>42197352 #

softwaredoug ◴[20 Nov 24 14:42 UTC] No.42194290[source]▶

>>42193450 #

I don't know a lot of search practitioners who don't want to use the "new sexy" thing. Most of us do a fair amount of "resume driven development" so we can claim to be "AI Engineers" :)

replies(1): >>42195479 #

RA_Fisher ◴[20 Nov 24 16:24 UTC] No.42195479[source]▶

>>42194290 #

I don’t think it’s realistic to think that software engineers can pick up advanced statistical modeling on the job, unless they’re pairing with statisticians. There’s just too much background involved.

replies(2): >>42196352 #>>42197148 #

softwaredoug ◴[20 Nov 24 19:14 UTC] No.42197148[source]▶

>>42195479 #

The "search practitioners" I'm referring to are pretty uniformly ML engineers. They also work on feeds, recommendations, and adjacent information retrieval spaces, both generating L0 retrieval candidates and building higher layers of reranking, with learning to rank and other systems, toward whatever the system's goal is...

You can decide if you agree that most people are sufficiently statistically literate in that group of people. But some humility around statistics is probably far up there in what I personally interview for.

replies(1): >>42197737 #

RA_Fisher ◴[20 Nov 24 20:21 UTC] No.42197737[source]▶

>>42197148 #

For sure. There are ML folks with statistical learning backgrounds, but they tend to be relatively rare; physics and CS backgrounds are more common. They tend to view things the way you describe, as more procedural (e.g., learning to rank, minimizing distances) and less as statistical modeling. Humility around statistics is good, but statistical knowledge is still what's required to really level up these systems (I've built them as well).
