Understanding the BM25 full text search algorithm

(emschwartz.me)

305 points rrampage | 2 comments | 20 Nov 24 03:43 UTC


RA_Fisher [20 Nov 24 10:47 UTC] No.42192651

>>42190650 (OP) #

BM25 is an ancient algo developed in the 1970s. It’s basically a crappy statistical model and statisticians can do far better today. Search is strictly dominated by learning (that yes, can use search as an input). Not many folks realize that yet, and / or are incentivized to keep the old tech going as long as possible, but market pressures will change that.

replies(4): >>42192735 #>>42192805 #>>42192828 #>>42194229 #

1. netdur [20 Nov 24 11:03 UTC] No.42192735

>>42192651 #

While BM25 did emerge from earlier work in the 1970s and 1980s (specifically building on the probabilistic ranking principle), I'm curious about your perspective on a few things:

What specific modern statistical approaches are you seeing as superior replacements for BM25 in practical applications? I'm particularly interested in how they handle edge cases like rare terms and document length normalization that BM25 was explicitly designed to address.
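For readers following along, the two behaviours mentioned here (rare terms and document length normalization) can be sketched in a few lines. This is only an illustrative sketch, not code from the linked article: the function name `bm25_scores`, the defaults `k1=1.2` and `b=0.75`, and the "+1" smoothed IDF (the variant commonly associated with Lucene) are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms.

    Assumes a non-empty corpus with at least one non-empty document.
    k1 and b are conventional defaults, not values from the thread.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: documents longer than average are penalized.
        norm = 1.0 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms (low df) receive a larger IDF weight.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates as tf grows, controlled by k1.
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores


docs = [
    "the cat sat".split(),
    "the cat sat on the mat with the other cat and the dog".split(),
    "the dog barked".split(),
]
print(bm25_scores(["barked"], docs))
print(bm25_scores(["sat"], docs))
```

Setting `b=0.0` switches length normalization off, so two documents with the same term frequency score the same regardless of length.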

While I agree learning-based approaches have shown impressive results, could you elaborate on what you mean by search being "strictly dominated" by learning methods? Are you referring to specific benchmarks or real-world applications?

replies(1): >>42193439 #

2. RA_Fisher ◴[20 Nov 24 12:43 UTC] No.42193439[source]▶

>>42192735 (TP) #

BM25 can serve as the starting point for a statistical learning model, which can then be more readily built on. A key advantage is that one gains a systematic way to reduce edge cases, instead of handling only the couple that are large enough to be noticeable.
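One way to read "BM25 as a starting point" is to treat the BM25 score as a single input feature to a learned relevance model. The sketch below is a guess at that idea, not the commenter's method: the logistic regression, the second feature `is_recent`, and the tiny training set are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression with full-batch gradient descent."""
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Estimated probability that a (query, document) pair is relevant."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical toy data: each row is (bm25_score, is_recent);
# the label is 1 if the result was judged relevant.
X = [(2.1, 0.0), (1.8, 1.0), (0.3, 0.0), (0.2, 1.0), (1.5, 1.0), (0.4, 0.0)]
y = [1, 1, 0, 0, 1, 0]

weights, bias = train_logistic(X, y)
print(predict(weights, bias, (3.0, 0.0)))
print(predict(weights, bias, (0.1, 0.0)))
```

In this framing, more signals can be added as extra columns and the model reweights BM25 against them, rather than each edge case being patched by hand.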
