Understanding the BM25 full text search algorithm

(emschwartz.me)

305 points rrampage | 4 comments | 20 Nov 24 03:43 UTC


RA_Fisher ◴[20 Nov 24 10:47 UTC] No.42192651[source]▶

>>42190650 (OP) #

BM25 is an ancient algo developed in the 1970s. It’s basically a crappy statistical model and statisticians can do far better today. Search is strictly dominated by learning (that yes, can use search as an input). Not many folks realize that yet, and / or are incentivized to keep the old tech going as long as possible, but market pressures will change that.

replies(4): >>42192735 #>>42192805 #>>42192828 #>>42194229 #

1. mrbungie ◴[20 Nov 24 11:20 UTC] No.42192828[source]▶

>>42192651 #

Are those the same market pressures that made Google discard or repurpose a lot of working old search tech for new shiny ML-based search tech? The same tech that makes you add "+reddit" in every search so you can evade the adversarial SEO war?

PS: Ancient != bad. I don't know what weird technologist take worries about the age of an invention/discovery of a technique rather than its usefulness.

replies(1): >>42193425 #

2. RA_Fisher ◴[20 Nov 24 12:42 UTC] No.42193425[source]▶

>>42192828 (TP) #

Google’s come a long way since PageRank + terms. Ancient doesn’t mean bad, but usually it means outdated, and that’s the case here. Search algos are subsumed by learning models; our species can do better now.

replies(1): >>42193690 #

3. mbreese ◴[20 Nov 24 13:18 UTC] No.42193690[source]▶

>>42193425 #

So, I’m not entirely sure if I follow you here… How would one use a language model to find a document out of a corpus of existing documents? As opposed to finding an answer to a question, trained on documents, which I can see. I mean answering a query like “find the report containing X”?

I see search as encompassing at least two separate, but related, domains: information gathering/seeking (answering a question) and information retrieval (finding the best matching document). I’m curious how LLMs can help with the latter.

replies(1): >>42194869 #

4. ordersofmag ◴[20 Nov 24 15:40 UTC] No.42194869{3}[source]▶

>>42193690 #

That's the 'vector search' people are talking about in this discussion. Use the LLM to generate an embedding vector that represents the 'meaning' of your query. Do the same for all the documents (or, better, for chunks of each document). Find the document vector that's closest to your query vector and you have a document whose 'meaning' is similar to your query. Obviously that's just a starting point. And lots of folks are doing hybrid search, where they combine BM25 with some sort of vector search (e.g. run them in parallel and combine the results, or run BM25 first and then use vector search to rerank the top results).
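To make the "BM25 first, then rerank with vectors" variant concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: `toy_embed` is a hashed character-trigram vector standing in for a real embedding model (it captures spelling overlap, not meaning), and the function names, `k1`/`b` defaults and `top_k` are made up for this example rather than taken from any particular library.

```python
import math
import zlib
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real systems also stem, strip punctuation, etc.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def toy_embed(text, dim=64):
    """Hypothetical stand-in for an embedding model: hashed character trigrams."""
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query, docs, top_k=3):
    """Take the top_k BM25 hits, then reorder them by vector similarity."""
    lexical = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: lexical[i], reverse=True)[:top_k]
    query_vec = toy_embed(query)
    return sorted(
        candidates,
        key=lambda i: cosine(query_vec, toy_embed(docs[i])),
        reverse=True,
    )
```

Calling `hybrid_search("quarterly sales report", docs, top_k=2)` returns the indices of the two best BM25 matches, reordered by cosine similarity. With a real embedding model swapped in for `toy_embed`, the rerank step is where semantically related documents that share few exact terms would get their chance, provided BM25 let them into the candidate set at all.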
