
176 points | Bogdanp | 1 comment
atombender ◴[] No.45676039[source]
> Thankfully the BM-25 model used in ranking is robust to this, as it relies on live data from the index itself.

I'm confused by this. TF-IDF incorporates the term frequency (the IDF part), which search engines precompute for the index as a whole. But so does BM25; its IDF formula is slightly different, but also relies on term frequencies. What's the difference?

replies(1): >>45679022 #
1. marginalia_nu ◴[] No.45679022[source]
The index has the most up-to-date term frequency information, but it is logistically inaccessible, and it's not really practical to interrogate it when extracting keywords (as you need this information for 100 billion terms), so a somewhat stale version is kept in memory instead and used in that process.

When searching, doing BM25, it is a lot more accessible, as you already fetch that information indirectly as part of looking up the document lists, and this is typically only done up to about a dozen times per query.
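
To make that concrete, here is a rough sketch of BM25 scoring at query time where each term's document frequency is simply the length of the document list that was fetched anyway. This is not the actual implementation: the names `bm25_idf`, `bm25_score`, `postings` and `doc_lengths` are made up for illustration, the in-memory dicts stand in for a real on-disk index, and the IDF is the common non-negative variant (the one Lucene uses), with the usual default `k1` and `b` values.

```python
import math


def bm25_idf(doc_freq: int, num_docs: int) -> float:
    """Non-negative BM25 IDF; doc_freq is the number of docs containing the term."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_terms, postings, doc_lengths, k1=1.2, b=0.75):
    """Score documents for a query.

    postings:    term -> {doc_id: term frequency in that doc}  (assumed layout)
    doc_lengths: doc_id -> document length in tokens           (assumed layout)
    """
    num_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / num_docs
    scores = {}
    for term in query_terms:
        # One lookup per query term; this is the "dozen times per query" part.
        doc_list = postings.get(term, {})
        if not doc_list:
            continue
        # The document frequency falls out of the lookup: it is the list length,
        # so it always reflects the live index rather than a stale table.
        idf = bm25_idf(len(doc_list), num_docs)
        for doc_id, tf in doc_list.items():
            norm = k1 * (1.0 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * tf * (k1 + 1.0) / (tf + norm)
    return scores


# Tiny hypothetical index: three documents, two terms.
postings = {"bm25": {1: 2, 2: 1}, "rare": {2: 1}}
doc_lengths = {1: 10, 2: 10, 3: 10}
scores = bm25_score(["bm25", "rare"], postings, doc_lengths)
```

Keyword extraction at indexing time has no such shortcut: it would need a document frequency for every term in every document being processed, which is why a precomputed (and therefore somewhat stale) table is the practical choice there.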