Language Support for Marginalia Search

(www.marginalia.nu)

176 points Bogdanp | 1 comments | 21 Oct 25 06:48 UTC

atombender ◴[22 Oct 25 22:36 UTC] No.45676039[source]▶

>>45653143 (OP) #

> Thankfully the BM-25 model used in ranking is robust to this, as it relies on live data from the index itself.

I'm confused by this. TF-IDF incorporates the term frequency (the IDF part), which search engines precompute for the index as a whole. But so does BM25; its IDF formula is slightly different, but also relies on term frequencies. What's the difference?

replies(1): >>45679022 #

1. marginalia_nu ◴[23 Oct 25 07:02 UTC] No.45679022[source]▶

The index has the most up-to-date term frequency information, but it is logistically inaccessible, and it's not really practical to interrogate it when extracting keywords (as you need this information for 100 billion terms). So a somewhat stale version is kept in memory instead and used in that process.

When searching with BM25, it is a lot more accessible, as you already fetch that information indirectly as part of looking up the document lists, and this is typically only done up to about a dozen times per query.
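
To illustrate the query-time point, here is a rough sketch of BM25 where the document frequency comes straight out of the posting list lookup, so no separate frequency table is needed. This is not Marginalia's actual code: the `index` layout (term to `{doc_id: tf}`), the `doc_lengths` map, the function name, and the `k1`/`b` defaults are all assumptions for illustration, and the IDF uses the common "+1 inside the log" variant.

```python
import math
from collections import Counter


def bm25_scores(query_terms, index, doc_lengths, k1=1.2, b=0.75):
    """Score documents with BM25 using a toy in-memory index.

    index:       hypothetical layout, term -> {doc_id: term frequency in doc}
    doc_lengths: doc_id -> document length in tokens
    """
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = Counter()

    for term in query_terms:
        # One lookup per query term; this is the "document list".
        postings = index.get(term, {})
        # The document frequency is just the length of that list,
        # so it is obtained as a side effect of the lookup.
        df = len(postings)
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

        for doc_id, tf in postings.items():
            # Length normalisation: longer documents are penalised.
            norm = k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)

    return dict(scores)


if __name__ == "__main__":
    index = {"cat": {1: 2, 2: 1}, "dog": {2: 3}}
    doc_lengths = {1: 10, 2: 20, 3: 15}
    print(bm25_scores(["cat", "dog"], index, doc_lengths))
```

The number of lookups scales with the number of query terms (a handful), whereas keyword extraction would need frequencies for every term in every document being processed, which is presumably why a stale in-memory copy is the practical choice there.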