548 points tifa2up | 9 comments
1. n_u ◴[] No.45646587[source]
> Reranking: the highest value 5 lines of code you'll add. The chunk ranking shifted a lot. More than you'd expect. Reranking can many times make up for a bad setup if you pass in enough chunks. We found the ideal reranker set-up to be 50 chunk input -> 15 output.

What is re-ranking in the context of RAG? Why not just show the code if it’s only 5 lines?

replies(1): >>45646678 #
2. tifa2up ◴[] No.45646678[source]
OP. Reranking is done by a specialized LLM that takes the user query and a list of candidate results, then reorders them based on which ones are most relevant to the query.

Here's sample code: https://docs.cohere.com/reference/rerank
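
For illustration, a minimal sketch of what those few lines typically look like with Cohere's Python SDK (the model name and key placeholder are assumptions; see the linked docs for current parameters):

    import cohere

    co = cohere.Client("YOUR_API_KEY")  # placeholder key
    docs = ["chunk one ...", "chunk two ...", "chunk three ..."]
    resp = co.rerank(
        model="rerank-english-v3.0",   # model name is an assumption
        query="how do I rotate my API key?",
        documents=docs,
        top_n=2,
    )
    for r in resp.results:
        print(r.index, r.relevance_score)  # indices into docs, most relevant first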

replies(1): >>45647377 #
3. yahoozoo ◴[] No.45647377[source]
What is the difference between reranking versus generating text embeddings and comparing with cosine similarity?
replies(5): >>45647756 #>>45648506 #>>45649932 #>>45652751 #>>45655281 #
4. tifa2up ◴[] No.45647756{3}[source]
Text similarity finds items that closely match the query. Reranking may select items that are less semantically "similar" but more relevant to the query.
5. derefr ◴[] No.45648506{3}[source]
My understanding:

If you generate embeddings (of the query, and of the candidate documents) and compare them for similarity, you're essentially asking whether the documents "look like the question."

If you get an LLM to evaluate how well each candidate document follows from the query, you're asking whether the documents "look like an answer to the question."

An ideal candidate chunk/document from a cosine-similarity perspective would be one that perfectly restates what the user said — whether or not that document actually helps the user. That can be made to work if you're, e.g., indexing a knowledge base where every KB document is SEO-optimized to embed all the pertinent questions a user might ask that "should lead" to that document. But for such documents, even matching the user's query text against a "dumb" tf-idf index will surface them; LLMs aren't gaining you any ground here. (As is evident from the fact that webpages SEO-optimized this way could already be surfaced easily by old-school search engines if you typed such a query into them.)

An ideal candidate chunk/document from a re-ranking LLM's perspective would be one that an instruction-following LLM (with the whole corpus in its context) would spit out as a response if it were prompted with the user's query. E.g. if the user asks a question that could be answered with data, a document containing that data would rank highly. And that's exactly the kind of document we'd like "semantic search" to surface.
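
A small sketch of that difference using sentence-transformers (model names are assumptions, not a recommendation): the bi-encoder embeds query and document independently and compares them with cosine similarity, while the cross-encoder (reranker) reads them together and scores how well the document answers the query.

    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    query = "How do I reset my password?"
    doc = "Go to Settings > Security and click 'Change password'."

    # Bi-encoder: embed each text separately, then compare.
    bi = SentenceTransformer("all-MiniLM-L6-v2")  # model name is an assumption
    q_emb, d_emb = bi.encode([query, doc])
    print("cosine similarity:", util.cos_sim(q_emb, d_emb).item())

    # Cross-encoder: sees query and document together, scores relevance directly.
    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumption
    print("rerank score:", ce.predict([(query, doc)])[0])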

replies(1): >>45651200 #
6. osigurdson ◴[] No.45649932{3}[source]
Because LLMs are a lot smarter than embeddings and basic math. Think of the vector / lexical search as the first approximation.
7. Valk3_ ◴[] No.45651200{4}[source]
I've been thinking about the problem of what to do when the answer to a question is very different from the question itself in embedding space. The KB method sounds interesting and not something I had thought about; you sort of work on the "document side", I guess. I've also heard of HyDE, which works on the query side: you generate a hypothetical answer to the user query and look for documents that are similar to that answer, if I've understood it correctly.
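
A rough sketch of HyDE as described, assuming the OpenAI Python client; the model names are assumptions and the vector search itself is left out:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    question = "How do I rotate my API keys?"

    # 1. Ask an LLM for a hypothetical answer; it may be wrong, only its wording matters.
    draft = client.chat.completions.create(
        model="gpt-4o-mini",  # model name is an assumption
        messages=[{"role": "user",
                   "content": f"Write a short passage answering: {question}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical answer instead of the raw question,
    #    then run the usual vector search with this embedding.
    hyde_emb = client.embeddings.create(
        model="text-embedding-3-small",  # assumption
        input=draft,
    ).data[0].embedding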
8. hawthorns ◴[] No.45652751{3}[source]
The responses missed the main point. Re-ranking is just a mini-LLM (kept small for latency/cost reasons) that does a double check. The embedding model finds the closest M documents in R^N space; the re-ranker picks the top K documents from those M. In theory, if we just used Gemini 2.5 Pro or GPT-5 as the re-ranker, the performance would be even better than whatever small re-ranker people choose to use.
9. PunchTornado ◴[] No.45655281{3}[source]
The reranker is a cross-encoder that sees the docs and the query at the same time. What you normally do is generate embeddings ahead of time, independent of the prompt, calculate cosine similarity with the prompt, select the top-k chunks that best match it, and only then use a reranker to sort them.

Embeddings are a lossy compression, so if you feed the chunks together with the prompt, the results are better. But you can't do this for your whole DB, which is why you filter with cosine similarity first.
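
Putting the thread together, a minimal sketch of that two-stage setup (model names and the toy corpus are assumptions; the 50-in/15-out numbers follow the original post):

    import numpy as np
    from sentence_transformers import SentenceTransformer, CrossEncoder

    chunks = ["chunk 1 text ...", "chunk 2 text ...", "chunk 3 text ..."]  # your pre-chunked corpus
    bi = SentenceTransformer("all-MiniLM-L6-v2")               # model name is an assumption
    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumption

    # Done once, ahead of time: embed every chunk, independent of any prompt.
    chunk_embs = bi.encode(chunks, normalize_embeddings=True)

    def retrieve(query, n_candidates=50, n_final=15):
        # Stage 1: cheap cosine-similarity filter over the whole corpus.
        q_emb = bi.encode(query, normalize_embeddings=True)
        candidates = np.argsort(chunk_embs @ q_emb)[::-1][:n_candidates]
        # Stage 2: the cross-encoder reads query + chunk together and re-sorts.
        scores = ce.predict([(query, chunks[i]) for i in candidates])
        keep = candidates[np.argsort(scores)[::-1][:n_final]]
        return [chunks[i] for i in keep]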