
548 points tifa2up | 1 comment
n_u ◴[] No.45646587[source]
> Reranking: the highest value 5 lines of code you'll add. The chunk ranking shifted a lot. More than you'd expect. Reranking can many times make up for a bad setup if you pass in enough chunks. We found the ideal reranker set-up to be 50 chunk input -> 15 output.

What is re-ranking in the context of RAG? Why not just show the code if it’s only 5 lines?

replies(1): >>45646678 #
tifa2up ◴[] No.45646678[source]
OP. Reranking uses a specialized model that takes the user query and a list of candidate results, then reorders the results by how relevant each one is to the query.

Here's sample code: https://docs.cohere.com/reference/rerank
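
Roughly, it looks like this with Cohere's Python SDK (the model name, top_n, and placeholder chunks below are illustrative; the linked docs are the authoritative reference):

    import cohere

    co = cohere.ClientV2(api_key="YOUR_API_KEY")  # placeholder key

    # Candidate chunks from the first-stage retrieval step.
    chunks = ["First candidate chunk...", "Second candidate chunk..."]

    # Pass in the candidates (e.g. 50) and keep only the best (e.g. 15).
    response = co.rerank(
        model="rerank-v3.5",  # illustrative model name; check the docs
        query="What is reranking in RAG?",
        documents=chunks,
        top_n=15,
    )

    for result in response.results:
        print(result.index, result.relevance_score)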

replies(1): >>45647377 #
yahoozoo ◴[] No.45647377[source]
What is the difference between reranking versus generating text embeddings and comparing with cosine similarity?
replies(5): >>45647756 #>>45648506 #>>45649932 #>>45652751 #>>45655281 #
PunchTornado ◴[] No.45655281[source]
The reranker is a cross-encoder that sees the docs and the query at the same time. What you normally do is generate embeddings ahead of time, independent of the query, calculate cosine similarity between them and the query, select the top-k chunks that match best, and only then use a reranker to sort them.

Embeddings are a lossy compression, so if you feed the chunks and the query to the model at the same time, the results are better. But you can't do that for your whole DB; that's why you filter with cosine similarity first.
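
To make the two stages concrete, here's a rough sketch of that pipeline with sentence-transformers (the model names are illustrative; any bi-encoder/cross-encoder pair works the same way):

    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    docs = [
        "Reranking reorders retrieved chunks by relevance to the query.",
        "Cosine similarity compares two independently computed embeddings.",
        "Paris is the capital of France.",
    ]
    query = "How does reranking work?"

    # Stage 1: bi-encoder. Doc embeddings are computed independently of the
    # query, so in practice they're precomputed for the whole DB.
    bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model
    doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]  # cheap cosine filter

    # Stage 2: cross-encoder. Sees query and doc together, so it's more
    # accurate, but too slow to run over every document in the DB.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative
    pairs = [(query, docs[hit["corpus_id"]]) for hit in hits]
    scores = reranker.predict(pairs)

    for score, (_, doc) in sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True):
        print(f"{score:.3f}  {doc}")

The cross-encoder only ever scores the handful of survivors from the cosine filter, which is the whole trick: cheap lossy search over everything, expensive exact scoring over a shortlist.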