
548 points tifa2up | 2 comments
n_u ◴[] No.45646587[source]
> Reranking: the highest value 5 lines of code you'll add. The chunk ranking shifted a lot. More than you'd expect. Reranking can many times make up for a bad setup if you pass in enough chunks. We found the ideal reranker set-up to be 50 chunk input -> 15 output.

What is re-ranking in the context of RAG? Why not just show the code if it’s only 5 lines?

replies(1): >>45646678 #
tifa2up ◴[] No.45646678[source]
OP here. Reranking uses a specialized model that takes the user query and a list of candidate results, then re-orders them based on which ones are most relevant to the query.

Here's sample code: https://docs.cohere.com/reference/rerank
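For the curious, the linked example boils down to roughly this (a minimal sketch, assuming Cohere's Python SDK; the exact model name and response fields vary by SDK version, and the 50-in/15-out numbers are just the setup described in the post):

    import cohere  # pip install cohere

    co = cohere.Client("YOUR_API_KEY")

    # candidate_chunks: the ~50 chunks your retriever returned for the query
    candidate_chunks = ["chunk one text ...", "chunk two text ..."]

    response = co.rerank(
        model="rerank-english-v3.0",   # model name depends on SDK/account
        query="How do I rotate an API key?",
        documents=candidate_chunks,
        top_n=15,                      # keep only the 15 most relevant chunks
    )

    # Results come back sorted by relevance; r.index points into candidate_chunks.
    top_chunks = [candidate_chunks[r.index] for r in response.results]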

replies(1): >>45647377 #
yahoozoo ◴[] No.45647377[source]
What is the difference between reranking versus generating text embeddings and comparing with cosine similarity?
replies(5): >>45647756 #>>45648506 #>>45649932 #>>45652751 #>>45655281 #
1. derefr ◴[] No.45648506[source]
My understanding:

If you generate embeddings (of the query, and of the candidate documents) and compare them for similarity, you're essentially asking whether the documents "look like the question."

If you get an LLM to evaluate how well each candidate document follows from the query, you're asking whether the documents "look like an answer to the question."

An ideal candidate chunk/document from a cosine-similarity perspective would be one that perfectly restates what the user said — whether or not that document actually helps the user. That can be made to work if you're e.g. indexing a knowledge base where every KB document is SEO-optimized to embed all the pertinent questions a user might ask that "should lead" to that KB document. But for such documents, even matching the user's query text against a "dumb" tf-idf index will surface them, so LLMs aren't gaining you any ground here. (As is evident from the fact that webpages SEO-optimized in this way could already be easily surfaced by old-school search engines if you typed such a query into them.)
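(For concreteness, the cosine-similarity side of this looks roughly like the sketch below; embed() is a hypothetical stand-in for whatever embedding model you use, not a specific library call.)

    import numpy as np

    def cosine_sim(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rank_by_similarity(query, chunks, embed, k=15):
        # Scores each chunk by how much it "looks like the question".
        q_vec = embed(query)
        scored = sorted(chunks, key=lambda c: cosine_sim(q_vec, embed(c)), reverse=True)
        return scored[:k]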

An ideal candidate chunk/document from a re-ranking LLM's perspective, would be one that an instruction-following LLM (with the whole corpus in its context) would spit out as a response, if it were prompted with the user's query. E.g. if the user asks a question that could be answered with data, a document containing that data would rank highly. And that's exactly the kind of documents we'd like "semantic search" to surface.

replies(1): >>45651200 #
2. Valk3_ ◴[] No.45651200[source]
I've been thinking about the problem of what to do if the answer to a question is very different from the question itself in embedding space. The KB method sounds interesting and not something I'd thought about; you sort of work on the "document side", I guess. I've also heard of HyDE, which works on the query side: you generate a hypothetical answer to the user query and look for documents that are similar to that answer, if I've understood it correctly.
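A rough sketch of the HyDE idea, for reference (llm_generate and embed are hypothetical stand-ins for whatever generation and embedding models are in use):

    import numpy as np

    def hyde_retrieve(query, chunks, llm_generate, embed, k=10):
        # 1. Generate a hypothetical answer; it may be wrong, it only
        #    needs to *look like* an answer to the question.
        hypothetical = llm_generate(
            "Write a short passage that answers this question:\n" + query
        )
        # 2. Embed the hypothetical answer instead of the raw query.
        q_vec = np.asarray(embed(hypothetical))
        # 3. Return the chunks whose embeddings are closest to that answer.
        def cos(a, b):
            a, b = np.asarray(a), np.asarray(b)
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        return sorted(chunks, key=lambda c: cos(q_vec, embed(c)), reverse=True)[:k]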