manishsharan
Thanks for sharing. TIL about rerankers.

Chunking strategy is a big issue. I got acceptable results by shoving large texts into Gemini Flash and having it summarize and extract chunks, instead of using any of the text splitters I tried. I use the contextual retrieval method published by Anthropic (https://www.anthropic.com/engineering/contextual-retrieval), i.e. each chunk is embedded together with a summary that situates it in the full document.
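
A minimal sketch of what that contextualizing step can look like in Clojure (not exact code; it assumes clj-http and cheshire, and the model name and prompt wording are illustrative, so check them against Gemini's current REST docs):

    (ns rag.contextual-chunks
      (:require [clj-http.client :as http]
                [cheshire.core :as json]))

    (def gemini-url ;; model name illustrative; verify against current docs
      "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent")

    (defn call-gemini
      "POST a prompt to the Gemini REST API; return the first candidate's text."
      [api-key prompt]
      (-> (http/post (str gemini-url "?key=" api-key)
                     {:content-type :json
                      :body (json/generate-string
                             {:contents [{:parts [{:text prompt}]}]})})
          :body
          (json/parse-string true)
          (get-in [:candidates 0 :content :parts 0 :text])))

    (defn contextualize-chunk
      "Ask the model to situate `chunk` within `full-doc`, then prepend that
       context to the chunk text before it is embedded."
      [api-key full-doc chunk]
      (let [ctx (call-gemini
                 api-key
                 (str "<document>\n" full-doc "\n</document>\n"
                      "<chunk>\n" chunk "\n</chunk>\n"
                      "Write a short context that situates this chunk within "
                      "the overall document, to improve search retrieval."))]
        (str ctx "\n\n" chunk)))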

I also created a tool that lets the LLM run vector searches on its own.
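
In essence that is a function-calling wrapper: the model gets a vector-search tool definition, and the chat loop executes its calls against the vector store. A rough sketch, with the tool shape modeled on Anthropic's Messages API (other providers use a similar JSON-schema description); `embed` and `knn-search` are placeholders for your embedding call and store lookup:

    (def vector-search-tool
      ;; Tool definition in the shape Anthropic's Messages API expects.
      {:name "vector_search"
       :description "Search the document store for passages relevant to a query."
       :input_schema {:type "object"
                      :properties {:query {:type "string"}
                                   :top_k {:type "integer"}}
                      :required ["query"]}})

    (defn handle-tool-use
      "Execute a tool_use block emitted by the model. `embed` and `knn-search`
       are placeholders for the embedding call and vector-store lookup."
      [embed knn-search {:keys [name input]}]
      (when (= name "vector_search")
        (let [{:keys [query top_k] :or {top_k 5}} input]
          (knn-search (embed query) top_k))))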

I do not use LangChain or Python; I use Clojure and the LLMs' REST APIs.

esafak
Have you measured your latency, and how sensitive are you to it?

manishsharan
>> Have you measured your latency, and how sensitive are you to it?

Not sensitive to latency at all. My users would rather have well-researched answers than poor ones.

Also, I use batch-mode APIs for the chunking step; it is so much cheaper.
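
As one example of what a batch submission can look like (this uses Anthropic's Message Batches API purely for illustration; the request shape follows their docs, and the model name is a placeholder), queue one summarize-and-chunk request per document in a single POST and fetch the results later, trading latency for a lower price:

    ;; Reuses clj-http and cheshire from the sketch above.
    (require '[clj-http.client :as http]
             '[cheshire.core :as json])

    (defn submit-chunking-batch
      "Queue one summarize-and-chunk request per document via Anthropic's
       Message Batches API. Results are fetched later."
      [api-key docs]
      (http/post "https://api.anthropic.com/v1/messages/batches"
                 {:headers {"x-api-key" api-key
                            "anthropic-version" "2023-06-01"}
                  :content-type :json
                  :body (json/generate-string
                         {:requests
                          (map-indexed
                           (fn [i doc]
                             {:custom_id (str "doc-" i)
                              :params {:model "claude-3-5-haiku-latest" ;; illustrative
                                       :max_tokens 2048
                                       :messages [{:role "user"
                                                   :content (str "Summarize this document and "
                                                                 "split it into retrieval chunks:\n"
                                                                 doc)}]}})
                           docs)})}))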