Most active commenters

tifa2up(9)
pamelafox(5)
esafak(4)
osigurdson(4)
cipherself(4)
leetharris(3)
hatmanstack(3)
pietz(3)
alansaber(3)

Popular/hot comments

>>45646303 #
>>45647377 #
>>45648705 #
>>45646532 #
>>45647637 #
>>45647032 #

Production RAG: what I learned from processing 5M+ documents

(blog.abdellatif.io)

1. manishsharan ◴[20 Oct 25 16:30 UTC] No.45645772[source]▶

>>45645349 (OP) #

Thanks for sharing. TIL about rerankers.

Chunking strategy is a big issue. I found acceptable results by shoving large texts to to gemini flash and have it summarize and extract chunks instead of whatever text splitter I tried. I use the method published by Anthropic https://www.anthropic.com/engineering/contextual-retrieval i.e. include full summary along with chunks for each embedding.

I also created a tool to enable the LLM to do vector search on its own .

I do not use Langchain or python.. I use Clojure+ LLMs' REST APIs.

replies(2): >>45645995 #>>45692691 #

2. jascha_eng ◴[20 Oct 25 16:40 UTC] No.45645905[source]▶

>>45645349 (OP) #

I have a RAG setup that doesn't work on documents but other data points that we use for generation (the original data is call recordings but it is heavily processed to just a few text chunks). Instead of a reranker model we do vector search and then simply ask GPT-5 in an extra call which of the results is the most relevant to the input question. Is there an advantage to actual reranker models rather than using a generic LLM?

replies(2): >>45645956 #>>45649058 #

3. tifa2up ◴[20 Oct 25 16:43 UTC] No.45645956[source]▶

>>45645905 #

OP here. rerankers are finetuned small models, they're cheap and very fast compared to an additional GPT-5 call.

replies(1): >>45646428 #

4. esafak ◴[20 Oct 25 16:44 UTC] No.45645965[source]▶

>>45645349 (OP) #

They say the chunker is the most important part, but theirs looks rudimentary: https://github.com/agentset-ai/agentset/blob/main/packages/e...

That is, there is nothing here that one could not easily write without a library.

replies(2): >>45646130 #>>45646451 #

5. esafak ◴[20 Oct 25 16:46 UTC] No.45645995[source]▶

>>45645772 #

Have you measured your latency, and how sensitive are you to it?

replies(1): >>45646290 #

6. alexchantavy ◴[20 Oct 25 16:47 UTC] No.45646021[source]▶

>>45645349 (OP) #

> What moved the needle: Query Generation

What does query generation mean in this context, it’s probably not SQL queries right?

replies(2): >>45646127 #>>45646638 #

7. daemonologist ◴[20 Oct 25 16:52 UTC] No.45646127[source]▶

>>45646021 #

It's described in the remainder of the point - they use an LLM to generate additional search queries, either rephrasings of the user's query or bringing additional context from the chat history.

replies(1): >>45646598 #

8. tifa2up ◴[20 Oct 25 16:52 UTC] No.45646130[source]▶

>>45645965 #

OP here. We've been working on agentset.ai full-time for 2 months. The product now gets you something working quite well out of the box. Better than most people with no experience in RAG (I'd say so with confidence).

Ingestion + Agentic Search are two areas that we're focused on in the short term.

9. nextworddev ◴[20 Oct 25 17:01 UTC] No.45646252[source]▶

>>45645349 (OP) #

Exactly what kind of processing was done? Your pipeline is a function of the use case, lest you overengineer…

10. js98 ◴[20 Oct 25 17:01 UTC] No.45646254[source]▶

>>45645349 (OP) #

Similar writeup I did about 1.5 years ago for processing millions of (technical) pages for RAG. Lots has stayed the same it seems

https://jakobs.dev/learnings-ingesting-millions-pages-rag-az...

replies(1): >>45647183 #

11. daemonologist ◴[20 Oct 25 17:02 UTC] No.45646275[source]▶

>>45645349 (OP) #

I concur:

The big LLM-based rerankers (e.g. Qwen3-reranker) are what you always wanted your cross-encoder to be, and I highly recommend giving them a try. Unfortunately they're also quite computationally expensive.

Your metadata/tabular data often contains basic facts that a human takes for granted, but which aren't repeated in every text chunk - injecting it can help a lot in making the end model seem less clueless.

The point about queries that don't work with simple RAG (like "summarize the most recent twenty documents") is very important to keep in mind. We made our UI very search-oriented and deemphasized the chat, to try to communicate to users that search is what's happening under the hood - the model only sees what you see.

replies(2): >>45646877 #>>45647595 #

12. manishsharan ◴[20 Oct 25 17:04 UTC] No.45646290{3}[source]▶

>>45645995 #

>> Have you measured your latency, and how sensitive are you to it?

Not sensitive to latency at all. My users would rather have well researched answers than poor answers.

Also, I use batch mode APIs for chunking .. it is so much cheaper.

13. leetharris ◴[20 Oct 25 17:05 UTC] No.45646303[source]▶

>>45645349 (OP) #

Embedding based RAG will always just be OK at best. It is useful for little parts of a chain or tech demos, but in real life use it will always falter.

replies(6): >>45646470 #>>45646482 #>>45646495 #>>45646758 #>>45646892 #>>45656450 #

14. jascha_eng ◴[20 Oct 25 17:14 UTC] No.45646428{3}[source]▶

>>45645956 #

It's an async process in my case (custom deep research like) so speed is not that critical

15. teraflop ◴[20 Oct 25 17:16 UTC] No.45646451[source]▶

>>45645965 #

I'm not sure there is a chunker in this repo. The file you linked certainly doesn't seem to perform any chunking, it just defines a data model for chunks.

The only place I see that actually operates on chunks does so by fetching them from Redis, and AFAICT nothing in the repo actually writes to Redis, so I assume the chunker is elsewhere.

https://github.com/agentset-ai/agentset/blob/main/packages/j...

16. sgt ◴[20 Oct 25 17:17 UTC] No.45646470[source]▶

>>45646303 #

What do you recommend? Query generation?

17. esafak ◴[20 Oct 25 17:19 UTC] No.45646482[source]▶

>>45646303 #

Compared with what?

replies(1): >>45647936 #

18. charcircuit ◴[20 Oct 25 17:19 UTC] No.45646495[source]▶

>>45646303 #

Most of my ChatGPT queries use RAG (based on the query ChatGPT will decide if it needs to search the web) to get up to date information about the world. In reality life it's effective and it's why every large provider supports it.

19. mediaman ◴[20 Oct 25 17:22 UTC] No.45646532[source]▶

>>45645349 (OP) #

The point about synthetic query generation is good. We found users had very poor queries, so we initially had the LLM generate synthetic queries. But then we found that the results could vary widely based on the specific synthetic query it generated, so we had it create three variants (all in one LLM call, so that you can prompt it to generate a wide variety, instead of getting three very similar ones back), do parallel search, and then use reciprocal rank fusion to combine the list into a set of broadly strong performers. For the searches we use hybrid dense + sparse bm25, since dense doesn't work well for technical words.

This, combined with a subsequent reranker, basically eliminated any of our issues on search.

replies(4): >>45647148 #>>45647160 #>>45647255 #>>45649007 #

20. n_u ◴[20 Oct 25 17:26 UTC] No.45646587[source]▶

>>45645349 (OP) #

> Reranking: the highest value 5 lines of code you'll add. The chunk ranking shifted a lot. More than you'd expect. Reranking can many times make up for a bad setup if you pass in enough chunks. We found the ideal reranker set-up to be 50 chunk input -> 15 output.

What is re-ranking in the context of RAG? Why not just show the code if it’s only 5 lines?

replies(1): >>45646678 #

21. goleary ◴[20 Oct 25 17:28 UTC] No.45646598{3}[source]▶

>>45646127 #

Here's an interesting read on the evolution beyond RAG: https://www.nicolasbustamante.com/p/the-rag-obituary-killed-...

One of the key features in Claude Code is "Agentic Search" aka using (rip)grep/ls to search a codebase without any of the overhead of RAG.

Sounds like even RAG approaches use a similar approach (Query Generation).

replies(1): >>45647758 #

22. andreasgl ◴[20 Oct 25 17:32 UTC] No.45646638[source]▶

>>45646021 #

I think they mean query expansion: https://en.wikipedia.org/wiki/Query_expansion

23. tifa2up ◴[20 Oct 25 17:34 UTC] No.45646678[source]▶

>>45646587 #

OP. Reranking is a specialized LLM that takes the user query, and a list of candidate results, then re-sets the order based on which ones are more relevant to the query.

Here's sample code: https://docs.cohere.com/reference/rerank

replies(1): >>45647377 #

24. 383toast ◴[20 Oct 25 17:39 UTC] No.45646734[source]▶

>>45645349 (OP) #

They should've tested other embedding models, there are better ones than openai's (and cheaper)

replies(1): >>45646823 #

25. underlines ◴[20 Oct 25 17:40 UTC] No.45646758[source]▶

>>45646303 #

rag will be pronounced differently ad again and again. it has its use cases. we moved to agentic search having rag as a tool while other retrieval strategies we added use real time search in the sources. often skipping ingested and chunked soueces. large changes next windows allow for putting almost whole documents into one request.

26. prettyblocks ◴[20 Oct 25 17:44 UTC] No.45646823[source]▶

>>45646734 #

Which do you suggest?

replies(2): >>45646987 #>>45647899 #

27. thethimble ◴[20 Oct 25 17:49 UTC] No.45646877[source]▶

>>45646275 #

I wish there was more info on the article about actual customer usage - particularly whether it improved process efficiency. It's great to focus on the technical aspects of system optimization but unless this translates to tangible business value it's all just hype.

28. phillipcarter ◴[20 Oct 25 17:50 UTC] No.45646892[source]▶

>>45646303 #

Not necessarily? It's been the basis of one of the major ways people would query their data since 2023 on a product I worked on: https://www.honeycomb.io/blog/introducing-query-assistant

The difference is this feature explicitly isn't designed to do a whole lot, which is still the best way to build most LLM-based products and sandwich it between non-LLM stuff.

29. roze_sha ◴[20 Oct 25 17:57 UTC] No.45646987{3}[source]▶

>>45646823 #

https://huggingface.co/spaces/mteb/leaderboard

replies(2): >>45647106 #>>45649118 #

30. hatmanstack ◴[20 Oct 25 18:00 UTC] No.45647032[source]▶

>>45645349 (OP) #

Not here to schlep for AWS but S3 Vectors is hands down the SOTA here. That combined with a Bedrock Knowledge Base to handle Discovery/Rebalance tasks makes for the simplest implementation on the Market.

Once Bedrock KB backed by S3 Vectors is released from Beta it'll eat everybody's lunch.

replies(3): >>45647181 #>>45650963 #>>45665062 #

31. 383toast ◴[20 Oct 25 18:06 UTC] No.45647106{4}[source]▶

>>45646987 #

yep

32. avereveard ◴[20 Oct 25 18:10 UTC] No.45647148[source]▶

>>45646532 #

final tip is to also feed the interpretation of the user search to the user on the other side, so he can check if the llm understanding was correct.

33. deepsquirrelnet ◴[20 Oct 25 18:11 UTC] No.45647160[source]▶

>>45646532 #

> For the searches we use hybrid dense + sparse bm25, since dense doesn't work well for technical words.

One thing I’m always curious about is if you could simplify this and get good/better results using SPLADE. The v3 models look really good and seem to provide a good balance of semantic and lexical retrieval.

34. arcanemachiner ◴[20 Oct 25 18:13 UTC] No.45647181[source]▶

>>45647032 #

Shill, not schlep.

I'm correcting you less out of pedantry, and more because I find the correct term to be funny.

replies(2): >>45647286 #>>45649697 #

35. winstonp ◴[20 Oct 25 18:13 UTC] No.45647183[source]▶

>>45646254 #

I also built a RAG system about a year back for technical search, everything seems the same!

36. pietz ◴[20 Oct 25 18:16 UTC] No.45647213[source]▶

>>45645349 (OP) #

I find it interesting that so many services and tools were investigated except for embedding models. I would have thought that's one of the biggest levers.

replies(2): >>45647238 #>>45656990 #

37. Trias11 ◴[20 Oct 25 18:18 UTC] No.45647238[source]▶

>>45647213 #

they just grabbed the better one (3-large) right off the bat. 6x cost to 3-small, but it's still tiny.

replies(1): >>45649644 #

38. siva7 ◴[20 Oct 25 18:20 UTC] No.45647255[source]▶

>>45646532 #

Boy, that should not be the concern of the end user (developer) but those implementing RAG solutions as a service at Amazon, Microsoft, Openai and so on.

replies(1): >>45648705 #

39. hatmanstack ◴[20 Oct 25 18:22 UTC] No.45647286{3}[source]▶

>>45647181 #

I feel like I'm schelpin' through these comments, it's all mishigas

replies(1): >>45647411 #

40. yahoozoo ◴[20 Oct 25 18:29 UTC] No.45647377{3}[source]▶

>>45646678 #

What is the difference between reranking versus generating text embeddings and comparing with cosine similarity?

replies(5): >>45647756 #>>45648506 #>>45649932 #>>45652751 #>>45655281 #

41. esafak ◴[20 Oct 25 18:31 UTC] No.45647411{4}[source]▶

>>45647286 #

You feel like a schlemiel, perhaps?

replies(1): >>45647588 #

42. hatmanstack ◴[20 Oct 25 18:46 UTC] No.45647588{5}[source]▶

>>45647411 #

more a schlimazel, Charles Schultzie, Lucy's everywhere

43. agentcoops ◴[20 Oct 25 18:46 UTC] No.45647595[source]▶

>>45646275 #

I agree completely with your point, especially the difficulty of developing the user's mental model for what's going on with context and the need to move away from chat UX. It's interesting that there are still few public examples of non-chat UIs that make context management explicit. It's possible that the big names tried this and decided it wasn't worth it -- but from comments here it seems like everyone that has built a production RAG system has come to the opposite conclusion. I'm guessing the real reason is otherwise: likely for the consumer apps controlling context (especially for free users) and inference time is one of the main levers for cost management at scale. Private RAGs, on the other hand, are more concerned with maximizing result quality and minimizing time spent by employee on a particular problem with cost per query much less of a concern --- that's been my experience at least.

44. bityard ◴[20 Oct 25 18:49 UTC] No.45647637[source]▶

>>45645349 (OP) #

I must be missing something, this says it can be self-hosted. But the first page of the self-hosting docs say you need accounts with no less than 6 (!) other third-party hosted services.

We have very different ideas about the meaning of self-hosted.

replies(4): >>45647902 #>>45648454 #>>45648644 #>>45650890 #

45. tifa2up ◴[20 Oct 25 18:58 UTC] No.45647756{4}[source]▶

>>45647377 #

text similarity finds items that closely match. Reranking my select items that are less semantically "similar" but are more relevant to the query.

46. smokel ◴[20 Oct 25 18:58 UTC] No.45647758{4}[source]▶

>>45646598 #

The article raises several interesting points, but I find its claim that Claude Code relies primarily on grep for code search unconvincing. It's clear that Claude Code can parse and reason about code structure, employing techniques far beyond simple regex matching. Since this assumption underpins much of the article's argument, it makes me question the overall reliability of its conclusions a bit.

Or am I completely misunderstanding how Claude Code works?

47. dcreater ◴[20 Oct 25 19:03 UTC] No.45647829[source]▶

>>45645349 (OP) #

do you still use langchain/llamaindex for other agents/AI use cases?

48. leftnode ◴[20 Oct 25 19:10 UTC] No.45647899{3}[source]▶

>>45646823 #

The Qwen3 600M and 4B embedding models are near state of the art and aren't too computationally intensive.

49. goodev ◴[20 Oct 25 19:10 UTC] No.45647902[source]▶

>>45647637 #

I consider this to be good open source and I'm a happy user of their OSS offering. Want no hosted dependencies? Then go write it all in Rust.

replies(1): >>45650200 #

50. leetharris ◴[20 Oct 25 19:13 UTC] No.45647936{3}[source]▶

>>45646482 #

Full text agentic retrieval. Instead of cosine similarity on vectors, parsing metadata through an agentic loop.

To give a real world example, the way Claude Code works versus how Cursor's embedded database works.

replies(1): >>45648797 #

51. _pvzn ◴[20 Oct 25 19:45 UTC] No.45648339[source]▶

>>45645349 (OP) #

Really solid write-up — it’s rare to see someone break down the real tradeoffs of scaling RAG beyond the toy examples. The bit about reranking and chunking actually saving more than fancy LLM tricks hits home to me.

52. dgfitz ◴[20 Oct 25 19:54 UTC] No.45648454[source]▶

>>45647637 #

I’ve never worked in such a space where the deployed environment had unfettered internet access, no access at all actually.

I’ve probably missed a huge wave of programming technology because of this, and I’ve figured out a way to make it work for a consistent paycheck over these past 20 years.

I’m also not a great example, I think I’ve watched 7 whole hours of YouTube videos ever, and those were all for car repair help.

I shy away from tech that needs to be online/connected/whatever.

53. derefr ◴[20 Oct 25 19:57 UTC] No.45648506{4}[source]▶

>>45647377 #

My understanding:

If you generate embeddings (of the query, and of the candidate documents) and compare them for similarity, you're essentially asking whether the documents "look like the question."

If you get an LLM to evaluate how well each candidate document follows from the query, you're asking whether the documents "look like an answer to the question."

An ideal candidate chunk/document from a cosine-similarity perspective, would be one that perfectly restates what the user said — whether or not that document actually helps the user. Which can be made to work, if you're e.g. indexing a knowledge base where every KB document is SEO-optimized to embed all pertinent questions a user might ask that "should lead" to that KB document. But for such documents, even matching the user's query text against a "dumb" tf-idf index will surface them. LLMs aren't gaining you any ground here. (As is evident by the fact that webpages SEO-optimized in this way could already be easily surfaced by old-school search engines if you typed such a query into them.)

An ideal candidate chunk/document from a re-ranking LLM's perspective, would be one that an instruction-following LLM (with the whole corpus in its context) would spit out as a response, if it were prompted with the user's query. E.g. if the user asks a question that could be answered with data, a document containing that data would rank highly. And that's exactly the kind of documents we'd like "semantic search" to surface.

replies(1): >>45651200 #

54. RobertDeNiro ◴[20 Oct 25 20:08 UTC] No.45648644[source]▶

>>45647637 #

That was my observation as well. To be fair their business is to sell a hosted version, they’re under no obligation to release a truly self hosted version.

55. pamelafox ◴[20 Oct 25 20:13 UTC] No.45648705{3}[source]▶

>>45647255 #

At Microsoft, that's all baked into Azure AI Search - hybrid search does BM25, vector search, and re-ranking, just with setting booleans to true. It also has a new Agentic retrieval feature that does the query rewriting and parallel search execution.

Disclosure: I work at MS and help maintain our most popular open-source RAG template, so I follow the best practices closely: https://github.com/Azure-Samples/azure-search-openai-demo/

So few developers realize that you need more than just vector search, so I still spend many of my talks emphasizing the FULL retrieval stack for RAG. It's also possible to do it on top of other DBs like Postgres, but takes more effort.

replies(5): >>45648904 #>>45648985 #>>45649659 #>>45650931 #>>45654119 #

56. lifty ◴[20 Oct 25 20:20 UTC] No.45648797{4}[source]▶

>>45647936 #

How do you do that on 5 million documents?

replies(1): >>45655545 #

57. catmanjan ◴[20 Oct 25 20:30 UTC] No.45648904{4}[source]▶

>>45648705 #

I'd love to work with Azure search but because copilot with external items has been made so cheap it's hard to justify...

replies(1): >>45649184 #

58. alansaber ◴[20 Oct 25 20:36 UTC] No.45648985{4}[source]▶

>>45648705 #

That is concerning given that pure vector search is terrible outside of abstractions

replies(1): >>45649204 #

59. alansaber ◴[20 Oct 25 20:38 UTC] No.45649007[source]▶

>>45646532 #

Yep- that's all best practice. I want to know if we could push performance further- routing the query to different embedding models or scoring strategies, or using multiple re-rankers- still feels like the process is missing something.

replies(1): >>45653469 #

60. alansaber ◴[20 Oct 25 20:41 UTC] No.45649058[source]▶

>>45645905 #

I think you should do both in parallel, rather than sequentially. Main reason is vector scoring could cut off something that an LLM will score as relevant

61. remz14 ◴[20 Oct 25 20:45 UTC] No.45649118{4}[source]▶

>>45646987 #

You should use RTEB instead. See here for why: https://huggingface.co/blog/rteb

Here is that leaderboard https://huggingface.co/spaces/mteb/leaderboard?benchmark_nam...

Voyage-3-large seems like SOTA right now

62. pamelafox ◴[20 Oct 25 20:50 UTC] No.45649184{5}[source]▶

>>45648904 #

Do you mean that you're using the Copilot indexer for Sharepoint docs? https://learn.microsoft.com/en-us/microsoftsearch/semantic-i...

AI Search team's been working with the Sharepoint team to offer more options, so that devs can get best of both worlds. Might have some stuff ready for Ignite (mid November).

63. pamelafox ◴[20 Oct 25 20:52 UTC] No.45649204{5}[source]▶

>>45648985 #

I know :( But I think vector DBs and vector search got so hyped that people thought you could switch entirely over to them. Lots of APIs and frameworks also used "vector store" as the shorthand for "retrieval data source", which didn't help.

That's why I write blog posts like https://blog.pamelafox.org/2024/06/vector-search-is-not-enou...

replies(1): >>45649695 #

64. whinvik ◴[20 Oct 25 20:54 UTC] No.45649234[source]▶

>>45645349 (OP) #

Anybody know what is meant by 'injecting relevant metadata'. Where is it injected?

replies(1): >>45653486 #

65. inshard ◴[20 Oct 25 20:59 UTC] No.45649309[source]▶

>>45645349 (OP) #

Nice app bro https://usul.ai/chat/VgnzXjlRdljIDMBVCfqiy

66. pietz ◴[20 Oct 25 21:31 UTC] No.45649644{3}[source]▶

>>45647238 #

But the model is like 18 months old. and recently we've seen big leaps on MTEB. Not sure how well those translate to reality, but I'm a little surpised this wasn't worth looking into.

67. osigurdson ◴[20 Oct 25 21:33 UTC] No.45649659{4}[source]▶

>>45648705 #

Are you using Elasticsearch behind the scenes?

replies(1): >>45649840 #

68. osigurdson ◴[20 Oct 25 21:35 UTC] No.45649695{6}[source]▶

>>45649204 #

It is almost like embeddings are a technology from the olden days.

69. latchkey ◴[20 Oct 25 21:35 UTC] No.45649697{3}[source]▶

>>45647181 #

Especially now that if you google the word schlep, the first result is now something totally different than what you'd expect.

70. pamelafox ◴[20 Oct 25 21:47 UTC] No.45649840{5}[source]▶

>>45649659 #

I believe that Azure AI Search currently uses lucene for BM25, hnswlib for vector search, and the Bing re-ranking model for semantic ranking. (So, no, it does not, though features are similar)

71. osigurdson ◴[20 Oct 25 21:56 UTC] No.45649932{4}[source]▶

>>45647377 #

Because LLMs are a lot smarter than embeddings and basic math. Think of the vector / lexical search as the first approximation.

72. osigurdson ◴[20 Oct 25 22:05 UTC] No.45650021[source]▶

>>45645349 (OP) #

Speaking of embedding models, OpenAIs are getting a little long in the tooth at this stage.

73. icemanx ◴[20 Oct 25 22:24 UTC] No.45650200{3}[source]▶

>>45647902 #

that's a stupid take and shows lack of engineering experience

74. nl ◴[20 Oct 25 23:58 UTC] No.45650890[source]▶

>>45647637 #

You can self-host their code. I don't think there is any official definition of "self hosted" that this violates.

For example - if a "self hosted" service supports off-site backups is it self hosted or just well designed?

replies(2): >>45651207 #>>45653185 #

75. cipherself ◴[21 Oct 25 00:04 UTC] No.45650931{4}[source]▶

>>45648705 #

I am working on search but rather for text-to-image retrieval, nevertheless, I am curious if by that's all baked into Azure AI search you also meant synthetic query generation from the grandparent comment. If so, what's your latency for this? And do you extract structured data from the query? If so, do you use LLMs for that?

Moreover I am curious why you guys use bm25 over SPLADE?

replies(1): >>45653068 #

76. cipherself ◴[21 Oct 25 00:08 UTC] No.45650963[source]▶

>>45647032 #

S3 Vectors is hands down the SOTA here

SOTA for what? Isn't it just a vector store?

replies(1): >>45656437 #

77. mattfrommars ◴[21 Oct 25 00:17 UTC] No.45651029[source]▶

>>45645349 (OP) #

Great read. But how do people land opportunities to work on exciting project as the author did? I've been trying to get into legal tech in LLM space but I've been unsuccessful.

Anyone here successfully transitioned into legal space? My gut always been legal to the space where LLM can really be useful, the first one is in programming.

78. torrmal ◴[21 Oct 25 00:28 UTC] No.45651090[source]▶

>>45645349 (OP) #

we have been trying to make it so that people dont have to reinvent the wheel, over and over and over again, and have a very straight forward all batteries included that can scale to many millions of documents, combining the best of RAG with traditional search and parametric search, https://docs.mindsdb.com/mindsdb_sql/knowledge_bases/overvie... Would love your feedback.

79. Valk3_ ◴[21 Oct 25 00:42 UTC] No.45651200{5}[source]▶

>>45648506 #

I've been thinking about the problem of what to do if the answer to a question is very different to the question itself in embedding space. The KB method sounds interesting and not something I thought about, you sort work on the "document side" I guess. I've also heard of HYDE, the works on the query side, you generate hypothetical answers instead to the user query and look for documents that are similar to the answer, if I've understood it correctly.

80. taneq ◴[21 Oct 25 00:44 UTC] No.45651207{3}[source]▶

>>45650890 #

In that case I’m self hosting every web page on the internet because I installed Firefox.

81. jweewee ◴[21 Oct 25 01:18 UTC] No.45651432[source]▶

>>45645349 (OP) #

Does anyone know how to do versioning for embeddings? Let’s say I want to update/upsert my data and deliver v6 of domain data instead of v1 or filter for data within a specified date range. I am thinking of exploring context prepending to chunks.

replies(2): >>45651509 #>>45652174 #

82. bob_theslob646 ◴[21 Oct 25 01:33 UTC] No.45651509[source]▶

>>45651432 #

This is a great question

83. captainregex ◴[21 Oct 25 03:31 UTC] No.45652171[source]▶

>>45645349 (OP) #

How much of a hit would you take on quality if you moved the processing local? have you experimented with it? don’t think llamaindex has local sadly

replies(1): >>45653478 #

84. meander_water ◴[21 Oct 25 03:31 UTC] No.45652174[source]▶

>>45651432 #

Your vector store should let you store the original text as well as metadata, where you can store the version. For e.g. turbopuffer lets you filter on attributes https://turbopuffer.com/docs/query#filtering

85. liqilin1567 ◴[21 Oct 25 03:46 UTC] No.45652271[source]▶

>>45645349 (OP) #

> Chunking Strategy: this takes a lot of effort, you'll probably be spending most of your time on it

Could you share more about chunking strategies you used?

86. hawthorns ◴[21 Oct 25 05:29 UTC] No.45652751{4}[source]▶

>>45647377 #

The main point didn't get hit on by the responses. Re-ranking is just a mini-LLM (for latency/cost reasons) that does a double heck. Embedding model finds the closest M documents in R^N space. Re-ranker picks the top K documents from the M documents. In theory, if we just used Gemini 2.5 Pro or GPT 5 as the re-ranker, the performance would even be better than whatever small re-ranker people choose to use.

87. pamelafox ◴[21 Oct 25 06:35 UTC] No.45653068{5}[source]▶

>>45650931 #

Yes, AI Search has a new agentic retrieval feature that includes synthetic query generation: https://techcommunity.microsoft.com/blog/azure-ai-foundry-bl... You can customize the model used and the max # of queries to generate, so latency depends on those factors, plus the length of the conversation history passed in. The model is usually gpt-4o or gpt-4.1 or the -mini of those, so it's the standard latency for those. A more recent version of that feature also uses the LLM to dynamically decide which of several indices to query, and executes the searches in parallel.

That query generation approach does not extract structured data. I do maintain another RAG template for PostgreSQL that uses function calling to turn the query into a structured query, such that I can construct SQL filters dynamically. Docs here: https://github.com/Azure-Samples/rag-postgres-openai-python/...

I'll ask the search about SPLADE, not sure.

replies(1): >>45654306 #

88. kkapelon ◴[21 Oct 25 06:59 UTC] No.45653185{3}[source]▶

>>45650890 #

> For example - if a "self hosted" service supports off-site backups is it self hosted or just well designed?

There is a big difference between communicating with external services (your example) vs REQUIRING external services (what parent is complaining about).

If in your example the system can run correctly with just local backups I would consider it self-hosted.

89. swyx ◴[21 Oct 25 07:21 UTC] No.45653321[source]▶

>>45645349 (OP) #

> LLM: GPT 4.1 -> GPT 5 -> GPT 4.1, covered by Azure credits

whats this roundtrip? also the chronology of the LLM (4.1) doesnt match the rest of the stack (text-embedding-large-3), feels weird

replies(1): >>45653457 #

90. urbandw311er ◴[21 Oct 25 07:38 UTC] No.45653406[source]▶

>>45645349 (OP) #

To somebody thinking of building or paying for such a RAG system, would a workable solution be:

* Upload documents via API into a Google Workspace folder * Use some sort of Google AI search API on those documents in that folder

…placing documents for different customers into different folders.

Or the Azure equivalent whatever that is.

91. tifa2up ◴[21 Oct 25 07:47 UTC] No.45653457[source]▶

>>45653321 #

OP. We migrated to GPT-5 when it came out but found that it performs worse than 4.1 when you pass lots of context (up to 100K tokens in some cases). We found that it:

a) has worse instruction following; doesn't follow the system prompt b) produces very long answers which resulted in a bad ux c) has 125K context window so extreme cases resulted in an error

Again, these were only observed in RAG when you pass lots of chunks, GPT-5 is probably a better model for other taks.

replies(1): >>45662269 #

92. tifa2up ◴[21 Oct 25 07:49 UTC] No.45653469{3}[source]▶

>>45649007 #

OP. The way you improve it is move away from single shot semantic/keyword search and have an agentic system that can evaluate results and do follow-up queries.

93. tifa2up ◴[21 Oct 25 07:51 UTC] No.45653478[source]▶

>>45652171 #

Quite a decent hit. Local models don't perform very well in long contexts. We're planning to support a local-only offline set-up for people to host w/o additional dependencies

94. tifa2up ◴[21 Oct 25 07:53 UTC] No.45653486[source]▶

>>45649234 #

You typically add a lot of metadata with each chunk text to be able to filter it, and do to include in the citations. Injecting metadata means that you see what metadata adds helpful context to the LLM, and when you pass the results to the LLM you pass them in a format like this:

Title: ... Author: ... Text: ...

for each chunk, instead of just passing the text

95. max002 ◴[21 Oct 25 08:52 UTC] No.45653801[source]▶

>>45645349 (OP) #

Great post, gonna be super useful for me :)

96. pietz ◴[21 Oct 25 09:55 UTC] No.45654114[source]▶

>>45645349 (OP) #

My biggest RAG learning is to use agentic RAG. (Sorry for buzzword dropping)

- Classic RAG: `User -> Search -> LLM -> User`

- Agentic RAG: `User <-> LLM <-> Search`

Essentially instead of having a fixed loop, you provide the search as a tool to the LLM, which does three things:

- The LLM can search multiple times

- The LLM can adjust the search query

- The LLM can use multiple tools

The combination of these three things has solved a majority of classic RAG problems. It improves user queries, it can map abbreviations, it can correct bad results on its own, you can also let it list directories and load files directly.

replies(2): >>45656209 #>>45656944 #

97. jankovicsandras ◴[21 Oct 25 09:56 UTC] No.45654119{4}[source]▶

>>45648705 #

"It's also possible to do it on top of other DBs like Postgres, but takes more effort."

Shameless plug: plpgsql_bm25: BM25 search implemented in PL/pgSQL (The Unlicense / PUBLIC DOMAIN)

https://github.com/jankovicsandras/plpgsql_bm25

There's an example Postgres_hybrid_search_RRF.ipynb in the repo which shows hybrid search with Reciprocal Rank Fusion ( plpgsql_bm25 + pgvector ).

98. cipherself ◴[21 Oct 25 10:32 UTC] No.45654306{6}[source]▶

>>45653068 #

Got it, I think this might make sense for a "conversation" type of search not for an instant search feature because lowest latency is gonna be too high IMO.

replies(1): >>45657075 #

99. PunchTornado ◴[21 Oct 25 13:06 UTC] No.45655281{4}[source]▶

>>45647377 #

the reranker is a cross encoder that sees the docs and the query at the same time. What you normally do is you generating embeddings ahead of time, independent of the prompt used, calculate cosine similarity with the prompt, select the top-k best chunks that match the prompt and only then use a reranker to sort them.

embeddings are a lossy compression, so if you feed the chunks with the prompt at the same time, the results are better. But you can't do this for your whole db, that's why the filtering with cosine similarity at the beginning.

100. leetharris ◴[21 Oct 25 13:29 UTC] No.45655545{5}[source]▶

>>45648797 #

People are usually not querying across 5 million documents in a single scope.

If you want something as simple as "suggest similar tweets" or something across millions of things then embeddings still work.

But if you want something like "compare the documents across these three projects" then you would use full text metadata extraction. Keywords, summaries, table of contents, etc to determine data about each document and each chunk.

101. googamooga ◴[21 Oct 25 14:23 UTC] No.45656209[source]▶

>>45654114 #

I fully support this approach! When I first started experimenting—rather naively—with using tool-enabled LLMs to generate documents (such as reports or ADRs) from the extensive knowledge base in Confluence, I built a few tools to help the LLM search Confluence using CQL (Confluence Query Language) and store the retrieved pages in a dedicated folder. The LLM could then search within that folder with simple filesystem tools and pull entire files into its context as needed. The results were quite good, as long as the context didn’t become overloaded. However, when I later tried to switch to a 'Classic RAG' setup, the output quality dropped significantly and I refrained from switching.

102. DSingularity ◴[21 Oct 25 14:42 UTC] No.45656437{3}[source]▶

>>45650963 #

I think he just means it should be assumed to be standard practice and considered baseline at this point.

replies(1): >>45661789 #

103. DSingularity ◴[21 Oct 25 14:43 UTC] No.45656450[source]▶

>>45646303 #

Super useful for grounding which is often the only way to robustly protect against hallucinations.

104. jokethrowaway ◴[21 Oct 25 15:21 UTC] No.45656944[source]▶

>>45654114 #

yes but the assistant often doesn't search when it should and very rarely does multiple search rounds (both on gpt5 or on claude sonnet 4.5, weaker models are even worse at tool calling)

105. jokethrowaway ◴[21 Oct 25 15:26 UTC] No.45656990[source]▶

>>45647213 #

i'd go with qwen embedding 3, gemini embeddings or something from mixedbread

106. pmc00 ◴[21 Oct 25 15:33 UTC] No.45657075{7}[source]▶

>>45654306 #

Fair point on latency, we (Azure AI Search) target both scenarios with different features. For instant search you can just do the usual hybrid + rerank combo, or if you want query rewriting to improve user queries, you can enable QR at a moderate latency hit. We evaluated this approach at length here: https://techcommunity.microsoft.com/blog/azure-ai-foundry-bl...

Of course, agentic retrieval is just better quality-wise for a broader set of scenarios, usual quality-latency trade-off.

We don't do SPLADE today. We've explored it and may get back to it at some point, but we ended up investing more on reranking to boost precision, we've found we have fewer challenges on the recall side.

107. badlogic ◴[21 Oct 25 16:00 UTC] No.45657416[source]▶

>>45645349 (OP) #

I run a few production RAG systems, some as old as end of 2023 and arrived at the same conclusions.

Query expansions and non-naive chunking give the biggest bang for the bug, with chunking being the most resource intensive task, if the input data is chunk (pun intended).

108. cipherself ◴[21 Oct 25 21:13 UTC] No.45661789{4}[source]▶

>>45656437 #

Assuming that's what he meant, why would it be considered baseline versus anything else? I am genuinely curious because I'd like to know more about issues people face with this or that vector store in general.

109. swyx ◴[21 Oct 25 22:00 UTC] No.45662269{3}[source]▶

>>45653457 #

love the share, ty

110. sangwook ◴[22 Oct 25 05:04 UTC] No.45665062[source]▶

>>45647032 #

S3 Vectors is great in terms of cost. But it provides around 500ms median query latency for 1M vectors, unlike other vector stores. And it does not support keyword search and sparse vectors. So I think it is better to choose which vector store to use based on your requirements.

111. crassT ◴[24 Oct 25 09:35 UTC] No.45692691[source]▶

>>45645772 #

I made a startup, https://tokencrush.ai/, to do just this.

I've struggled to find a target market though. Would you mind sharing what your use case is? It would really help give me some direction.

↑