Production RAG: what I learned from processing 5M+ documents

1. mediaman ◴[20 Oct 25 17:22 UTC] No.45646532[source]▶

The point about synthetic query generation is good. We found users had very poor queries, so we initially had the LLM generate synthetic queries. But then we found that the results could vary widely based on the specific synthetic query it generated, so we had it create three variants (all in one LLM call, so that you can prompt it to generate a wide variety, instead of getting three very similar ones back), do parallel search, and then use reciprocal rank fusion to combine the list into a set of broadly strong performers. For the searches we use hybrid dense + sparse bm25, since dense doesn't work well for technical words.

This, combined with a subsequent reranker, basically eliminated any of our issues on search.

replies(4): >>45647148 #>>45647160 #>>45647255 #>>45649007 #

2. avereveard ◴[20 Oct 25 18:10 UTC] No.45647148[source]▶

>>45646532 (TP) #

final tip is to also feed the interpretation of the user search to the user on the other side, so he can check if the llm understanding was correct.

3. deepsquirrelnet ◴[20 Oct 25 18:11 UTC] No.45647160[source]▶

>>45646532 (TP) #

> For the searches we use hybrid dense + sparse bm25, since dense doesn't work well for technical words.

One thing I’m always curious about is if you could simplify this and get good/better results using SPLADE. The v3 models look really good and seem to provide a good balance of semantic and lexical retrieval.

4. siva7 ◴[20 Oct 25 18:20 UTC] No.45647255[source]▶

>>45646532 (TP) #

Boy, that should not be the concern of the end user (developer) but those implementing RAG solutions as a service at Amazon, Microsoft, Openai and so on.

replies(1): >>45648705 #

5. pamelafox ◴[20 Oct 25 20:13 UTC] No.45648705[source]▶

>>45647255 #

At Microsoft, that's all baked into Azure AI Search - hybrid search does BM25, vector search, and re-ranking, just with setting booleans to true. It also has a new Agentic retrieval feature that does the query rewriting and parallel search execution.

Disclosure: I work at MS and help maintain our most popular open-source RAG template, so I follow the best practices closely: https://github.com/Azure-Samples/azure-search-openai-demo/

So few developers realize that you need more than just vector search, so I still spend many of my talks emphasizing the FULL retrieval stack for RAG. It's also possible to do it on top of other DBs like Postgres, but takes more effort.

replies(5): >>45648904 #>>45648985 #>>45649659 #>>45650931 #>>45654119 #

6. catmanjan ◴[20 Oct 25 20:30 UTC] No.45648904{3}[source]▶

>>45648705 #

I'd love to work with Azure search but because copilot with external items has been made so cheap it's hard to justify...

replies(1): >>45649184 #

7. alansaber ◴[20 Oct 25 20:36 UTC] No.45648985{3}[source]▶

>>45648705 #

That is concerning given that pure vector search is terrible outside of abstractions

replies(1): >>45649204 #

8. alansaber ◴[20 Oct 25 20:38 UTC] No.45649007[source]▶

>>45646532 (TP) #

Yep- that's all best practice. I want to know if we could push performance further- routing the query to different embedding models or scoring strategies, or using multiple re-rankers- still feels like the process is missing something.

replies(1): >>45653469 #

9. pamelafox ◴[20 Oct 25 20:50 UTC] No.45649184{4}[source]▶

>>45648904 #

Do you mean that you're using the Copilot indexer for Sharepoint docs? https://learn.microsoft.com/en-us/microsoftsearch/semantic-i...

AI Search team's been working with the Sharepoint team to offer more options, so that devs can get best of both worlds. Might have some stuff ready for Ignite (mid November).

10. pamelafox ◴[20 Oct 25 20:52 UTC] No.45649204{4}[source]▶

>>45648985 #

I know :( But I think vector DBs and vector search got so hyped that people thought you could switch entirely over to them. Lots of APIs and frameworks also used "vector store" as the shorthand for "retrieval data source", which didn't help.

That's why I write blog posts like https://blog.pamelafox.org/2024/06/vector-search-is-not-enou...

replies(1): >>45649695 #

11. osigurdson ◴[20 Oct 25 21:33 UTC] No.45649659{3}[source]▶

>>45648705 #

Are you using Elasticsearch behind the scenes?

replies(1): >>45649840 #

12. osigurdson ◴[20 Oct 25 21:35 UTC] No.45649695{5}[source]▶

>>45649204 #

It is almost like embeddings are a technology from the olden days.

13. pamelafox ◴[20 Oct 25 21:47 UTC] No.45649840{4}[source]▶

>>45649659 #

I believe that Azure AI Search currently uses lucene for BM25, hnswlib for vector search, and the Bing re-ranking model for semantic ranking. (So, no, it does not, though features are similar)

14. cipherself ◴[21 Oct 25 00:04 UTC] No.45650931{3}[source]▶

>>45648705 #

I am working on search but rather for text-to-image retrieval, nevertheless, I am curious if by that's all baked into Azure AI search you also meant synthetic query generation from the grandparent comment. If so, what's your latency for this? And do you extract structured data from the query? If so, do you use LLMs for that?

Moreover I am curious why you guys use bm25 over SPLADE?

replies(1): >>45653068 #

15. pamelafox ◴[21 Oct 25 06:35 UTC] No.45653068{4}[source]▶

>>45650931 #

Yes, AI Search has a new agentic retrieval feature that includes synthetic query generation: https://techcommunity.microsoft.com/blog/azure-ai-foundry-bl... You can customize the model used and the max # of queries to generate, so latency depends on those factors, plus the length of the conversation history passed in. The model is usually gpt-4o or gpt-4.1 or the -mini of those, so it's the standard latency for those. A more recent version of that feature also uses the LLM to dynamically decide which of several indices to query, and executes the searches in parallel.

That query generation approach does not extract structured data. I do maintain another RAG template for PostgreSQL that uses function calling to turn the query into a structured query, such that I can construct SQL filters dynamically. Docs here: https://github.com/Azure-Samples/rag-postgres-openai-python/...

I'll ask the search about SPLADE, not sure.

replies(1): >>45654306 #

16. tifa2up ◴[21 Oct 25 07:49 UTC] No.45653469[source]▶

>>45649007 #

OP. The way you improve it is move away from single shot semantic/keyword search and have an agentic system that can evaluate results and do follow-up queries.

17. jankovicsandras ◴[21 Oct 25 09:56 UTC] No.45654119{3}[source]▶

>>45648705 #

"It's also possible to do it on top of other DBs like Postgres, but takes more effort."

Shameless plug: plpgsql_bm25: BM25 search implemented in PL/pgSQL (The Unlicense / PUBLIC DOMAIN)

https://github.com/jankovicsandras/plpgsql_bm25

There's an example Postgres_hybrid_search_RRF.ipynb in the repo which shows hybrid search with Reciprocal Rank Fusion ( plpgsql_bm25 + pgvector ).

18. cipherself ◴[21 Oct 25 10:32 UTC] No.45654306{5}[source]▶

>>45653068 #

Got it, I think this might make sense for a "conversation" type of search not for an instant search feature because lowest latency is gonna be too high IMO.

replies(1): >>45657075 #

19. pmc00 ◴[21 Oct 25 15:33 UTC] No.45657075{6}[source]▶

>>45654306 #

Fair point on latency, we (Azure AI Search) target both scenarios with different features. For instant search you can just do the usual hybrid + rerank combo, or if you want query rewriting to improve user queries, you can enable QR at a moderate latency hit. We evaluated this approach at length here: https://techcommunity.microsoft.com/blog/azure-ai-foundry-bl...

Of course, agentic retrieval is just better quality-wise for a broader set of scenarios, usual quality-latency trade-off.

We don't do SPLADE today. We've explored it and may get back to it at some point, but we ended up investing more on reranking to boost precision, we've found we have fewer challenges on the recall side.