446 points | liukidar | 24 comments

Hey there HN! We’re Antonio, Luca, and Yuhang, and we’re excited to introduce Fast GraphRAG, an open-source RAG approach that leverages knowledge graphs and the 25-year-old PageRank algorithm for better information retrieval and reasoning.

Building a good RAG pipeline these days takes a lot of manual optimization. Most engineers intuitively start from naive RAG: throw everything in a vector database and hope that semantic search is powerful enough. This can work for use cases where accuracy isn’t too important and hallucinations are tolerable, but it doesn’t work for more difficult queries that involve multi-hop reasoning or more advanced domain understanding. It’s also impossible to debug.

To address these limitations, many engineers find themselves adding extra layers like agent-based preprocessing, custom embeddings, reranking mechanisms, and hybrid search strategies. Much like the early days of machine learning when we manually crafted feature vectors to squeeze out marginal gains, building an effective RAG system often becomes an exercise in crafting engineering “hacks.”

Earlier this year, Microsoft seeded the idea of using Knowledge Graphs for RAG and published GraphRAG, i.e. RAG with Knowledge Graphs. We believe that there is incredible potential in this idea, but existing implementations are naive in the way they create and explore the graph. That’s why we developed Fast GraphRAG with a new algorithmic approach using good old PageRank.

There are two main challenges when building a reliable RAG system:

(1) Data Noise: Real-world data is often messy. Customer support tickets, chat logs, and other conversational data can include a lot of irrelevant information. If you push noisy data into a vector database, you’re likely to get noisy results.

(2) Domain Specialization: For complex use cases, a RAG system must understand the domain-specific context. This requires creating representations that capture not just the words but the deeper relationships and structures within the data.

Our solution builds on these insights by incorporating knowledge graphs into the RAG pipeline. Knowledge graphs store entities and their relationships, and can help structure data in a way that enables more accurate and context-aware information retrieval. 12 years ago Google announced the knowledge graph we all know about [1]. It was a pioneering move. Now we have LLMs, meaning that people can finally do RAG on their own data with tools that can be as powerful as Google’s original idea.

Before we built this, Antonio was at Amazon, while Luca and Yuhang were finishing their PhDs at Oxford. We had been thinking about this problem for years and we always loved the parallel between PageRank and human memory [2]. We believe that searching for memories is incredibly similar to searching the web.

Here’s how it works:

- Entity and Relationship Extraction: Fast GraphRAG uses LLMs to extract entities and their relationships from your data and stores them in a graph format [3].

- Query Processing: When you make a query, Fast GraphRAG starts by finding the most relevant entities using vector search, then runs a personalized PageRank algorithm to determine the most important “memories” or pieces of information related to the query [4].

- Incremental Updates: Unlike other graph-based RAG systems, Fast GraphRAG natively supports incremental data insertions. This means you can continuously add new data without reprocessing the entire graph.

- Faster: These design choices make our algorithm faster and more affordable to run than other graph-based RAG systems because we eliminate the need for communities and clustering.
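
To make the query-processing step more concrete, here is a minimal sketch of personalized PageRank by power iteration. It is not Fast GraphRAG's implementation: the function name, the toy graph, and the default parameters are assumptions for illustration, and the seed weights stand in for whatever the vector search would return.

```python
from collections import defaultdict


def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Personalized PageRank by power iteration on an undirected graph.

    edges: iterable of (node_a, node_b) pairs.
    seeds: dict of seed node -> positive weight (e.g. vector-search similarity);
           every seed is assumed to appear in at least one edge.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    nodes = sorted(neighbors)

    # The restart distribution is concentrated on the query's seed entities.
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}

    rank = dict(restart)
    for _ in range(iterations):
        # With probability (1 - damping) the walk jumps back to a seed.
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbor.
        for n in nodes:
            share = damping * rank[n] / len(neighbors[n])
            for m in neighbors[n]:
                new_rank[m] += share
        rank = new_rank
    return rank


# Toy graph; in practice the LLM extraction step would produce the edges.
edges = [
    ("Scrooge", "Marley"),
    ("Scrooge", "Bob Cratchit"),
    ("Bob Cratchit", "Tiny Tim"),
    ("Scrooge", "Fred"),
    ("Fred", "London"),
]
scores = personalized_pagerank(edges, seeds={"Scrooge": 1.0})
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.3f}")
```

Entities close to the seeds end up with the highest scores, which is what lets the ranking act as a relevance signal for the "memories" attached to them.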

Suppose you’re analyzing a book and want to focus on character interactions, locations, and significant events:

  from fast_graphrag import GraphRAG
  
  # Describe the domain so extraction focuses on what matters in this corpus.
  DOMAIN = "Analyze this story and identify the characters. Focus on how they interact with each other, the locations they explore, and their relationships."
  
  # Example queries show the kinds of questions the graph should help answer.
  EXAMPLE_QUERIES = [
      "What is the significance of Christmas Eve in A Christmas Carol?",
      "How does the setting of Victorian London contribute to the story's themes?",
      "Describe the chain of events that leads to Scrooge's transformation.",
      "How does Dickens use the different spirits (Past, Present, and Future) to guide Scrooge?",
      "Why does Dickens choose to divide the story into \"staves\" rather than chapters?"
  ]
  
  # Entity types to extract into the knowledge graph.
  ENTITY_TYPES = ["Character", "Animal", "Place", "Object", "Activity", "Event"]
  
  grag = GraphRAG(
      working_dir="./book_example",
      domain=DOMAIN,
      example_queries="\n".join(EXAMPLE_QUERIES),
      entity_types=ENTITY_TYPES
  )
  
  # Insert the book text; entities and relationships are extracted from it.
  with open("./book.txt") as f:
      grag.insert(f.read())
  
  # Query the graph in plain English.
  print(grag.query("Who is Scrooge?").response)

This code creates a domain-specific knowledge graph based on your data, example queries, and specified entity types. Then you can query it in plain English while it automatically handles all the data fetching, entity extractions, co-reference resolutions, memory elections, etc. When you add new data, locking and checkpointing are handled for you as well.

This is the kind of infrastructure that GenAI apps need to handle large-scale real-world data. Our goal is to give you this infrastructure so that you can focus on what’s important: building great apps for your users without having to care about manually engineering a retrieval pipeline. In the managed service, we also have a suite of UI tools for you to explore and debug your knowledge graph.

We have a free hosted solution with up to 100 monthly requests. When you’re ready to grow, we have paid plans that scale with you. And of course you can self host our open-source engine.

Give us a spin today at https://circlemind.co and see our code at https://github.com/circlemind-ai/fast-graphrag

We’d love feedback :)

[1] https://blog.google/products/search/introducing-knowledge-gr...

[2] Griffiths, T. L., Steyvers, M., & Firl, A. (2007). Google and the Mind: Predicting Fluency with PageRank. Psychological Science, 18(12), 1069–1076. http://www.jstor.org/stable/40064705

[3] Similarly to Microsoft’s GraphRAG: https://github.com/microsoft/graphrag

[4] Similarly to OSU’s HippoRAG: https://github.com/OSU-NLP-Group/HippoRAG

https://vhs.charm.sh/vhs-4fCicgsbsc7UX0pemOcsMp.gif

1. LASR ◴[] No.42177909[source]
So I've done a ton of work in this area.

A few learnings I've collected:

1. Lexical search with BM25 alone gives you very relevant results if you can do some work during ingestion time with an LLM.

2. Embeddings work well only when the size of the query is roughly on the same order as what you're actually storing in the embedding store.

3. Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings works really well.
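
Point 3 can be sketched roughly as follows. Everything here is a stand-in: `embed` is a toy bag-of-words vector rather than a real embedding model, `fake_llm` returns a canned string instead of calling an LLM, and `hyde_search` is a hypothetical helper name. The point is only the shape of the flow: embed the hypothetical answer, not the raw query.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (placeholder for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_search(query, documents, generate_answer, top_k=1):
    # 1. Ask the LLM for a plausible (possibly wrong) answer to the query.
    hypothetical = generate_answer(query)
    # 2. Embed that answer, which is closer in size and style to stored text.
    query_vec = embed(hypothetical)
    # 3. Rank stored documents by similarity to the hypothetical answer.
    ranked = sorted(
        documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True
    )
    return ranked[:top_k]


def fake_llm(query):
    # Stand-in for an LLM call; a real system would prompt a model here.
    return "Scrooge is a miserly old man who runs a counting house in London"


docs = [
    "Ebenezer Scrooge is a miserly old man who keeps a counting house",
    "Tiny Tim is the youngest son of Bob Cratchit",
]
print(hyde_search("Who is Scrooge?", docs, fake_llm))
```

Because the hypothetical answer is about as long as a stored passage, this also sidesteps the size mismatch described in point 2.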

So combining all 3 learnings, we landed on a knowledge decomposition and extraction step very similar to yours. But we add a metaprompter to essentially auto-generate the domain / entity types.

LLMs are naively bad at identifying the correct level of granularity for the decomposed knowledge. One trick we found is to ask the LLM to output a mermaid.js mindmap to hierarchically break down the input into a tree. At the end of that output, ask the LLM to state which level is the appropriate root for a knowledge node.

Then the node is used to generate questions that could be answered from the knowledge contained in this node. We then index the text of these questions and also embed them.

You can directly match the user's query from these questions using purely BM25 and get good outputs. But a hybrid approach works even better, though not by that much.
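
As a rough illustration of matching a user's query against the generated questions with BM25 alone, here is a small self-contained Okapi BM25 scorer. The class name, the tokenizer, the sample questions, and the `k1`/`b` defaults are assumptions for the sketch, not details of the system described above.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().replace("?", " ").split()


class BM25Index:
    """Minimal Okapi BM25 over a list of short texts (here: generated questions)."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(d)) for d in documents]
        self.lengths = [sum(c.values()) for c in self.docs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        # Number of documents containing each term.
        self.doc_freq = Counter(term for c in self.docs for term in c)

    def idf(self, term):
        n, df = len(self.docs), self.doc_freq[term]
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def score(self, query, i):
        total = 0.0
        for term in tokenize(query):
            tf = self.docs[i][term]
            if tf == 0:
                continue
            # Term-frequency saturation with document-length normalization.
            norm = tf + self.k1 * (
                1 - self.b + self.b * self.lengths[i] / self.avg_len
            )
            total += self.idf(term) * tf * (self.k1 + 1) / norm
        return total

    def best(self, query):
        return max(range(len(self.docs)), key=lambda i: self.score(query, i))


# Each question was (hypothetically) generated from one knowledge node.
questions = [
    "What is the email address for product support?",
    "How do I reset my account password?",
    "Which plans include priority support?",
]
node_ids = ["node-contact", "node-password", "node-plans"]

index = BM25Index(questions)
print(node_ids[index.best("email for product support")])
```

A hybrid setup would combine these scores with embedding similarity over the same questions.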

Not using LLMs at query time also means we can hierarchically walk down from the root into deeper and deeper nodes, using the embedding similarity as a cost function for the traversal.
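
One possible reading of that traversal is a greedy descent: at each level, follow the child most similar to the query, and stop when no child is at least as similar as the current node. The `Node` class, the 2-D toy vectors, and the stopping rule are all assumptions made for this sketch.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    vector: list
    children: list = field(default_factory=list)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def walk(root, query_vec):
    """Greedily descend the tree, following the most query-similar child."""
    path = [root]
    current = root
    while current.children:
        best = max(current.children, key=lambda c: cosine(query_vec, c.vector))
        # Stop when going deeper would move away from the query.
        if cosine(query_vec, best.vector) < cosine(query_vec, current.vector):
            break
        path.append(best)
        current = best
    return [node.label for node in path]


# Toy tree with made-up 2-D "embeddings".
root = Node(
    "knowledge base",
    [1.0, 1.0],
    [
        Node("billing", [1.0, 0.2]),
        Node(
            "support",
            [0.2, 1.0],
            [
                Node("contact email", [0.0, 1.0]),
                Node("opening hours", [0.6, 0.8]),
            ],
        ),
    ],
)
print(walk(root, [0.0, 1.0]))
```

No LLM call is needed during the walk, only similarity comparisons against precomputed node vectors.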

replies(12): >>42178169 #>>42178206 #>>42178645 #>>42178703 #>>42179361 #>>42179704 #>>42183748 #>>42184367 #>>42185058 #>>42185435 #>>42186316 #>>42193843 #
2. sramam ◴[] No.42178169[source]
Very interesting. Thank you for getting into the details. Do you chunk the text that goes into the BM25 index? For the hypothetical answer, do you also prompt for "chunk size" responses?
3. liukidar ◴[] No.42178206[source]
Thanks for sharing! These are all very helpful insights! We'll keep this in mind :)
4. antves ◴[] No.42178645[source]
Thanks for sharing this! It sounds very interesting. We experimented with a similar tree setup some time ago and it was giving good results. We eventually decided to move towards graphs as a general case of trees. I think the notion of using embeddings similarity for "walking" the graph is key, and we're actively integrating it in FastGraphRAG too by weighting the edges by the query. It's very nice to see so many solutions landing on similar designs!
5. sdesol ◴[] No.42178703[source]
> Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings works really well.

This is honestly where I think LLMs really shine. This also gives you a very good idea if your documentation is deficient or not.

6. yaj54 ◴[] No.42179361[source]
> 3. Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings works really well.

I've been wondering about that and am glad to hear it's working in the wild.

I'm now wondering if using a fine-tuned LLM (on the corpus) to gen the hypothetical answers and then use those for the rag flow would work even better.

replies(3): >>42181552 #>>42182937 #>>42190145 #
7. siquick ◴[] No.42179704[source]
> 1. Lexical search with BM25 alone gives you very relevant results if you can do some work during ingestion time with an LLM

Can you expand on what the LLM work here is and its purpose?

> 3. Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings works really well.

Interesting idea, going to add to our experiments. Thanks.

replies(1): >>42182734 #
8. gillesjacobs ◴[] No.42181552[source]
The technique of generating hypothetical answers (or documents) from the query was first described in the HyDE (Hypothetical Document Embeddings) paper. [1]

Interestingly, going both ways helps: generating hypothetical answers for the query and generating hypothetical questions for the text chunk at ingestion both increase RAG performance in my experience.

Though LLM-based query processing is not always suitable for chat applications if inference time is a concern (like near-real-time customer support RAG), so ingestion-time hypothetical answer generation is more apt there.

[1] https://aclanthology.org/2023.acl-long.99/

9. andai ◴[] No.42182734[source]
It seems to come down to keyword expansion, though I'd be curious if there's more to it than just asking "please generate relevant keywords".
replies(1): >>42184700 #
10. tweezy ◴[] No.42182937[source]
We do this as well with a lot of success. It’s cool to see others kinda independently coalescing around this solution.

What we find really effective is at content ingestion time, we prepend “decorator text” to the document or chunk. This incorporates various metadata about the document (title, author(s), publication date, etc).

Then at query time, we generate a contextual hypothetical document that matches the format of the decorator text.
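
A minimal sketch of what such decorator text could look like. The header fields, their order, and the prompt wording are assumptions, and `decorate` and `hypothetical_document_prompt` are hypothetical helper names.

```python
def decorate(chunk, metadata):
    """Prepend a metadata header ("decorator text") to a chunk before indexing."""
    fields = ("title", "authors", "published")
    header = "\n".join(
        f"{name}: {metadata[name]}" for name in fields if name in metadata
    )
    return f"{header}\n\n{chunk}"


def hypothetical_document_prompt(question):
    """Build a prompt asking an LLM for a hypothetical document in the same layout."""
    return (
        "Write a short passage that would answer the question below.\n"
        "Start with these header lines, filled in plausibly:\n"
        "title: ...\nauthors: ...\npublished: ...\n\n"
        f"Question: {question}"
    )


decorated = decorate(
    "Marley was dead: to begin with.",
    {"title": "A Christmas Carol", "authors": "Charles Dickens", "published": "1843"},
)
print(decorated)
print(hypothetical_document_prompt("Who was Scrooge's business partner?"))
```

Keeping the ingestion-time header and the query-time hypothetical document in the same layout is what lets the two line up in embedding space.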

We add hybrid search (BM25 and rerank) to that, also add filters (documents published between these dates, by this author, this type of content, etc). We have an LLM parameterize those filters and use them as part of our retrieval step.

This process works incredibly well for end users.

11. mhuffman ◴[] No.42183748[source]
My experience matches yours, but related to

> 3. Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings works really well.

What sort of performance are you getting in production with this one? The other two are basically solved for performance, and for RAG in general when it involves a known and pre-processed corpus, but I am having trouble seeing how you avoid a performance hit with #3.

replies(1): >>42187084 #
12. isoprophlex ◴[] No.42184367[source]
> LLMs are naively bad at identifying the correct level of granularity for the decomposed knowledge. One trick we found is to ask the LLM to output a mermaid.js mindmap to hierarchically break down the input into a tree. At the end of that output, ask the LLM to state which level is the appropriate root for a knowledge node.

> Then the node is used to generate questions that could be answered from the knowledge contained in this node. We then index the text of these questions and also embed them.

Ha, that's brilliant. Thanks for sharing this!

13. sdesol ◴[] No.42184700{3}[source]
Something that I'm working on is making it easy to fix spelling and grammatical errors in documents that can affect BM25 and embeddings. So in addition to generating keywords/metadata with an LLM, you could also ask it to clean the document; however, based on what I've learned so far, fixing spelling and grammatical errors should involve humans in the process, so you really can't automate this.
replies(2): >>42185565 #>>42186383 #
14. ◴[] No.42185058[source]
15. ◴[] No.42185435[source]
16. firejake308 ◴[] No.42185565{4}[source]
> fixing spelling and grammatical errors should involve humans in the process, so you really can't automate this

This is an interesting observation to me. I would have expected that, since LLMs evolved from autocomplete/autocorrect algorithms, correcting spelling mistakes would be one of their strong suits. Do you have examples of cases where they fail?

replies(1): >>42185710 #
17. sdesol ◴[] No.42185710{5}[source]
If you look at my post history, you can see an example of how Claude and OpenAI models cannot tell that GitHub is spelled correctly. The end result won't make a difference, but it raises questions about how else they can misinterpret things.

At this moment I would not trust AI to automatically make changes.

replies(1): >>42187577 #
18. katelatte ◴[] No.42186316[source]
I organize community calls for the Memgraph community, and recently a community member presented how he uses hypothetical answer generation as a crucial component for making the system more effective and reliable, allowing for more accurate and contextually appropriate responses to user queries. Here's more about it: https://memgraph.com/blog/precina-health-memgraph-graphrag-t...
19. andai ◴[] No.42186383{4}[source]
Fascinating. I think the process could be automated, though I don't know if it's been invented yet. You would want to use the existing autocomplete tech (probabilistic models based on Levenshtein distance and letter proximity on keyboard?) in combination with actually understanding the context of the article and using that to select the right correction. Actually, it sounds fairly trivial to slap those two together, and the 2nd half sounds like something a humble BERT could handle? (I've heard people getting great results with BERTs in current year, though they usually fine-tune them on their particular domain.)
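
For the first half of that idea, a bare-bones Levenshtein-based suggester might look like the sketch below. It ignores keyboard proximity and context entirely; `suggest`, the vocabulary, and the `max_distance` cutoff are illustrative assumptions.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(
                min(
                    prev[j] + 1,  # delete ca
                    curr[j - 1] + 1,  # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, or match for free
                )
            )
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return the closest vocabulary entry, or the word itself if none is close."""
    distance, best = min((levenshtein(word.lower(), v.lower()), v) for v in vocabulary)
    return best if distance <= max_distance else word


vocabulary = ["GitHub", "GitLab", "Gitea"]
print(suggest("Githb", vocabulary))
print(suggest("PageRank", vocabulary))
```

A context-aware model would then choose among several close candidates instead of always taking the nearest one.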

I actually think even BERT could be overkill here -- I have a half-baked prototype of a keyword expansion system that should do the trick here. The idea is to construct a data structure of keywords ahead of time (e.g. by data-mining some portion of Common Crawl), where each keyword has "neighbors" -- words that often appear together and (sometimes, but not always) signal relatedness. I didn't take the concept very far yet, but I give it better than even odds! (Especially if the resulting data structure is pruned by a half-decent LLM -- my initial attempts resulted in a lot of questionable "neighbors" -- though I had a fairly small dataset so it's likely I was largely looking at noise.)
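
A tiny sketch of such a "neighbors" structure, built from document-level co-occurrence counts. The function name, the `min_count` threshold, and the three-document corpus are assumptions; a real corpus would also need stop-word handling and far more data.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_neighbors(documents, min_count=2):
    """Map each word to the words it co-occurs with in at least min_count documents."""
    pair_counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        # Count each unordered word pair once per document.
        pair_counts.update(combinations(words, 2))

    neighbors = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)


docs = [
    "graph database query",
    "graph database index",
    "vector index search",
]
neighbors = build_neighbors(docs)
print(neighbors)
```

Raising `min_count` is the crudest form of pruning; the LLM-based pruning mentioned above would replace or complement it.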

replies(1): >>42187734 #
20. LASR ◴[] No.42187084[source]
It's slow. So we use hypothetical mostly for async experiences.

For live experiences like chat, we solved it with UX. As soon as you start typing the words of a question into the chat box, it does the FTS search and retrieves a set of documents that have word matches, scored just using ES heuristics (e.g. counting matching words).

These are presented as cards that expand when clicked. The user can see it's doing something.

While that's happening, we also issue a full HyDE flow in the background, with a placeholder loading shimmer that loads in the full answer.

So there is some dead-time of about 10 seconds or so while it generates the hypothetical answers. After that, a short ~1 sec interval to load up the knowledge nodes, and then it starts streaming the answer.

This approach tested well with UXR participants and maintains acceptable accuracy.

A lot of the time, when looking for specific facts from a knowledge base, just the card UX gets an answer immediately. E.g.: "What's the email for product support?"

21. spdustin ◴[] No.42187577{6}[source]
My answer to this in my own pet project is to mask terms found by the NER pipeline from being corrected, replacing them with their entity type as a special token (e.g. [male person] or [commercial entity]). That alone dramatically improved grammar/spelling correction, especially because the grammatical "gist" of those masked words is preserved in the text presented to the LLM for "correction".
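
A rough sketch of that masking step, assuming the NER pipeline hands back `(surface_text, entity_type)` pairs. The index inside each token is an addition of this sketch so the masking can be reversed, and `toy_corrector` is a stand-in for the LLM correction call.

```python
def mask_entities(text, entities):
    """Replace each named entity with a typed placeholder token."""
    mapping = {}
    for i, (surface, entity_type) in enumerate(entities):
        token = f"[{entity_type} {i}]"
        text = text.replace(surface, token)
        mapping[token] = surface
    return text, mapping


def unmask(text, mapping):
    """Restore the original entity text after correction."""
    for token, surface in mapping.items():
        text = text.replace(token, surface)
    return text


def toy_corrector(text):
    # Stand-in for the LLM correction call.
    fixes = {"pushd": "pushed", "yestrday": "yesterday"}
    return " ".join(fixes.get(word, word) for word in text.split())


raw = "I pushd the code to GitHub yestrday"
masked, mapping = mask_entities(raw, [("GitHub", "commercial entity")])
corrected = unmask(toy_corrector(masked), mapping)
print(masked)
print(corrected)
```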
22. sdesol ◴[] No.42187734{5}[source]
> I think the process could be automated

It can definitely be automated in my opinion, if you go with a supermajority workflow. Something that I've noticed with LLMs is that it's very unlikely for all high-quality models to be wrong at the same time. So if you go by a supermajority, the changes are almost certainly valid.
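
The supermajority idea can be sketched as a simple vote over the corrections proposed by several models. The two-thirds threshold and the function name are assumptions; the proposals below are hard-coded in place of real model outputs.

```python
from collections import Counter


def supermajority(proposals, threshold=2 / 3):
    """Return the proposal backed by at least `threshold` of the models, else None."""
    winner, votes = Counter(proposals).most_common(1)[0]
    return winner if votes / len(proposals) >= threshold else None


# Each entry is one model's proposed spelling for the same word.
print(supermajority(["GitHub", "GitHub", "GitHub", "Github"]))
print(supermajority(["GitHub", "Github", "Git Hub"]))
```

When the vote returns `None`, the change could be routed to a human reviewer rather than applied automatically.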

Having said all of that, I still believe we are not addressing the root cause of bad searches, which is "garbage in, garbage out". I strongly believe the true calling for LLMs will be to help us curate and manage data, at scale.

23. oedemis ◴[] No.42190145[source]
But what about the chunk size? If we have small chunks, like one sentence, and the HyDE embeddings are larger most of the time, the results are not so good.
24. itissid ◴[] No.42193843[source]
Very cool and relatable. I faced a similar issue with my content categorization engine for local events: http://drophere.co/presence/where (code: https://github.com/itissid/drop_webdemo). Finding the right category for a local event is difficult: an event could be "Outdoorsy" but also "Family Fun" and "Urban Exploration".

Initially I generated categories by asking an LLM with a long prompt (https://github.com/itissid/Drop-PoT/blob/main/src/drop_backe...). But I like your idea better!

My next iteration to solve this problem (I never got to it) was going to be to generate the most appropriate categories based on the user's personal interests, weather, time of day, and non-PII data, and to fine-tune a retrieval and a ranking engine to generate categories for each content piece personalized to them.