←back to thread

Show HN: FastGraphRAG – Better RAG using good old PageRank

(github.com)

457 points liukidar | 2 comments | 18 Nov 24 17:43 UTC | HN request time: 0.401s | source

Hey there HN! We’re Antonio, Luca, and Yuhang, and we’re excited to introduce Fast GraphRAG, an open-source RAG approach that leverages knowledge graphs and the 25 years old PageRank for better information retrieval and reasoning.

Building a good RAG pipeline these days takes a lot of manual optimizations. Most engineers intuitively start from naive RAG: throw everything in a vector database and hope that semantic search is powerful enough. This can work for use cases where accuracy isn’t too important and hallucinations are tolerable, but it doesn’t work for more difficult queries that involve multi-hop reasoning or more advanced domain understanding. Also, it’s impossible to debug it.

To address these limitations, many engineers find themselves adding extra layers like agent-based preprocessing, custom embeddings, reranking mechanisms, and hybrid search strategies. Much like the early days of machine learning when we manually crafted feature vectors to squeeze out marginal gains, building an effective RAG system often becomes an exercise in crafting engineering “hacks.”

Earlier this year, Microsoft seeded the idea of using Knowledge Graphs for RAG and published GraphRAG - i.e. RAG with Knowledge Graphs. We believe that there is an incredible potential in this idea, but existing implementations are naive in the way they create and explore the graph. That’s why we developed Fast GraphRAG with a new algorithmic approach using good old PageRank.

There are two main challenges when building a reliable RAG system:

(1) Data Noise: Real-world data is often messy. Customer support tickets, chat logs, and other conversational data can include a lot of irrelevant information. If you push noisy data into a vector database, you’re likely to get noisy results.

(2) Domain Specialization: For complex use cases, a RAG system must understand the domain-specific context. This requires creating representations that capture not just the words but the deeper relationships and structures within the data.

Our solution builds on these insights by incorporating knowledge graphs into the RAG pipeline. Knowledge graphs store entities and their relationships, and can help structure data in a way that enables more accurate and context-aware information retrieval. 12 years ago Google announced the knowledge graph we all know about [1]. It was a pioneering move. Now we have LLMs, meaning that people can finally do RAG on their own data with tools that can be as powerful as Google’s original idea.

Before we built this, Antonio was at Amazon, while Luca and Yuhang were finishing their PhDs at Oxford. We had been thinking about this problem for years and we always loved the parallel between pagerank and the human memory [2]. We believe that searching for memories is incredibly similar to searching the web.

Here’s how it works:

- Entity and Relationship Extraction: Fast GraphRAG uses LLMs to extract entities and their relationships from your data and stores them in a graph format [3].

- Query Processing: When you make a query, Fast GraphRAG starts by finding the most relevant entities using vector search, then runs a personalized PageRank algorithm to determine the most important “memories” or pieces of information related to the query [4].

- Incremental Updates: Unlike other graph-based RAG systems, Fast GraphRAG natively supports incremental data insertions. This means you can continuously add new data without reprocessing the entire graph.

- Faster: These design choices make our algorithm faster and more affordable to run than other graph-based RAG systems because we eliminate the need for communities and clustering.

Suppose you’re analyzing a book and want to focus on character interactions, locations, and significant events:

  from fast_graphrag import GraphRAG
  
  DOMAIN = "Analyze this story and identify the characters. Focus on how they interact with each other, the locations they explore, and their relationships."
  
  EXAMPLE_QUERIES = [
      "What is the significance of Christmas Eve in A Christmas Carol?",
      "How does the setting of Victorian London contribute to the story's themes?",
      "Describe the chain of events that leads to Scrooge's transformation.",
      "How does Dickens use the different spirits (Past, Present, and Future) to guide Scrooge?",
      "Why does Dickens choose to divide the story into \"staves\" rather than chapters?"
  ]
  
  ENTITY_TYPES = ["Character", "Animal", "Place", "Object", "Activity", "Event"]
  
  grag = GraphRAG(
      working_dir="./book_example",
      domain=DOMAIN,
      example_queries="\n".join(EXAMPLE_QUERIES),
      entity_types=ENTITY_TYPES
  )
  
  with open("./book.txt") as f:
      grag.insert(f.read())
  
  print(grag.query("Who is Scrooge?").response)

This code creates a domain-specific knowledge graph based on your data, example queries, and specified entity types. Then you can query it in plain English while it automatically handles all the data fetching, entity extractions, co-reference resolutions, memory elections, etc. When you add new data, locking and checkpointing is handled for you as well.

This is the kind of infrastructure that GenAI apps need to handle large-scale real-world data. Our goal is to give you this infrastructure so that you can focus on what’s important: building great apps for your users without having to care about manually engineering a retrieval pipeline. In the managed service, we also have a suite of UI tools for you to explore and debug your knowledge graph.

We have a free hosted solution with up to 100 monthly requests. When you’re ready to grow, we have paid plans that scale with you. And of course you can self host our open-source engine.

Give us a spin today at https://circlemind.co and see our code at https://github.com/circlemind-ai/fast-graphrag

We’d love feedback :)

[1] https://blog.google/products/search/introducing-knowledge-gr...

[2] Griffiths, T. L., Steyvers, M., & Firl, A. (2007). Google and the Mind: Predicting Fluency with PageRank. Psychological Science, 18(12), 1069–1076. http://www.jstor.org/stable/40064705

[3] Similarly to Microsoft’s GraphRAG: https://github.com/microsoft/graphrag

[4] Similarly to OSU’s HippoRAG: https://github.com/OSU-NLP-Group/HippoRAG

https://vhs.charm.sh/vhs-4fCicgsbsc7UX0pemOcsMp.gif

Show context

adamgordonbell ◴[18 Nov 24 20:14 UTC] No.42176406[source]▶

>>42174829 (OP) #

So what is the answer to "Who is Scrooge?" and is it different / better than another approach?

( Like whole thing in contenxt window for instance? )

Is this approach just for cost savings or does it help get better answers and how so?

Could you share a specific example?

replies(1): >>42176656 #

1. antves ◴[18 Nov 24 20:34 UTC] No.42176656[source]▶

Generally speaking RAG comes in the game when it is impractical to use large context windows for three reasons: (1) accuracy drops as you stuff the context windows, (2) currently, context windows do not scale past 1M tokens, and (3) even with caching, moving millions of tokens is wasteful and not viable both in terms of costs and latency.

So we should really compare this to other RAG approaches. If we compare it to vector databases RAG, knowledge graphs have the advantage that they model the connections between datapoints. This is super important when asking questions that requires to reason across multiple pieces of information, i.e. multi-hop reasoning.

Also, the graph construction is essentially an exercise in cleaning data to extract the knowledge. Let me give you a practical example. Let's pretend we're indexing customer tickets for creating an AI assistant. If we were to store the data on the tickets as it is, we would overwhelm the vector database with all the noise coming from the conversational nature of this data. With knowledge graphs, we extract only the relevant entities and relationships and store the distilled knowledge in our graph. At query time, we find the answer over a structured data model that contains only clean information

replies(1): >>42177343 #

2. adamgordonbell ◴[18 Nov 24 21:35 UTC] No.42177343[source]▶

>>42176656 (TP) #

Makes sense, but so can you compare it to to RAG then and show how an answer is superior and what the context contains that makes it superior?

Or how it is close to large context quality of answer with lower cost on some specific examples.

It's helpful when a readme contains a demonstration or as I said above, a specific example.