(github.com)

457 points liukidar | 1 comments | 18 Nov 24 17:43 UTC | HN request time: 0.283s | source

Hey there HN! We’re Antonio, Luca, and Yuhang, and we’re excited to introduce Fast GraphRAG, an open-source RAG approach that leverages knowledge graphs and the 25 years old PageRank for better information retrieval and reasoning.

Building a good RAG pipeline these days takes a lot of manual optimizations. Most engineers intuitively start from naive RAG: throw everything in a vector database and hope that semantic search is powerful enough. This can work for use cases where accuracy isn’t too important and hallucinations are tolerable, but it doesn’t work for more difficult queries that involve multi-hop reasoning or more advanced domain understanding. Also, it’s impossible to debug it.

To address these limitations, many engineers find themselves adding extra layers like agent-based preprocessing, custom embeddings, reranking mechanisms, and hybrid search strategies. Much like the early days of machine learning when we manually crafted feature vectors to squeeze out marginal gains, building an effective RAG system often becomes an exercise in crafting engineering “hacks.”

Earlier this year, Microsoft seeded the idea of using Knowledge Graphs for RAG and published GraphRAG - i.e. RAG with Knowledge Graphs. We believe that there is an incredible potential in this idea, but existing implementations are naive in the way they create and explore the graph. That’s why we developed Fast GraphRAG with a new algorithmic approach using good old PageRank.

There are two main challenges when building a reliable RAG system:

(1) Data Noise: Real-world data is often messy. Customer support tickets, chat logs, and other conversational data can include a lot of irrelevant information. If you push noisy data into a vector database, you’re likely to get noisy results.

(2) Domain Specialization: For complex use cases, a RAG system must understand the domain-specific context. This requires creating representations that capture not just the words but the deeper relationships and structures within the data.

Our solution builds on these insights by incorporating knowledge graphs into the RAG pipeline. Knowledge graphs store entities and their relationships, and can help structure data in a way that enables more accurate and context-aware information retrieval. 12 years ago Google announced the knowledge graph we all know about [1]. It was a pioneering move. Now we have LLMs, meaning that people can finally do RAG on their own data with tools that can be as powerful as Google’s original idea.

Before we built this, Antonio was at Amazon, while Luca and Yuhang were finishing their PhDs at Oxford. We had been thinking about this problem for years and we always loved the parallel between pagerank and the human memory [2]. We believe that searching for memories is incredibly similar to searching the web.

Here’s how it works:

- Entity and Relationship Extraction: Fast GraphRAG uses LLMs to extract entities and their relationships from your data and stores them in a graph format [3].

- Query Processing: When you make a query, Fast GraphRAG starts by finding the most relevant entities using vector search, then runs a personalized PageRank algorithm to determine the most important “memories” or pieces of information related to the query [4].

- Incremental Updates: Unlike other graph-based RAG systems, Fast GraphRAG natively supports incremental data insertions. This means you can continuously add new data without reprocessing the entire graph.

- Faster: These design choices make our algorithm faster and more affordable to run than other graph-based RAG systems because we eliminate the need for communities and clustering.

Suppose you’re analyzing a book and want to focus on character interactions, locations, and significant events:

  from fast_graphrag import GraphRAG
  
  DOMAIN = "Analyze this story and identify the characters. Focus on how they interact with each other, the locations they explore, and their relationships."
  
  EXAMPLE_QUERIES = [
      "What is the significance of Christmas Eve in A Christmas Carol?",
      "How does the setting of Victorian London contribute to the story's themes?",
      "Describe the chain of events that leads to Scrooge's transformation.",
      "How does Dickens use the different spirits (Past, Present, and Future) to guide Scrooge?",
      "Why does Dickens choose to divide the story into \"staves\" rather than chapters?"
  ]
  
  ENTITY_TYPES = ["Character", "Animal", "Place", "Object", "Activity", "Event"]
  
  grag = GraphRAG(
      working_dir="./book_example",
      domain=DOMAIN,
      example_queries="\n".join(EXAMPLE_QUERIES),
      entity_types=ENTITY_TYPES
  )
  
  with open("./book.txt") as f:
      grag.insert(f.read())
  
  print(grag.query("Who is Scrooge?").response)

This code creates a domain-specific knowledge graph based on your data, example queries, and specified entity types. Then you can query it in plain English while it automatically handles all the data fetching, entity extractions, co-reference resolutions, memory elections, etc. When you add new data, locking and checkpointing is handled for you as well.

This is the kind of infrastructure that GenAI apps need to handle large-scale real-world data. Our goal is to give you this infrastructure so that you can focus on what’s important: building great apps for your users without having to care about manually engineering a retrieval pipeline. In the managed service, we also have a suite of UI tools for you to explore and debug your knowledge graph.

We have a free hosted solution with up to 100 monthly requests. When you’re ready to grow, we have paid plans that scale with you. And of course you can self host our open-source engine.

Give us a spin today at https://circlemind.co and see our code at https://github.com/circlemind-ai/fast-graphrag

We’d love feedback :)

[1] https://blog.google/products/search/introducing-knowledge-gr...

[2] Griffiths, T. L., Steyvers, M., & Firl, A. (2007). Google and the Mind: Predicting Fluency with PageRank. Psychological Science, 18(12), 1069–1076. http://www.jstor.org/stable/40064705

[3] Similarly to Microsoft’s GraphRAG: https://github.com/microsoft/graphrag

[4] Similarly to OSU’s HippoRAG: https://github.com/OSU-NLP-Group/HippoRAG

https://vhs.charm.sh/vhs-4fCicgsbsc7UX0pemOcsMp.gif

Show context

michelpp ◴[18 Nov 24 22:50 UTC] No.42177973[source]▶

>>42174829 (OP) #

PageRank is one of several interesting centrality metrics that could be applied to a graph to influence RAG on structural data, another one is Triangle Centrality which counts triangles around nodes to figure out their centrality based on the concept that triangles close relationships into a strong bond, where open bonds dilute centrality by drawing weight away from the center:

https://arxiv.org/abs/2105.00110

The paper shows high efficiency compared to other centralities like PageRank, however in some research using the GraphBLAS I and my coauthors found that TC was slower on a variety of sparse graphs than our sparse formulation of PR for graphs up to 1.8 billion edges, but that TC appears to scale better as graphs get larger and is likely more efficient in the trillion edge realm.

https://fossies.org/linux/SuiteSparse/GraphBLAS/Doc/The_Grap...

replies(2): >>42178100 #>>42179763 #

liukidar ◴[18 Nov 24 23:02 UTC] No.42178100[source]▶

>>42177973 #

This is super interesting! Thanks for sharing. Here we are talking of graphs in the milions nodes/edges, so efficiency is not that big of a deal, since anyway things are gonna be parsed by a LLM to craft an asnwer which will always be the bottleneck. Indeed PageRank is the first step, but we would be happy to test more accurate alternatives. Importantly, we are using personalized pagerank here, meaning we give specific intial weights to a set (potentially quite large) of nodes, would TC support that (as well as giving weight to edges, since we are also looking into that)?

replies(1): >>42183404 #

michelpp ◴[19 Nov 24 13:50 UTC] No.42183404[source]▶

>>42178100 #

> Here we are talking of graphs in the milions nodes/edges,

That ought to be enough for anybody.

> would TC support that

TC is a purely structural algorithm, it counts triangles so it doesn't take any weights into consideration, but it does return a vector of normalized ranking from 0.0 to 1.0, which you could combine with an existing biasing strategy to boost results that have strong centrality.

replies(1): >>42188168 #

1. lmeyerov ◴[19 Nov 24 21:16 UTC] No.42188168[source]▶

>>42183404 #

Hah indeed, we are doing billion-scale real-time graph rag in louie.ai for fairly regular tasks, so your sentiment resonates ;-)

For something like uploading a big folder of documents, agree with the OP, pretty straightforward, naive in-memory with out-of-the-box embeddings, LLMs, retrieval, and untuned DBs goes far. I expect most vector-supporting dbaas and LLMaaS to be offering in the new year. OpenAI, Claude, and friends are already going in this direction, leaving the rag techniques opaque for now.

(Something folks may not appreciate, and I think is important about what's being done here, is the incremental update aspect.)

↑

Show HN: FastGraphRAG – Better RAG using good old PageRank