Show HN: FastGraphRAG – Better RAG using good old PageRank

    from fast_graphrag import GraphRAG

    DOMAIN = "Analyze this story and identify the characters. Focus on how they interact with each other, the locations they explore, and their relationships."

    EXAMPLE_QUERIES = [
        "What is the significance of Christmas Eve in A Christmas Carol?",
        "How does the setting of Victorian London contribute to the story's themes?",
        "Describe the chain of events that leads to Scrooge's transformation.",
        "How does Dickens use the different spirits (Past, Present, and Future) to guide Scrooge?",
        "Why does Dickens choose to divide the story into \"staves\" rather than chapters?"
    ]

    ENTITY_TYPES = ["Character", "Animal", "Place", "Object", "Activity", "Event"]

    grag = GraphRAG(
        working_dir="./book_example",
        domain=DOMAIN,
        example_queries="\n".join(EXAMPLE_QUERIES),
        entity_types=ENTITY_TYPES
    )

    with open("./book.txt") as f:
        grag.insert(f.read())

    print(grag.query("Who is Scrooge?").response)

So I've done a ton of work in this area.

A few learnings I've collected:

1. Lexical search with BM25 alone gives you very relevant results if you can do some work during ingestion time with an LLM.

2. Embeddings work well only when the length of the query is roughly on the same order as the length of the text you're actually storing in the embedding store.

3. Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings works really well.
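To make learning 3 concrete, here is a minimal sketch of querying with a hypothetical answer instead of the raw query. Everything in it is an assumption for illustration: `fake_llm` stands in for a real LLM call, `toy_embed` is a bag-of-words stand-in for a real embedding model, and the passages are made up. It is not the pipeline described in this comment, just the general shape of the idea.

```python
import math
import re
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hypothetical_answer_search(
    query: str,
    passages: list[str],
    generate: Callable[[str], str],
    embed: Callable[[str], Counter] = toy_embed,
) -> str:
    # 1. Ask the LLM to write a plausible answer to the query.
    hypothetical = generate(query)
    # 2. Embed the answer (not the query) and find the closest stored passage.
    answer_vec = embed(hypothetical)
    return max(passages, key=lambda p: cosine(answer_vec, embed(p)))


# Hypothetical data and a canned "LLM" so the sketch runs offline.
PASSAGES = [
    "Marley was dead, to begin with. Scrooge and he had been business partners for many years.",
    "The Cratchit family shared a small goose and a pudding on Christmas Day.",
]


def fake_llm(query: str) -> str:
    return "Marley is Scrooge's dead business partner."


best = hypothetical_answer_search("Who is Marley?", PASSAGES, fake_llm)
print(best)
```

The point of the detour is that the hypothetical answer is closer in length and vocabulary to the stored text than the short query is, which ties back to learning 2.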

So combining all 3 learnings, we landed on a knowledge decomposition and extraction step very similar to yours. But we put a metaprompter in front of it to essentially auto-generate the domain / entity types.

LLMs are naively bad at identifying the correct level of granularity for the decomposed knowledge. One trick we found is to ask the LLM to output a mermaid.js mindmap to hierarchically break down the input into a tree. At the end of that output, ask the LLM to state which level is the appropriate root for a knowledge node.
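A rough sketch of what consuming that output could look like. The trailer line `KNOWLEDGE_NODE_LEVEL: <n>` is a format I made up for illustration (the comment does not say how the level is stated), and the parser only handles plain-text mindmap nodes nested by indentation, not mermaid's shape syntax such as `((text))`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    depth: int
    children: list["Node"] = field(default_factory=list)


def split_llm_output(raw: str) -> tuple[str, int]:
    # Assumed format: mindmap text, then a final "KNOWLEDGE_NODE_LEVEL: n" line.
    body, _, trailer = raw.rstrip().rpartition("\n")
    key, _, value = trailer.partition(":")
    if key.strip() != "KNOWLEDGE_NODE_LEVEL":
        raise ValueError("missing KNOWLEDGE_NODE_LEVEL trailer")
    return body, int(value)


def parse_mindmap(source: str) -> Node:
    # Build a tree from indentation: deeper indent means child of the line above.
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not lines or lines[0].strip() != "mindmap":
        raise ValueError("expected output to start with 'mindmap'")
    stack: list[tuple[int, Node]] = []  # (indent, node) along the current path
    root = None
    for line in lines[1:]:
        indent = len(line) - len(line.lstrip())
        # Pop back up until the top of the stack is a real ancestor.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        node = Node(text=line.strip(), depth=len(stack))
        if stack:
            stack[-1][1].children.append(node)
        elif root is None:
            root = node
        else:
            raise ValueError("mindmap has more than one root")
        stack.append((indent, node))
    if root is None:
        raise ValueError("mindmap has no nodes")
    return root


def nodes_at_depth(root: Node, depth: int) -> list[Node]:
    # Breadth-first walk that stops at the requested depth.
    out, todo = [], [root]
    while todo:
        node = todo.pop(0)
        if node.depth == depth:
            out.append(node)
        else:
            todo.extend(node.children)
    return out


# Hypothetical LLM output for illustration.
EXAMPLE = """mindmap
  A Christmas Carol
    Characters
      Scrooge
      Marley
    Spirits
      Past
      Present
KNOWLEDGE_NODE_LEVEL: 1"""

body, level = split_llm_output(EXAMPLE)
tree = parse_mindmap(body)
for knowledge_node in nodes_at_depth(tree, level):
    print(knowledge_node.text, [c.text for c in knowledge_node.children])
```

Each node returned by `nodes_at_depth` would then become one knowledge node, with its subtree as the content.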

Then the node is used to generate questions that could be answered from the knowledge contained in this node. We then index the text of these questions and also embed them.

You can directly match the user's query against these questions using purely BM25 and get good outputs. A hybrid approach (BM25 plus embeddings) works even better, though not by that much.
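For the pure-BM25 path, a minimal self-contained sketch might look like the following. The node IDs and generated questions are made-up examples, the `k1`/`b` values are common defaults rather than anything tuned, and in practice you would likely use a search engine or an existing BM25 library instead of hand-rolling the scoring.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_tf = [Counter(t) for t in self.doc_tokens]
        total = sum(len(t) for t in self.doc_tokens)
        self.avgdl = (total / len(docs)) if total else 1.0
        # Document frequency: in how many documents each term appears.
        df = Counter(term for tf in self.doc_tf for term in tf)
        n = len(docs)
        # Lucene-style IDF, which stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def scores(self, query: str) -> list[float]:
        q_terms = tokenize(query)
        out = []
        for tokens, tf in zip(self.doc_tokens, self.doc_tf):
            # Length normalisation: long documents are penalised via b.
            norm = self.k1 * (1 - self.b + self.b * len(tokens) / self.avgdl)
            score = 0.0
            for term in q_terms:
                f = tf.get(term, 0)
                if f:
                    score += self.idf[term] * f * (self.k1 + 1) / (f + norm)
            out.append(score)
        return out


# Hypothetical (node_id, generated question) pairs.
QUESTIONS = [
    ("node-marley", "Who was Jacob Marley and why does his ghost visit Scrooge?"),
    ("node-cratchit", "How does Bob Cratchit's family celebrate Christmas?"),
    ("node-staves", "Why is the book divided into staves instead of chapters?"),
]


def search(query: str, top_k: int = 1) -> list[str]:
    index = BM25([question for _, question in QUESTIONS])
    ranked = sorted(zip(index.scores(query), QUESTIONS), key=lambda p: p[0], reverse=True)
    # Return the knowledge nodes whose questions matched, best first.
    return [node_id for score, (node_id, _) in ranked[:top_k] if score > 0]


print(search("Why does Marley's ghost appear to Scrooge?"))
```

A hybrid variant would blend these scores with embedding similarity computed over the same questions.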

Not using LLMs at query time also means we can hierarchically walk down from the root into deeper and deeper nodes, using the embedding similarity as a cost function for the traversal.
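One possible reading of that traversal, as a greedy descent: at each node, move to the child most similar to the query, and stop when no child is at least as similar as the current node. The tree, the 3-dimensional vectors, and the stopping rule are all assumptions for illustration; a best-first search over a priority queue would be an equally valid interpretation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class KNode:
    name: str
    vec: list[float]
    children: list["KNode"] = field(default_factory=list)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def descend(root: KNode, query_vec: list[float]) -> list[str]:
    # Greedy walk: follow the most similar child while similarity does not drop.
    path = [root]
    current, current_sim = root, cosine(root.vec, query_vec)
    while current.children:
        best = max(current.children, key=lambda c: cosine(c.vec, query_vec))
        best_sim = cosine(best.vec, query_vec)
        if best_sim < current_sim:
            break  # going deeper would move away from the query
        path.append(best)
        current, current_sim = best, best_sim
    return [node.name for node in path]


# Toy tree with made-up embeddings.
TREE = KNode("story", [1.0, 1.0, 1.0], [
    KNode("characters", [1.0, 0.2, 0.0], [
        KNode("scrooge", [1.0, 0.0, 0.0]),
        KNode("marley", [0.8, 0.6, 0.0]),
    ]),
    KNode("spirits", [0.0, 1.0, 0.2]),
])

print(descend(TREE, [1.0, 0.0, 0.0]))
```

Since every step is just a handful of vector comparisons, the walk stays cheap even for deep trees.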