At this point, the onus is on the developer to prove its value through A/B comparisons versus traditional RAG. No person/team has the bandwidth to try out this (n+1)th solution.
I see a lot of these KG tools pop up, but they never solve the first problem I have, which is actually constructing the KG itself.
1. langchain, llamaindex, etc. are the equivalent of jQuery or ORMs for calling third-party LLMs. They're thin adapter layers with a bit of consistency and common tasks across providers. Arguably like React, in that they are thin composition layers. So complaints about them being leaky abstractions are in the sense of an ORM getting in the way vs helping.
2. KG/graph RAG libraries are the LLM equivalent of graduating to a full-blown Lucene/Solr engine when regex + LIKE SQL statements aren't enough. These are intelligence engines that address index time, query time, and likely both. Thin libraries and those lacking standard benchmarks are a sign of experiments rather than production relevance: unless you're just talking to 1 PDF, not likely what you want. IMO, no 'winners' here yet: llamaindex was part of an early wave of preprocessors that feed PDFs etc. to the KG, but it isn't winning the actual 'smart' KG/RAG. In contrast, MSR GraphRAG is popular and benchmarks well, but if you read the GitHub & paper, it isn't intended for real use -- e.g., it addresses one family of infrequent queries you'd do in a RAG system ("n-hop"), but not the primary kinds like mixing semantic + keyword search with query rewriting, and it struggles with basics like updates.
Most VC infra/DB $ goes to a layer below the KG. For example, vector databases -- but vector DBs are relatively dumb black boxes; you can think of them more like S3 or a DB index, while the LLM KG/AI quality work generally happens a layer above. (We do train & tune our embedding models, but that's a tiny % of the ultimate win, mostly for smarter compression to handle scaling costs, not the bigger smarts.)
+ 1 to presentation being confusing! VC $ on agents, vector DB co's, etc, and well-meaning LLM enthusiasts are cranking out articles on small uses of LLMs, but in reality, these end up being pretty crappy in quality if you'd actually ship them. So once quality matters, you get into things like the KG/graph RAG work & evals, which is a lot more effort & grinding => smaller % of the infotainment & marketing going around.
(We do this stuff at real-time & data-intensive scales as part of Louie.AI, and are always looking for design partners, esp on graph rag, so happy to chat.)
But we need a theory of the differences too. Right now it is kind of random how we differentiate the tools. We need ergonomics for LLMs.
imo, none. Unfortunately, the landscape is changing too fast. Maybe things will stabilize, but for now I find experimentation a time-consuming but essential part of maintaining any ML stack.
But it's okay not to experiment with every new tool (it can be overwhelming to do this). The key is in understanding one's own stack and filtering out anything that doesn't fit into it.
I have heard good things about Graphrag [1] (but what a stupid name). I did not have the time to try it properly, but it is supposed to build the knowledge graph itself somewhat transparently, using LLMs. This is a big stumbling block. At least vector stores are easy to understand and trivial to build.
It looks like KAG can do this from the summary on GitHub, but I could not really find how to do it in the documentation.
https://github.com/OpenSPG/KAG/blob/master/kag/builder/promp...
All you’re doing here is “front loading” AI: instead of running slow and expensive LLMs at query time, you run them at index time.
It’s a method for data augmentation or, in database lingo, index building. You use LLMs to add context to chunks that doesn’t exist on either the word level (searchable by BM25) or the semantic level (searchable by embeddings).
A simple version of this would be to ask an LLM:
“List all questions this chunk is answering.” [0]
But you can do the same thing for time frames, objects, styles, emotions — whatever you need a “handle” for to later retrieve via BM25 or semantic similarity.
I dreamed of doing that back in 2020, but it would’ve been prohibitively expensive. Because it requires passing your whole corpus through an LLM, possibly multiple times, once for each “angle”.
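Today it is cheap enough to be practical. A rough sketch of that index-time pass (the model string, the `chunks` iterable, and the `index_chunk` helper are placeholders for illustration, not from any particular library):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def questions_for(chunk: str) -> list[str]:
        """Ask the model which questions the chunk answers, one per line."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{
                "role": "user",
                "content": "List all questions this chunk is answering, "
                           "one per line:\n\n" + chunk,
            }],
        )
        text = resp.choices[0].message.content
        return [q.strip() for q in text.splitlines() if q.strip()]

    # Index time: store the synthetic questions alongside the original chunk so
    # BM25 / embedding search can later match against either.
    for chunk in chunks:                                       # `chunks`: your own corpus
        index_chunk(text=chunk, handles=questions_for(chunk))  # hypothetical indexing helper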
That being said, I recommend running any “Graph RAG” system you see here on HN over some 1% or so of your data. And then look inside the database. Look at all text chunks, original and synthetic, that are now in your index.
I’ve done this for a consulting client who absolutely wanted “Graph RAG”. I found the result to be an absolute mess. That is because these systems are built to cover a broad range of applications and are not adapted at all to your problem domain.
So I prefer working backwards:
What kinds of queries do I need to handle? What does the prompt to my query-time LLM need to look like? What context will the LLM need? How can I have this context for each of my chunks, and be able to search it by keyword match or semantic similarity? And now how can I make an LLM return exactly that kind of context, with as few hallucinations and as little filler as possible, for each of my chunks?
This gives you a very lean, very efficient index that can do everything you want.
[0] For a prompt, you’d add context and give the model “space to think”, especially when using a smaller model. Also, you’d instruct it to use a particular format, so you can parse out the part that you need. This “unfancy” approach lets you switch out models easily and compare them against each other without having to care about different APIs for “structured output”.
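For example, something like this (the delimiters are arbitrary, just something easy to parse):

    import re

    PROMPT = """Read the chunk below.
    First, think step by step about what it covers.
    Then, between <questions> and </questions>, list the questions it answers, one per line.

    Chunk:
    {chunk}"""

    def parse_questions(llm_output: str) -> list[str]:
        """Keep only the delimited part; the model's 'thinking' text is discarded."""
        m = re.search(r"<questions>(.*?)</questions>", llm_output, re.DOTALL)
        if not m:
            return []
        return [line.strip() for line in m.group(1).splitlines() if line.strip()]

Because the parsing is just a regex over plain text, the same prompt and parser work with any chat model.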
From section 2.2:
"The engine includes three types of operators: planning, reasoning, and retrieval, which transform natural language problems into problem solving processes that combine language and notation.
In this process, each step can use different operators, such as exact match retrieval, text retrieval, numerical calculation or semantic reasoning, so as to realize the integration of four different problem solving processes: Retrieval, Knowledge Graph reasoning, language reasoning and numerical calculation."
What exactly is being tokenized? RDF, OWL, Neo4j, ...?
how is the knowledge graph serialized?
When I need to build something for an LLM to use, I ask the LLM to build it. That way, by definition, the LLM has a built in understanding of how the system should work, because the LLM itself invented it.
Similarly, when I was doing some experiments with a GPT-4 powered programmer, in the early days I had to omit most of the context (just have method stubs). During that time I noticed that most of the code written by GPT-4 was consistently the same. So I could omit its context because the LLM would already "know" (based on its mental model) what the code should be.
Really? I’m not sure that the word “understanding” means the same thing to you as it does to me.
If you want a transformational shift in terms of accuracy and reasoning, the answer is different. Many times RAG accuracy suffers because the text is out of distribution, and ICL does not work well. You get away with it if all your data is in the public domain in some form (ergo, the LLM was trained on it); otherwise you keep seeing the gaps with no way to bridge them. I published a paper on the problem and how to efficiently solve it, if interested. Here is a simplified blog post on the same: https://medium.com/@ankit_94177/expanding-knowledge-in-large...
Edit: Please reach out here or on email if you would like further details. I might have skipped too many things in the above comment.
This is realistic and hence, unfortunately, going to be unpopular, because people expect magic / want zero effort.
The pace at which things are moving, likely none. You will have to keep making changes as and when you see newer things. One thing in your favor (arguably) is that every technique is very dependent on the dataset and the problem you are solving. So if you do not have the latest one implemented, you will be okay, as long as your evals and metrics are good. So, if this helps: skip the details, understand the basics, and go for your own implementation. One thing to look out for is new SOTA LLM releases and the jumps in capability. E.g., 4o did not announce it, but it started doing very well on vision (GPT-4 was okay; 4o is empirically quite a bit better). These things help when you update your pipeline.
> The white paper is only available for professional developers from different industries. We need to collect your name, contact information, email address, company name, industry type, position and your download purpose to verify your identity...
That's new.
That's not how an LLM works. It doesn't understand your question, nor the answer. It can only give you a statistically likely sequence of words that should follow what you gave it.
I’ve heard of a few very large companies using glean (https://www.glean.com/)
This is the route I’d take if I wanted to make a business around rag.
It’s not hard for a product to swap the underlying LLM for a given task.
Just finished a call a few minutes ago, and we came to the conclusion that we'll do natural-language queries and BM25 scoring with Tantivy-based code first:
https://github.com/quickwit-oss/tantivy
In the meantime we're collecting all the questions to ask the LLM, so we can be more conscious in the hybrid search implementation phase.
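When we get to that phase, reciprocal rank fusion is one simple way to merge the BM25 hits with the vector hits. A sketch, nothing Tantivy-specific; the doc-id lists are whatever your two retrievers return:

    def rrf(bm25_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
        """Reciprocal rank fusion over two ranked lists of document ids."""
        scores: dict[str, float] = {}
        for ranking in (bm25_ids, vector_ids):
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # fused = rrf(tantivy_hits, embedding_hits)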
It is trivial, completely devoid of any creativity, and most importantly quite difficult to google. It’s like they did not really think about it even for 5 seconds before uploading.
> if anything its too generic and multiple people who have the same idea now cannot use the name bc microsoft made the most noise about it.
Exactly! Anyway, I am not judging the software, which I have yet to try properly.
I have to agree. It’s actually quite a good summary of hacking with AI-related libraries these days. A lot of them get complex fast once you get slightly out of the intended path. I hope it’ll get better, but unfortunately it is where we are.
[1] https://github.com/microsoft/graphrag/tree/main/graphrag/pro...
https://github.com/getzep/graphiti
I’m one of the authors. Happy to answer any questions.
That's not even correct, starring isn't going to do that. You'd need to smash that subscribe button and not forget the bell icon (metaphorically), not ~like~ star it.
NLP is fast but requires a model that is trained on an ontology that works with your data. Once you have that, it’s a matter of simply feeding the model your bazillion CSVs and PDFs.
LLMs are slow but way easier to start as ontologies can be generated on the fly. This is a double edged sword however as LLMs have a tendency to lose fidelity and consistency on edge naming.
I work in NLP, which is the most used in practice as it’s far more consistent and explainable in very large corpora. But the difficulty in starting a fresh ontology dead ends many projects.
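For a rough feel of the NLP route, here is a sketch using spaCy's stock NER as a stand-in for a model trained on your own ontology (the model name is just the default English pipeline):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # swap in a model trained on your ontology

    def extract_entities(text: str) -> list[tuple[str, str]]:
        """Return (surface text, label) pairs with a fixed, consistent label set."""
        doc = nlp(text)
        return [(ent.text, ent.label_) for ent in doc.ents]

    # Feed your bazillion CSVs/PDFs (after text extraction) through this; node and
    # edge names stay consistent because the label set is fixed by the model.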
Don't have time to scan the source code myself, but are you using the OpenAI Python library, so the server URL can easily be changed? Didn't see it exposed by your library, so hoping it can at least be overridden with an env var, so we could use local LLMs instead.
This is a common issue I've seen from LLM projects that only kind-of understand what is going on here and try to turn their vector database with semantic edge information into something that has a formal name.
I recommend looking at some simple SPARQL queries to get an idea of what’s happening.
What I’ve seen is using LLMs to identify what possible relationships some information may have by comparing it to the kinds of relationships in your database.
Then, when building the SPARQL query, the system uses those relationships to query relevant data.
The llm never digests the graph. The system around the llm uses the capabilities of graph data stores to find relevant context for the llm.
What you’ll find with most RAG systems is that the LLM plays a smaller part than you’d think.
It reveals semantic information (such as conceptual relationships) and generates final responses. The system around it is where the far more interesting work happens imo.
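A toy version of that pattern with rdflib; the relation names, the graph file, and the `ask_llm` helper are made up for illustration, not any particular product's API:

    from rdflib import Graph

    g = Graph().parse("kg.ttl", format="turtle")  # your existing graph

    # 1. The LLM only picks which known relation types the question touches.
    known_relations = ["ex:worksFor", "ex:locatedIn", "ex:partOf"]
    relation = ask_llm(                            # hypothetical LLM call
        f"Which single relation from {known_relations} is most relevant to "
        "'Where is Alice employed?' Answer with the relation name only."
    )

    # 2. The surrounding system runs the graph query and hands the rows back to
    #    the LLM as context; the LLM never digests the whole graph.
    rows = g.query(f"""
        PREFIX ex: <http://example.org/>
        SELECT ?s ?o WHERE {{ ?s {relation} ?o }}
    """)
    context = "\n".join(f"{s} {relation} {o}" for s, o in rows)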
I’ve noticed this too and the ironic thing is that building the KG is the most critical part of making everything work.
Then Google did.
Then LLaVA.
The issue is that this technology has no moat (other than the cost to create models and datasets).
There’s not a lot of secret sauce you can use that someone else can’t trivially replicate, given the resources.
It’s going to come down to good ol product design and engineering.
The issue is openai doesn’t seem to care about what their users want. (I don’t think their users know what they want either, but that’s another discussion)
They want more money to make bigger models in the hope that nobody else can or will.
They want to achieve regulatory capture as their moat.
For all their technical abilities at scaling LLM training and inference, I don’t get the feeling that they have great product direction.
https://github.com/OpenSPG/KAG/blob/master/kag/builder/promp...
GraphRAG and a lot of the semantic indexes are simply vector databases with pre-computed similarity edges, which you cannot perform any reasoning over (reasoning being the definition and intention of a knowledge graph).
This is probably worth looking at; it's the first open-source project I've seen that is actually using LLMs to generate knowledge graphs. It does look pretty primitive for that task, but it might be a useful reference for others going down this road.
This is actually attempting fact extraction into an ontology so you can reason over this instead of reasoning in the LLM.
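For contrast, here is roughly what "fact extraction into an ontology" means in practice; the relation vocabulary and the `ask_llm` helper are illustrative, not KAG's actual API:

    # A tiny fixed ontology: the only relations the extractor may emit.
    ONTOLOGY_RELATIONS = {"founded_by", "headquartered_in", "acquired"}

    def extract_triples(text: str) -> list[tuple[str, str, str]]:
        """Ask the LLM for subject | relation | object triples restricted to the ontology."""
        raw = ask_llm(  # hypothetical LLM call
            "Extract facts as 'subject | relation | object', one per line, "
            f"using only these relations: {sorted(ONTOLOGY_RELATIONS)}.\n\n{text}"
        )
        triples = []
        for line in raw.splitlines():
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3 and parts[1] in ONTOLOGY_RELATIONS:
                triples.append((parts[0], parts[1], parts[2]))
        return triples

    # Unlike similarity edges, these triples can be queried and reasoned over
    # symbolically ("which acquired companies are headquartered in X?") outside the LLM.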
> We recommend that you put this on a local fork as we really want the service to be as lightweight and simple as possible as we see this as a good entry point into new developers.
Sadly, it seems like you're recommending forking the library instead of allowing people to use local LLMs. You were smart enough to lock the PR from any further conversation at least :)
after_submitting: 'https://spg.openkg.cn/en-US/download?token=0a735e9a-72ea-11ee-b962-0242ac120002'
https://mdn.alipayobjects.com/huamei_xgb3qj/afts/file/A*6gpq...
https://neuml.hashnode.dev/advanced-rag-with-graph-path-trav...
In fact, I'm wondering if that's what happened in the early noughts and we had the misfortune of Java, and still have the misfortune of JavaScript.
Both are cut from the same cloth of typical inexperienced devs who made something cool in a new space and posted it on GitHub, but then immediately morphed into companies trying to trap users etc., without going through an organic lifecycle of growing, improving, and refactoring with the community.
So yes, there's a huge pile of tools and software for working with knowledge graphs, but to date populating the graph is still the realm of human experts.
Same findings here, re: legal text. Basic hybrid search performs better. In this use case the user knows what to look for, so the queries are specific. The advantage of graph RAG is when you need to integrate disparate sources for a holistic overview.
Perhaps one needs to manually create a starting point, then ask the LLM to propose links to various documents or follow an existing one.
Sufficiently loopable traversal should create a KG.
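Something like this loop is what I have in mind; `propose_links` would be an LLM call, and the whole thing is hand-waving rather than a real pipeline:

    seed = ["doc_0"]            # the manually created starting point
    graph = {}                  # doc id -> list of (relation, target doc id) edges
    frontier = list(seed)
    visited = set()

    while frontier:
        doc = frontier.pop()
        if doc in visited:
            continue
        visited.add(doc)
        links = propose_links(doc)   # hypothetical LLM-backed proposal step
        graph[doc] = links
        frontier.extend(target for _, target in links if target not in visited)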
This becomes a cyclical hallucination problem. The LLM hallucinates and creates an incorrect graph, which in turn creates even more incorrect knowledge.
We are working on this issue of reducing hallucination in knowledge graphs, and using an LLM is not at all the right way.
Retrieving one with low latency is another.
I had good success with CIM for utilities about 15 years ago, building a network graph modelling the distribution and transmission networks and adding sensor and event data for monitoring and analysis.
Anywhere there is a technology-focused consortium of vendors and users building standards, you will likely find a prebuilt graph. When RDF was “hot”, many of these groups spun out some attempt to model their domain.
In summary: if you need one, look for one. Maybe there’s one waiting for you, and you get to do less convincing and more doing.