

230 points taikon | 31 comments
1. isoprophlex ◴[] No.42547133[source]
Fancy, I think, but again no word on the actual work of turning a few bazillion CSV files and PDFs into a knowledge graph.

I see a lot of these KG tools pop up, but they never solve the first problem I have, which is actually constructing the KG itself.
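
For flat CSVs at least, that first step can be bootstrapped mechanically: treat a key column as the subject of each row and the remaining headers as predicates. A minimal hand-rolled sketch (the column names and data are made up for illustration; this says nothing about how any particular KG tool does it):

```python
import csv
import io

def csv_to_triples(csv_text, subject_col):
    """Turn each CSV row into (subject, predicate, object) triples,
    using the header row as predicate names."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = row[subject_col]
        for col, value in row.items():
            if col != subject_col and value:
                triples.append((subject, col, value))
    return triples

data = """name,employer,city
Ada,Initech,London
Grace,Initech,New York"""

for triple in csv_to_triples(data, "name"):
    print(triple)  # e.g. ('Ada', 'employer', 'Initech')
```

PDFs are of course the genuinely hard half of the problem; nothing this simple survives contact with them.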

replies(11): >>42547488 #>>42547556 #>>42547743 #>>42548481 #>>42549416 #>>42549856 #>>42549911 #>>42550327 #>>42551738 #>>42552272 #>>42562692 #
2. kergonath ◴[] No.42547488[source]
> I see a lot of these KG tools pop up, but they never solve the first problem I have, which is actually constructing the KG itself.

I have heard good things about Graphrag [1] (but what a stupid name). I did not have the time to try it properly, but it is supposed to build the knowledge graph itself somewhat transparently, using LLMs. This is a big stumbling block. At least vector stores are easy to understand and trivial to build.

It looks like KAG can do this from the summary on GitHub, but I could not really find how to do it in the documentation.

[1] https://microsoft.github.io/graphrag/

replies(3): >>42547518 #>>42547785 #>>42550262 #
3. isoprophlex ◴[] No.42547518[source]
Indeed, they seem to actually know/show how the sausage is made... but still, no fire-and-forget approach for any random dataset. Check out what you need to do if the default isn't working for you (scroll down to e.g. the entity_extraction settings). There is so much complexity there to deal with that I'd just roll my own extraction pipeline from the start, rather than learn someone else's complex setup (that you have to tweak for each new use case).

https://microsoft.github.io/graphrag/config/yaml/

replies(2): >>42549293 #>>42549804 #
4. jimmySixDOF ◴[] No.42547556[source]
There is some automated named-entity extraction and relationship building from un-/semi-structured data as part of the Neo4j onboarding now, to go with all these GraphRAG efforts (and maybe an honorable mention to WhyHow.ai too).
5. fermisea ◴[] No.42547743[source]
We're trying to solve this problem at ergodic.ai, combining structured tables and PDFs into a single KG.
replies(1): >>42554318 #
6. swyx ◴[] No.42547785[source]
why stupid? it uses a Graph in RAG: graphrag. if anything it's too generic, and multiple people who have the same idea now cannot use the name because Microsoft made the most noise about it.
replies(2): >>42548744 #>>42549276 #
7. bkovacev ◴[] No.42548481[source]
I have been building something like this for myself. Is there room for paid software here, and would you be willing to pay for something like that?
replies(1): >>42549040 #
8. washadjeffmad ◴[] No.42548744{3}[source]
It's the sound Bobcat Goldthwait would have made if he'd voiced the Aflac duck.
9. dartos ◴[] No.42549040[source]
IMO there is only a B2B market for this kind of thing.

I’ve heard of a few very large companies using glean (https://www.glean.com/)

This is the route I’d take if I wanted to make a business around RAG.

10. kergonath ◴[] No.42549276{3}[source]
> why stupid? it uses a Graph in RAG. graphrag.

It is trivial, completely devoid of any creativity, and most importantly quite difficult to google. It’s like they did not really think about it even for 5 seconds before uploading.

> if anything it's too generic, and multiple people who have the same idea now cannot use the name because Microsoft made the most noise about it.

Exactly! Anyway, I am not judging the software, which I have yet to try properly.

11. kergonath ◴[] No.42549293{3}[source]
> i'd just roll my own extraction pipeline from the start, rather than learning someone elses complex setup

I have to agree. It’s actually quite a good summary of hacking with AI-related libraries these days. A lot of them get complex fast once you get slightly out of the intended path. I hope it’ll get better, but unfortunately it is where we are.

12. axpy906 ◴[] No.42549416[source]
Came here to say this and glad I am not the only one. Building out an ontology seems like quite an expensive process. It would be hard to convince my stakeholders to do this.
replies(1): >>42563165 #
13. veggieroll ◴[] No.42549804{3}[source]
IMO like with most other out-of-the-box LLM frameworks, the value is in looking at their prompts and then doing it yourself.

[1] https://github.com/microsoft/graphrag/tree/main/graphrag/pro...
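
In that spirit, a minimal hand-rolled version with the LLM call stubbed out. The `<|>` field and `##` record delimiters below are only approximated from memory of delimiter-style extraction prompts, not an exact reproduction of GraphRAG's, and `fake_llm` is obviously a placeholder for a real completion call:

```python
# Parse a delimiter-formatted extraction response yourself, with the
# LLM call stubbed. Delimiters are illustrative, not GraphRAG's exact ones.
RECORD_SEP, FIELD_SEP = "##", "<|>"

def fake_llm(prompt):
    """Stand-in for a real completion call against an extraction prompt."""
    return ('("entity"<|>MARIE CURIE<|>PERSON<|>Physicist and chemist)##'
            '("relationship"<|>MARIE CURIE<|>RADIUM<|>discovered)')

def extract(text):
    entities, relations = [], []
    for record in fake_llm(f"Extract entities from: {text}").split(RECORD_SEP):
        fields = record.strip().strip("()").split(FIELD_SEP)
        if fields[0] == '"entity"':
            entities.append(tuple(fields[1:4]))    # (name, type, description)
        elif fields[0] == '"relationship"':
            relations.append(tuple(fields[1:4]))   # (source, target, relation)
    return entities, relations

ents, rels = extract("Marie Curie discovered radium.")
```

The parsing is the easy twenty lines; the value of reading their prompts is seeing what instructions and few-shot examples make the model emit that format reliably.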

14. roseway4 ◴[] No.42549856[source]
You may want to take a look at Graphiti, which accepts plaintext or JSON input and automatically constructs a KG. While it’s primarily designed to enable temporal use cases (where data changes over time), it works just as well with static content.

https://github.com/getzep/graphiti

I’m one of the authors. Happy to answer any questions.

replies(3): >>42549922 #>>42555303 #>>42555979 #
15. jeromechoo ◴[] No.42549911[source]
There are two paths to KG generation today and both are problematic in their own ways. 1. Natural Language Processing (NLP) 2. LLM

NLP is fast but requires a model that is trained on an ontology that works with your data. Once you have that, it’s a matter of simply feeding the model your bazillion CSVs and PDFs.

LLMs are slow but way easier to start with, as ontologies can be generated on the fly. This is a double-edged sword, however, as LLMs have a tendency to lose fidelity and consistency in edge naming.

I work in NLP, which is the most used in practice as it’s far more consistent and explainable in very large corpora. But the difficulty in starting a fresh ontology dead ends many projects.
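
The edge-naming inconsistency is partly mitigable with label normalization before edges enter the graph. A small sketch (the labels and normalization rules are just illustrative; real systems also need entity resolution, not just string cleanup):

```python
import re
from collections import defaultdict

def canonical(label):
    """Normalize entity/edge labels so an LLM's inconsistent naming
    ('Acme Corp.' vs 'acme corp', 'Founded-In' vs 'founded_in') merges."""
    label = re.sub(r"[-_]", " ", label.lower())   # unify separators
    label = re.sub(r"[^\w\s]", "", label)         # drop punctuation
    return re.sub(r"\s+", " ", label).strip()     # collapse whitespace

edges = defaultdict(set)

def add_edge(src, rel, dst):
    edges[canonical(src)].add((canonical(rel), canonical(dst)))

# Two extraction passes that named the same entity differently
# still land on one node:
add_edge("Acme Corp.", "Headquartered-In", "Berlin")
add_edge("acme corp", "founded_in", "1999")
```

This catches surface-form drift only; "Acme" vs "Acme Corporation" still needs proper entity resolution.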

16. diggan ◴[] No.42549922[source]
> Graphiti uses OpenAI for LLM inference and embedding. Ensure that an OPENAI_API_KEY is set in your environment. Support for Anthropic and Groq LLM inferences is available, too.

Don't have time to scan the source code myself, but are you using the OpenAI Python library, so the server URL can easily be changed? Didn't see it exposed by your library, so hoping it can at least be overridden with an env var, so we could use local LLMs instead.

replies(1): >>42550375 #
17. TrueDuality ◴[] No.42550262[source]
GraphRAG isn't quite a knowledge graph. It is a graph of document snippets with semantic relations, but it is not doing fact extraction, nor can you do any reasoning over the structure itself.

This is a common issue I've seen from LLM projects that only kind-of understand what is going on here and try to dress up their vector database with semantic edge information as something that has a formal name.
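
The distinction is concrete: a fact-level KG stores typed triples you can reason over, whereas a snippet graph only says "these chunks are related". A toy illustration of the kind of inference the former supports (facts and relation are made up; real reasoners handle far more than one transitive relation):

```python
# Typed facts, not document snippets:
facts = {
    ("Mitochondria", "part_of", "Cell"),
    ("Cell", "part_of", "Tissue"),
    ("Tissue", "part_of", "Organ"),
}

def transitive_closure(triples, relation):
    """Derive implied facts for a transitive relation like part_of."""
    derived = set(triples)
    changed = True
    while changed:
        changed = False
        new = {(a, relation, c)
               for (a, r1, b) in derived if r1 == relation
               for (b2, r2, c) in derived if r2 == relation and b2 == b}
        if not new <= derived:
            derived |= new
            changed = True
    return derived

closed = transitive_closure(facts, "part_of")
# Never stated directly, but derivable:
print(("Mitochondria", "part_of", "Organ") in closed)  # True
```

A similarity graph over text chunks has no relation types, so there is nothing for a rule like this to operate on.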

18. melvinmelih ◴[] No.42550327[source]
> but they never solve the first problem I have, which is actually constructing the KG itself.

I’ve noticed this too and the ironic thing is that building the KG is the most critical part of making everything work.

19. diggan ◴[] No.42550375{3}[source]
On second look, it seems like you've already rejected a PR trying to add local LLM support: https://github.com/getzep/graphiti/pull/184

> We recommend that you put this on a local fork as we really want the service to be as lightweight and simple as possible as we see this as a good entry point into new developers.

Sadly, it seems like you're recommending forking the library instead of allowing people to use local LLMs. You were smart enough to lock the PR from any further conversation at least :)

replies(1): >>42551027 #
20. roseway4 ◴[] No.42551027{4}[source]
You can override the default OpenAI url using an environment variable (iirc, OPENAI_API_BASE). Any LLM provider / inference server offering an OpenAI-compatible API will work.
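
For anyone else landing here, a minimal sketch of that override, assuming the standard `openai` Python client underneath (the local URL is illustrative; iirc pre-1.0 versions of openai-python read `OPENAI_API_BASE` while v1+ reads `OPENAI_BASE_URL`, so setting both is the lazy-safe option):

```python
import os

# Point any OpenAI-compatible client (Ollama, vLLM, llama.cpp server, ...)
# at a local endpoint before the client library is imported/instantiated.
# Which variable is read depends on the openai-python version in use:
os.environ["OPENAI_API_BASE"] = "http://localhost:11434/v1"   # openai < 1.0
os.environ["OPENAI_BASE_URL"] = "http://localhost:11434/v1"   # openai >= 1.0
os.environ["OPENAI_API_KEY"] = "unused-for-local-inference"   # still required
```

Whether Graphiti picks these up depends on when it constructs its client, so treat this as a sketch to test, not a guarantee.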
replies(1): >>42552851 #
21. dmezzetti ◴[] No.42551738[source]
txtai automatically builds graphs using vector similarity as data is loaded. Another option is to use something like GLiNER to create entities on the fly, and then create relationships between those entities and/or documents. Or you can do both.

https://neuml.hashnode.dev/advanced-rag-with-graph-path-trav...

22. cratermoon ◴[] No.42552272[source]
This has always been the Hard Problem. For one, constructing an ontology that is comprehensive, flexible, and stable is a huge effort. Then, taking the unstructured mess of documents and categorizing them is an entire industry in itself; librarians have cataloging as a sub-specialty of library science devoted to this.

So yes, there's a huge pile of tools and software for working with knowledge graphs, but to date populating the graph is still the realm of human experts.

replies(1): >>42554016 #
23. diggan ◴[] No.42552851{5}[source]
Granted, that works provided they use the `openai` Python library (or another library/implementation that reads that same env var), hence my question in the previous-previous comment...
24. cyanydeez ◴[] No.42554016[source]
When you boil it down, the current LLMs could work effectively if a prompt engineer could figure out a converging loop: a librarian tasked with generating a hypertext web ring crossed with a Wikipedia.

Perhaps one needs to manually create a starting point, then ask the LLM to propose links to various documents or follow an existing one.

Sufficiently loopable traversal should create a KG.
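
That loop is sketchable even with the LLM stubbed out. Here `propose_links` stands in for a prompt like "which of these documents does X relate to?" and is approximated by naive word overlap purely so the example runs; everything below is hypothetical scaffolding, not any real tool:

```python
def propose_links(doc, corpus):
    """Stand-in for an LLM link-proposal prompt; approximated here
    by naive word overlap so the sketch is runnable."""
    words = set(doc.split())
    return [other for other in corpus
            if other != doc and words & set(other.split())]

def build_graph(seed, corpus):
    """Manually created starting point, then loop until the traversal
    converges, i.e. no unseen documents remain on the frontier."""
    graph, frontier = {}, [seed]
    while frontier:
        doc = frontier.pop()
        if doc in graph:        # already expanded: the loop converges
            continue
        graph[doc] = propose_links(doc, corpus)
        frontier.extend(graph[doc])
    return graph

corpus = ["cats eat fish", "fish live in water", "cats live in houses"]
g = build_graph(corpus[0], corpus)
```

The convergence guarantee here comes from never re-expanding a visited document; with a real LLM proposing links, you'd also need to cap the loop, since the model can keep inventing new edges.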

replies(1): >>42554408 #
25. elbi ◴[] No.42554318[source]
Are you creating the KG first, or using an LLM to do so?
26. cratermoon ◴[] No.42554408{3}[source]
Oh yes. ( nods wisely)
27. ganeshkrishnan ◴[] No.42555303[source]
>uses OpenAI for LLM inference and embedding

This becomes a cyclical hallucination problem. The LLM hallucinates and creates an incorrect graph, which in turn creates even more incorrect knowledge.

We are working on this issue of reducing hallucination in knowledge graphs, and using an LLM is not at all the right way.

replies(1): >>42585953 #
28. dramebaaz ◴[] No.42555979[source]
Excited to try it! Been looking for a temporally-aware way of creating a KG for my journal dataset
29. mikestaub ◴[] No.42562692[source]
https://github.com/HKUDS/LightRAG is pretty good
30. lunatuna ◴[] No.42563165[source]
There are several ontologies already well built out. Utilities and pharma both have them, as examples. They are built by committees of vendors and users. They take a bit of effort to penetrate, in both approach and language. Often they are built to be adaptable.

I’ve had good success with CIM for utilities: about 15 years ago I used it to build a network graph for modelling the distribution and transmission networks, adding sensor and event data for monitoring and analysis.

Anywhere there is a technology-focused consortium of vendors and users building standards, you will likely find a prebuilt graph. When RDF was “hot”, many of these groups spun out some attempt to model their domain.

In summary, if you need one look for one. Maybe there’s one waiting for you and you get to do less convincing and more doing.

31. sc077y ◴[] No.42585953{3}[source]
Actually, the rate of hallucination is not constant across the board. For one, you're doing a sort of synthesis, not intense reasoning or retrieval with the LLM. Second, the problem is segmented into sub-problems, much like o1 or o3 do using CoT. Thus, the risk of hallucination is significantly lower compared to a zero-shot raw LLM or even a naive RAG approach.