A graph explorer of the Epstein emails

1. pickpuck ◴[17 Nov 25 21:16 UTC] No.45958412[source]▶

What if we extended this idea beyond one dataset to all discrete news events and entities: people, organizations, places.

Just like here you could get a timeline of key events, a graph of connected entities, links to original documents.

Newsrooms might already do this internally idk.

This code might work as a foundation. I love that it's RDF.

replies(10): >>45958506 #>>45958629 #>>45959158 #>>45959273 #>>45959323 #>>45959385 #>>45960015 #>>45960134 #>>45960357 #>>45963779 #

2. j-pb ◴[17 Nov 25 21:25 UTC] No.45958506[source]▶

>>45958412 (TP) #

If it's RDF it won't work as the foundation.

3. axus ◴[17 Nov 25 21:38 UTC] No.45958629[source]▶

>>45958412 (TP) #

One wonders what the US government agencies use.

replies(6): >>45958683 #>>45958701 #>>45958720 #>>45958779 #>>45959182 #>>45960509 #

4. abnercoimbre ◴[17 Nov 25 21:42 UTC] No.45958683[source]▶

>>45958629 #

I think you meant one shudders. And yeah, Snowden made it clear there's orders of magnitude more data than this graph explorer for them to sift through.

5. PaulHoule ◴[17 Nov 25 21:44 UTC] No.45958701[source]▶

>>45958629 #

Isn’t that what Palantir’s product is?

replies(1): >>45961008 #

6. cjohnson318 ◴[17 Nov 25 21:46 UTC] No.45958720[source]▶

>>45958629 #

They probably use Excel, maybe Microsoft Access.

replies(1): >>45959203 #

7. fancy_pantser ◴[17 Nov 25 21:52 UTC] No.45958779[source]▶

>>45958629 #

Software like i2 Analyst's Notebook.

8. VikingCoder ◴[17 Nov 25 22:31 UTC] No.45959158[source]▶

>>45958412 (TP) #

Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale

Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus

replies(2): >>45973755 #>>45986818 #

9. dboreham ◴[17 Nov 25 22:34 UTC] No.45959182[source]▶

>>45958629 #

Internet search engines have their origins in government projects fwiw. They had search engines before Alta Vista, used for searching data sets that pre-date the internet, and some of the people involved in those went to work on the original commercial search engines.

10. ToucanLoucan ◴[17 Nov 25 22:36 UTC] No.45959203{3}[source]▶

>>45958720 #

Microsoft Access form that connects via IIS to an Excel spreadsheet acting as a database. Also the server it's running on is sitting on a wooden table.

replies(1): >>45975729 #

11. FanaHOVA ◴[17 Nov 25 22:44 UTC] No.45959273[source]▶

>>45958412 (TP) #

One co trying: https://www.system.com

12. jandrewrogers ◴[17 Nov 25 22:50 UTC] No.45959323[source]▶

>>45958412 (TP) #

This has been attempted many times. They all fail the same way.

These general data models start to become useful and interesting at around a trillion edges, give or take an order of magnitude. A mature graph model would be at least a few orders of magnitude larger, even if you aggressively curated what went into it. This is a simple consequence of the cardinality of the different kinds of entities that are included in most useful models.

No system described in open source can get anywhere close to even the base case of a trillion edges. They will suffer serious scaling and performance issues long before they get to that point. It is a famously non-trivial computer science problem and much of the serious R&D was not done in public historically.

This is why you only see toy or narrowly focused graph data models instead of a giant graph of All The Things. It would be cool to have something like this but that entails some hardcore deep tech R&D.

replies(5): >>45959382 #>>45960002 #>>45960262 #>>45960362 #>>45961019 #

13. babelfish ◴[17 Nov 25 22:56 UTC] No.45959382[source]▶

>>45959323 #

I don't have any experience with graph modeling, but it seems like Neo4j should be able to support 1 trillion edges, based on this (admittedly marketing) post of theirs? https://neo4j.com/press-releases/neo4j-scales-trillion-plus-...

replies(1): >>45959854 #

14. johongo ◴[17 Nov 25 22:56 UTC] No.45959385[source]▶

>>45958412 (TP) #

Emil Eifrem (founder of Neo4j) has a talk about them doing this with the Panama Papers.

15. jandrewrogers ◴[17 Nov 25 23:56 UTC] No.45959854{3}[source]▶

>>45959382 #

The graph database market has a deserved reputation for carefully crafting scaling claims that are so narrowly qualified as to be inapplicable to anything real. If you aren't deep into the tech you'll likely miss it in the press releases. It is an industry-wide problem; I'm not trying to single out Neo4j here.

Using this press release as an example, if you pay attention to the details you'll notice that this graph has an anomalously low degree. That is, the graph is very weakly connected, lots of nodes and barely any edges. Typical graph data models have much higher connectivity than this. For example, the classic Graph500 benchmark uses an average degree of 16 to measure scale-out performance.

So why did they nerf the graph connectivity? One of the most fundamental challenges in scaling graphs is optimally cutting them into shards. Unlike most data models, no matter how you cut up the graph some edges will always span multiple shards, which becomes a nasty consistency problem in scale-out systems. Scaling this becomes exponentially harder the more highly connected the graph. So basically, they defined away the problem that makes graphs difficult to scale. They used a graph so weakly connected that they could kinda sorta make it work on a thousand(!) machines even though it is not representative of most real-world graph data models.
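To make the sharding point concrete, here is a minimal sketch in Python. It assumes a uniformly random graph and the most naive placement possible (node id modulo shard count); both are illustrative assumptions, not what Neo4j or any other vendor does, and the helper names are hypothetical. It counts how many edges end up spanning two shards for a sparse graph and for a better-connected one.

```python
import random


def random_graph(num_nodes, avg_degree, seed=0):
    """Build a random undirected edge list with the requested average degree."""
    rng = random.Random(seed)
    # Each undirected edge contributes 2 to the total degree.
    num_edges = num_nodes * avg_degree // 2
    edges = set()
    while len(edges) < num_edges:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v:
            edges.add((min(u, v), max(u, v)))
    return sorted(edges)


def average_degree(num_nodes, edges):
    """Average number of edge endpoints per node."""
    return 2 * len(edges) / num_nodes


def cut_edges(edges, num_shards):
    """Count edges whose endpoints land on different shards (id modulo shard count)."""
    return sum(1 for u, v in edges if u % num_shards != v % num_shards)


if __name__ == "__main__":
    nodes, shards = 1000, 16
    for degree in (2, 16):
        edges = random_graph(nodes, degree)
        cut = cut_edges(edges, shards)
        print(f"avg degree {average_degree(nodes, edges):.0f}: "
              f"{cut} of {len(edges)} edges cross shards")
```

With this naive placement the fraction of cross-shard edges stays near 1 - 1/k for k shards regardless of degree, so the absolute number of edges needing cross-machine coordination grows in step with connectivity. Smarter partitioners can lower the fraction, but on a well-connected graph they cannot drive it to zero.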

replies(1): >>45972450 #

16. michelpp ◴[18 Nov 25 00:19 UTC] No.45960002[source]▶

>>45959323 #

There are open source projects moving toward this scale. GraphBLAS, for example, uses an algebraic formulation over compressed sparse matrix representations for graphs that is designed to be portable across many architectures, including CUDA. It would be nice if companies like NVIDIA could get more behind our efforts, as our main bottleneck is development hardware access.

To plug my project, I've wrapped the SuiteSparse GraphBLAS library in a Postgres extension [1] that fluidly blends algebraic graph theory with the relational model. The main flow is to use SQL to structure complex queries for starting points, then use GraphBLAS to flow through the graph to the endpoints, then join back to tables to get the relevant metadata. On cheap Hetzner hardware (64-core AMD EPYC) we've achieved 7 billion edges per second BFS over the largest graphs in the SuiteSparse collection (~10B edges). With our CUDA support we hope to push that kind of performance into graphs with trillions of edges.

[1] https://github.com/OneSparse/OneSparse
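For readers unfamiliar with the algebraic view of graphs, the sketch below shows the general idea in plain Python: the graph is stored as a compressed sparse row (CSR) adjacency matrix, and each BFS step is a sparse vector-times-matrix product over a boolean (OR, AND) semiring, masked by the set of already-visited nodes. This is only an illustration of the technique; it is not the GraphBLAS or OneSparse API, and the function names are hypothetical.

```python
def to_csr(num_nodes, edges):
    """Compress a directed edge list into CSR arrays (row offsets, column indices)."""
    counts = [0] * num_nodes
    for src, _ in edges:
        counts[src] += 1
    offsets = [0] * (num_nodes + 1)
    for i, count in enumerate(counts):
        offsets[i + 1] = offsets[i] + count
    cols = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot in each row
    for src, dst in edges:
        cols[cursor[src]] = dst
        cursor[src] += 1
    return offsets, cols


def bfs_levels(num_nodes, offsets, cols, source):
    """Return the BFS level of every node, or -1 if unreachable from source."""
    levels = [-1] * num_nodes
    levels[source] = 0
    frontier = {source}  # sparse boolean vector, stored as its nonzero indices
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        for u in frontier:
            # Row u of the adjacency matrix holds the out-neighbors of u.
            for v in cols[offsets[u]:offsets[u + 1]]:
                if levels[v] == -1:  # mask: skip nodes that are already visited
                    next_frontier.add(v)
        for v in next_frontier:
            levels[v] = depth
        frontier = next_frontier
    return levels


if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    offsets, cols = to_csr(6, edges)
    print(bfs_levels(6, offsets, cols, source=0))
```

A real implementation would run the same product over packed arrays in parallel, which is where the throughput comes from; the loop here only shows the shape of the computation.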

17. afavour ◴[18 Nov 25 00:22 UTC] No.45960015[source]▶

>>45958412 (TP) #

The New York Times has an API that lets you query “tags” or “topics” and the articles associated with them:

https://developer.nytimes.com/docs/semantic-api-product/1/ov...

The Guardian has similar:

https://open-platform.theguardian.com/documentation/tag

Either or both could be an interesting starting point for something like that. I tried to find something for the BBC and was surprised they didn’t have anything. I would have figured public media would have been a great resource for this.

18. ggm ◴[18 Nov 25 00:43 UTC] No.45960134[source]▶

>>45958412 (TP) #

Given 6 degrees is rooted in reality, this means we can draw causal graphs from anyone (bad) to anyone (we don't like) and then invent specious reasons why it means "it's all connected, man"

That said, some networks of shorter paths than 6 are interesting. Right now, there's a 1:1 direct path from these documents to a bunch of people with an interest in confounding what evidentiary value they have in justice processes. That's more interesting to me than what the documents say right now.

19. stevage ◴[18 Nov 25 01:05 UTC] No.45960262[source]▶

>>45959323 #

>These general data models start to become useful and interesting at around a trillion edges

That is a wild claim. Perhaps for some very specific definition of "useful and interesting"? This dataset is already interesting (hard to say whether it's useful) at a much tinier scale.

replies(2): >>45960301 #>>45960504 #

20. zozbot234 ◴[18 Nov 25 01:12 UTC] No.45960301{3}[source]▶

>>45960262 #

This is not a "general purpose data model", though. A better example would be Wikidata which at about 100M nodes and 1B edges (so orders of magnitude less than that 1T claim) is already enabling plenty of useful queries about all sorts of publicly-available data and entities.

21. Centigonal ◴[18 Nov 25 01:23 UTC] No.45960357[source]▶

>>45958412 (TP) #

Check out GDELT!

https://www.gdeltproject.org/

replies(2): >>45961995 #>>45961999 #

22. theteapot ◴[18 Nov 25 01:24 UTC] No.45960362[source]▶

>>45959323 #

> It would be cool to have something like this ..

Aren't LLMs something like this?

replies(1): >>45960459 #

23. djtango ◴[18 Nov 25 01:43 UTC] No.45960459{3}[source]▶

>>45960362 #

An LLM probabilistically produces tokens over its model, which is why it can hallucinate, whilst an actual graph model would not have that issue.

24. jandrewrogers ◴[18 Nov 25 01:51 UTC] No.45960504{3}[source]▶

>>45960262 #

It was a widely observed heuristic going back to the days when the Semantic Web was trendy. The underlying reason is also obvious once stated.

Almost every non-trivial graph data model about the world is a graph of human relationships in the population. If not directly then by proxy. Population-scale human relationship graphs commonly pencil out at roughly 1T edges, a function of the population size. It is also typically the highest cardinality entity. Even if the purpose isn't a human relationship graph, they all tend to have one tacitly embedded with the scale implied.
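As a back-of-the-envelope check on the "roughly 1T edges" figure, here is a small Python sketch. The population of 8 billion and the per-person relationship counts are illustrative assumptions, not numbers from the comment; the only point is how the edge count scales with population size.

```python
def relationship_edges(population, avg_relationships):
    """Undirected edges if each person has avg_relationships links on average."""
    # Each relationship is shared by two people, so divide the total by 2.
    return population * avg_relationships // 2


def relationships_needed(population, target_edges):
    """Average relationships per person required to reach target_edges."""
    return 2 * target_edges / population


if __name__ == "__main__":
    world = 8_000_000_000  # assumed population, for illustration only
    print(relationships_needed(world, 10**12))
    for avg in (25, 250, 2500):  # hypothetical per-person averages
        print(avg, relationship_edges(world, avg))
```

Under these assumptions, an average of a few hundred recorded relationships per person (family, employers, co-authors, co-occurrences in documents, and so on) is enough to land at about a trillion edges.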

If you restrict the set of human entities, you either end up with big holes in the graph or it is a graph that is not generally interesting (like one limited to company employees).

The OP was talking about generalizing this to a graph of people, places, events, and organizations, which always has this property.

It is similar to the phenomenon that a vast number of seemingly unrelated statistics are almost perfectly correlated with GDP.

25. arthurcolle ◴[18 Nov 25 01:52 UTC] No.45960509[source]▶

>>45958629 #

Probably not particularly useful but GCHQ & NSA both have neat graph related repos

UK: https://github.com/gchq/Gaffer

US: https://github.com/NationalSecurityAgency/lemongraph

26. sswaner ◴[18 Nov 25 03:16 UTC] No.45961008{3}[source]▶

>>45958701 #

Pretty much, at least at the semantic layer. https://publish.obsidian.md/followtheidea/Content/AI/Ontolog...

27. mmooss ◴[18 Nov 25 03:18 UTC] No.45961019[source]▶

>>45959323 #

> It is a famously non-trivial computer science problem and much of the serious R&D was not done in public historically.

Could you point us to any public research on this issue? Or the history of the proprietary research? Just the names might help - maybe there are news articles, it's a section in someone's book, etc.

28. pbronez ◴[18 Nov 25 06:28 UTC] No.45961995[source]▶

>>45960357 #

Yup, this is a fantastic project and probably the most mature attempt at a global knowledge graph for contemporary news.

29. scotty79 ◴[18 Nov 25 06:29 UTC] No.45961999[source]▶

>>45960357 #

300 categories, 60 attributes ... Doesn't sound very high res.

30. pjc50 ◴[18 Nov 25 11:35 UTC] No.45963779[source]▶

>>45958412 (TP) #

Someone did one for (a small subset of) UK media. People were furious. https://brokenbottleboy.substack.com/p/mapped-out

31. babelfish ◴[18 Nov 25 21:34 UTC] No.45972450{4}[source]▶

>>45959854 #

Thanks for taking the time to respond! Inspired me to go read the Facebook TAO paper.

32. throwaway290 ◴[18 Nov 25 23:37 UTC] No.45973755[source]▶

>>45959158 #

...and of course it's in RDF!

33. cjohnson318 ◴[19 Nov 25 04:04 UTC] No.45975729{4}[source]▶

>>45959203 #

Bro you can't just leak operational secrets on the world wide information highway like this.

34. darth_aardvark ◴[19 Nov 25 23:35 UTC] No.45986818[source]▶

>>45959158 #

Palantir, arguably the closest thing to Torment Nexus Inc. IRL, literally builds a product that does this.