Just like here, you could get a timeline of key events, a graph of connected entities, and links to original documents.
Newsrooms might already do this internally, idk.
This code might work as a foundation. I love that it's RDF.
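For what it's worth, here's a minimal sketch of what that could look like in RDF using rdflib. The namespace, predicates, and entities are invented for illustration; they aren't taken from the code being discussed.

```python
# Minimal sketch: a news story as RDF triples with rdflib.
# The namespace and predicate names are made up for this example,
# not taken from any particular vocabulary.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/news/")

g = Graph()
g.bind("ex", EX)

event = EX["event/contract-signed"]
person = EX["person/jane-doe"]
org = EX["org/acme-corp"]
doc = URIRef("http://example.org/documents/filing-2023-04-01.pdf")

# Timeline of key events: each event gets a date.
g.add((event, RDF.type, EX.Event))
g.add((event, EX.occurredOn, Literal("2023-04-01", datatype=XSD.date)))

# Graph of connected entities: people, orgs, and events link to each other.
g.add((person, EX.affiliatedWith, org))
g.add((event, EX.involves, person))
g.add((event, EX.involves, org))

# Links back to original documents.
g.add((event, EX.sourceDocument, doc))

print(g.serialize(format="turtle"))
```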
These general data models start to become useful and interesting at around a trillion edges, give or take an order of magnitude. A mature graph model would be at least a few orders of magnitude larger, even if you aggressively curated what went into it. This is a simple consequence of the cardinality of the different kinds of entities that are included in most useful models.
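To make the cardinality point concrete, here's a back-of-envelope sketch. Every count and degree in it is a rough placeholder guess, not a measurement, but it shows how quickly the edge count blows past a trillion.

```python
# Back-of-envelope edge-count estimate for a general "graph of everything".
# All counts and degrees below are rough placeholder assumptions.
entity_kinds = {
    # kind: (approximate node count, assumed average edges per node)
    "people":        (8e9,   100),   # contacts, employers, locations, ...
    "organizations": (3e8,   1_000),
    "web_pages":     (1e12,  10),    # outbound links alone
    "documents":     (1e11,  20),    # citations, authors, mentions
}

total_edges = sum(count * degree for count, degree in entity_kinds.values())
print(f"estimated edges: {total_edges:.2e}")   # on the order of 10^13
```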
No system described in open source can get anywhere close to even the base case of a trillion edges; they all hit serious scaling and performance issues long before that point. This is a famously non-trivial computer science problem, and historically much of the serious R&D was not done in public.
This is why you only see toy or narrowly focused graph data models instead of a giant graph of All The Things. It would be cool to have something like this, but it entails some hardcore deep-tech R&D.
Using this press release as an example, if you pay attention to the details you'll notice that the graph has an anomalously low degree. That is, it is very weakly connected: lots of nodes and barely any edges. Typical graph data models have much higher connectivity than this. For example, the classic Graph500 benchmark uses an edge factor of 16 (16 edges per vertex) to measure scale-out performance.
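A quick sanity check on connectivity is the edge factor, edges divided by vertices. The node and edge counts in this sketch are hypothetical stand-ins, not numbers from the press release.

```python
# Edge factor = edges / vertices; a quick proxy for how connected a graph is.
# The node/edge counts here are hypothetical stand-ins, not published numbers.
def edge_factor(num_vertices: int, num_edges: int) -> float:
    return num_edges / num_vertices

sparse_graph = edge_factor(num_vertices=10_000_000_000, num_edges=12_000_000_000)
graph500     = edge_factor(num_vertices=2**42, num_edges=16 * 2**42)  # Graph500: 16 edges per vertex

print(f"weakly connected graph: ~{sparse_graph:.1f} edges per vertex")  # ~1.2
print(f"Graph500 benchmark:      {graph500:.0f} edges per vertex")      # 16
```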
So why did they nerf the graph connectivity? One of the most fundamental challenges in scaling graphs is optimally cutting them into shards. Unlike most data models, no matter how you cut up a graph, some edges will always span multiple shards, and that becomes a nasty consistency problem in scale-out systems. Sharding gets dramatically harder the more highly connected the graph is. So basically, they defined away the problem that makes graphs difficult to scale: they used a graph so weakly connected that they could kinda sorta make it work on a thousand(!) machines, even though it is not representative of most real-world graph data models.
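To see why connectivity matters for sharding, here's a toy simulation: hash-partition the vertices of a random graph across shards and count how many edges end up spanning shards. With naive hash partitioning nearly every edge is cut, and the absolute number of cross-shard edges, i.e. the cross-machine coordination load, grows in direct proportion to the average degree. This is purely illustrative and not how any particular system shards.

```python
# Illustrative sketch: hash-partition vertices across shards and count
# cross-shard edges for random graphs of increasing average degree.
# Not any particular system's sharding scheme.
import random

def random_graph(num_vertices: int, avg_degree: int) -> list[tuple[int, int]]:
    """Generate roughly avg_degree/2 * num_vertices random undirected edges."""
    num_edges = num_vertices * avg_degree // 2
    return [
        (random.randrange(num_vertices), random.randrange(num_vertices))
        for _ in range(num_edges)
    ]

def cross_shard_edges(edges, num_shards: int) -> int:
    """Count edges whose endpoints land on different shards under modulo hashing."""
    return sum(1 for u, v in edges if u % num_shards != v % num_shards)

random.seed(0)
num_vertices, num_shards = 100_000, 1_000
for avg_degree in (2, 16, 64):
    edges = random_graph(num_vertices, avg_degree)
    cut = cross_shard_edges(edges, num_shards)
    print(f"avg degree {avg_degree:>2}: {cut:>9,} of {len(edges):,} edges cross shards "
          f"({cut / len(edges):.0%})")
```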