A graph explorer of the Epstein emails

(epstein-doc-explorer-1.onrender.com)

322 points cratermoon | 1 comments | 15 Nov 25 07:27 UTC

https://github.com/maxandrews/Epstein-doc-explorer


pickpuck ◴[17 Nov 25 21:16 UTC] No.45958412[source]▶

What if we extended this idea beyond one dataset to all discrete news events and entities: people, organizations, places?

Just like here you could get a timeline of key events, a graph of connected entities, links to original documents.

Newsrooms might already do this internally idk.

This code might work as a foundation. I love that it's RDF.

replies(10): >>45958506 #>>45958629 #>>45959158 #>>45959273 #>>45959323 #>>45959385 #>>45960015 #>>45960134 #>>45960357 #>>45963779 #

jandrewrogers ◴[17 Nov 25 22:50 UTC] No.45959323[source]▶

>>45958412 #

This has been attempted many times. They all fail the same way.

These general data models start to become useful and interesting at around a trillion edges, give or take an order of magnitude. A mature graph model would be at least a few orders of magnitude larger, even if you aggressively curated what went into it. This is a simple consequence of the cardinality of the different kinds of entities that are included in most useful models.

No system described in open source can get anywhere close to even the base case of a trillion edges. They will suffer serious scaling and performance issues long before they get to that point. It is a famously non-trivial computer science problem and much of the serious R&D was not done in public historically.

This is why you only see toy or narrowly focused graph data models instead of a giant graph of All The Things. It would be cool to have something like this but that entails some hardcore deep tech R&D.

replies(5): >>45959382 #>>45960002 #>>45960262 #>>45960362 #>>45961019 #

1. michelpp ◴[18 Nov 25 00:19 UTC] No.45960002[source]▶

>>45959323 #

There are open source projects moving toward this scale. The GraphBLAS, for example, uses an algebraic formulation over compressed sparse matrix representations of graphs, and it is designed to be portable across many architectures, including CUDA. It would be nice if companies like NVIDIA could get more behind our efforts, as our main bottleneck is access to development hardware.
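For readers unfamiliar with the idea, here is a minimal sketch in plain Python of BFS over a compressed sparse row (CSR) adjacency structure. It is not the GraphBLAS API; the function names `build_csr` and `bfs_levels` are made up for illustration. Each loop iteration expands the whole frontier at once, which is the step a GraphBLAS implementation would express as a masked sparse vector-matrix product over a boolean semiring.

```python
from typing import List, Tuple


def build_csr(num_nodes: int, edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build CSR arrays (row pointers, column indices) for a directed graph."""
    # Count outgoing edges per source node.
    counts = [0] * num_nodes
    for src, _ in edges:
        counts[src] += 1

    # Prefix-sum the counts into row pointers.
    indptr = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        indptr[i + 1] = indptr[i] + counts[i]

    # Scatter each destination into its row's slot.
    indices = [0] * len(edges)
    cursor = list(indptr[:-1])
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return indptr, indices


def bfs_levels(indptr: List[int], indices: List[int], source: int) -> List[int]:
    """Level-synchronous BFS. Returns the hop count per node, -1 if unreachable."""
    num_nodes = len(indptr) - 1
    level = [-1] * num_nodes
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # Expand every frontier node's row; the "level[v] == -1" check plays
        # the role of the unvisited mask in the algebraic formulation.
        for u in frontier:
            for k in range(indptr[u], indptr[u + 1]):
                v = indices[k]
                if level[v] == -1:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

Usage: `indptr, indices = build_csr(6, [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)])` followed by `bfs_levels(indptr, indices, 0)`. A real implementation would keep these arrays in typed, contiguous memory and parallelize the frontier expansion.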

To plug my project: I've wrapped the SuiteSparse GraphBLAS library in a Postgres extension [1] that fluidly blends algebraic graph theory with the relational model. The main flow is to use SQL to structure complex queries for starting points, then use the GraphBLAS to flow through the graph to the endpoints, then join back to tables to get the relevant metadata. On cheap Hetzner hardware (64-core AMD EPYC) we've achieved 7 billion edges per second BFS over the largest graphs in the SuiteSparse collection (~10B edges). With our CUDA support we hope to push that kind of performance into graphs with trillions of edges.

[1] https://github.com/OneSparse/OneSparse
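The three-step flow described above (SQL for starting points, graph traversal, join back for metadata) can be sketched roughly as follows. This is only an illustration of the shape of the pipeline, not the OneSparse API: it uses Python's built-in `sqlite3` as a stand-in for Postgres and a plain in-memory BFS as a stand-in for the GraphBLAS. The schema (`entity`, `edge`) and the function names are hypothetical.

```python
import sqlite3
from collections import deque


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with a hypothetical entity/edge schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT, kind TEXT)")
    conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
    conn.executemany(
        "INSERT INTO entity VALUES (?, ?, ?)",
        [(1, "Alice", "person"), (2, "Acme", "org"), (3, "Bob", "person"),
         (4, "Paris", "place"), (5, "Initech", "org")],
    )
    conn.executemany("INSERT INTO edge VALUES (?, ?)", [(2, 1), (1, 3), (3, 4), (5, 3)])
    return conn


def reachable_with_metadata(conn: sqlite3.Connection, start_kind: str, max_hops: int):
    """Return (name, hops) for every entity within max_hops of the chosen starting points."""
    # Step 1: use SQL to pick the starting points.
    starts = [row[0] for row in conn.execute(
        "SELECT id FROM entity WHERE kind = ? ORDER BY id", (start_kind,))]

    # Step 2: traverse the graph (stand-in for the sparse-matrix traversal).
    adjacency = {}
    for src, dst in conn.execute("SELECT src, dst FROM edge"):
        adjacency.setdefault(src, []).append(dst)
    hops = {s: 0 for s in starts}
    queue = deque(starts)
    while queue:
        u = queue.popleft()
        if hops[u] == max_hops:
            continue  # do not expand past the hop limit
        for v in adjacency.get(u, []):
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)

    # Step 3: join the reached node ids back to the table for metadata.
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS reached (id INTEGER PRIMARY KEY, hops INTEGER)")
    conn.execute("DELETE FROM reached")
    conn.executemany("INSERT INTO reached VALUES (?, ?)", list(hops.items()))
    return conn.execute(
        "SELECT e.name, r.hops FROM reached r JOIN entity e ON e.id = r.id "
        "ORDER BY r.hops, e.name").fetchall()
```

Usage: `reachable_with_metadata(make_demo_db(), "org", 2)` lists every entity within two hops of any organization, along with its hop count. In the setup the comment describes, all three steps would presumably run inside the database rather than round-tripping through application code.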
