
A graph explorer of the Epstein emails

(epstein-doc-explorer-1.onrender.com)
322 points cratermoon | 9 comments
1. liotier ◴[] No.45957667[source]
"Brad Edwards" and "Bradley Edwards" might be the same individual.
replies(5): >>45958446 #>>45958478 #>>45958562 #>>45958965 #>>45959536 #
2. GuinansEyebrows ◴[] No.45958446[source]
Likewise for instances of "Larry" and "Lawrence" Summers... probably a lot of those.
3. tovej ◴[] No.45958478[source]
Yes, the dataset also has three entries for Virginia Giuffre, "Virginia L. Giuffre", "Virginia Roberts Giuffre", and "Jane Doe Number 3 (Virginia Roberts)"
4. DrewADesign ◴[] No.45958562[source]
I’m sure some developer/archivist is working on a name authority as we speak.
5. cyrusradfar ◴[] No.45958965[source]
Great use case for using AI to suggest merges and clean up the data.
replies(1): >>45959160 #
6. specproc ◴[] No.45959160[source]
LLMs are awful for this. I've got a project that's doing structured extraction and half the work is deduplication.

I didn't go down the LLM route for the cleanup, as you run into scale and context issues with larger datasets.

I got into semantic similarity networks for this use case. You can do efficient pairwise matching with Annoy (an approximate nearest-neighbor library), set a cutoff threshold on similarity, and the isolated subgraphs (connected components) that remain are your merge candidates.

I wrapped up my code in a little library if you're into this sort of thing.

github.com/specialprocedures/semnet
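
The approach described in this comment (pairwise similarity, a cutoff threshold, isolated subgraphs as merge candidates) can be sketched roughly as follows. This is not semnet's API and not the commenter's code. To stay dependency-free it swaps in character-trigram Jaccard similarity for semantic embeddings and a brute-force pairwise loop for Annoy's approximate nearest-neighbor search. The function names and the 0.5 threshold are illustrative assumptions.

```python
from itertools import combinations


def trigrams(name: str) -> set[str]:
    """Character trigrams of a lowercased, space-padded name."""
    padded = f"  {name.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets.

    Assumption: a string-level stand-in for the cosine similarity of
    embeddings that a semantic similarity network would use.
    """
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def merge_candidates(names: list[str], threshold: float) -> list[list[str]]:
    """Return groups of names connected by edges with similarity >= threshold."""
    parent = list(range(len(names)))  # union-find forest over name indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute-force O(n^2) comparison; an ANN index such as Annoy would
    # replace this loop on larger datasets.
    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)

    groups: dict[int, list[str]] = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)

    # Singletons have no neighbor above the cutoff, so they are not candidates.
    return [group for group in groups.values() if len(group) > 1]


names = [
    "Brad Edwards",
    "Bradley Edwards",
    "Larry Summers",
    "Virginia L. Giuffre",
    "Virginia Roberts Giuffre",
]
for group in merge_candidates(names, threshold=0.5):
    print(group)
```

Each returned group is only a candidate for review, not a confirmed merge. Because connected components are transitive, a threshold set too low can chain unrelated names into one group.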

replies(1): >>45963938 #
7. adolph ◴[] No.45959536[source]
I recently read an observation that people subject to discovery often make deliberate typos in key names so that the communication stays under the radar.
replies(1): >>45964472 #
8. mvATM99 ◴[] No.45963938{3}[source]
Nice looking library! Might try it for one of my own projects.
9. potato3732842 ◴[] No.45964472[source]
Everyone is potentially subject to discovery. Some people are just more aware of it.