(www.marginalia.nu)

357 points ingve | 4 comments | 13 May 25 15:01 UTC

1. wrs ◴[13 May 25 15:57 UTC] No.43974336[source]▶

Since these are statistical classification problems, it seems like it would be worth trying some old-school machine learning (not an LLM, just an NN) to see how it compares with these manual heuristics.

replies(1): >>43974445 #

2. marginalia_nu ◴[13 May 25 16:06 UTC] No.43974445[source]▶

>>43974336 (TP) #

I imagine that would work pretty well given an adequate and representative body of annotated sample data. Though that is also not easy to come by.

replies(2): >>43975438 #>>43980893 #

3. ted_dunning ◴[13 May 25 17:28 UTC] No.43975438[source]▶

>>43974445 #

Actually, it is easy to come up with reasonably decent heuristics that can auto-tag a corpus. From that you can look for anomalies and adjust your tagging system.

The problem of getting a representative body is (surprisingly) much harder than the annotation. I know. I spent quite some time years ago doing this.
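A minimal sketch of the auto-tag-then-look-for-anomalies loop, under loud assumptions: the input format (a dict with `text` and `size` per extracted line), both heuristics, and the 1.2x size threshold are all hypothetical and only illustrate the idea. Two cheap, independent rules tag each line; lines where they agree are taken as labeled, and lines where they disagree are the anomalies worth a manual look.

```python
from statistics import median


def tag_by_size(lines):
    """Hypothetical rule 1: a line is a heading if its font is clearly larger than the median."""
    base = median(line["size"] for line in lines)
    return ["heading" if line["size"] >= 1.2 * base else "body" for line in lines]


def tag_by_shape(lines):
    """Hypothetical rule 2: a line is a heading if it is short and lacks a final period."""
    tags = []
    for line in lines:
        text = line["text"].strip()
        is_short = len(text.split()) <= 8
        tags.append("heading" if is_short and not text.endswith(".") else "body")
    return tags


def auto_tag(lines):
    """Return (tagged, anomalies): agreed labels, and lines where the two rules disagree."""
    tagged, anomalies = [], []
    for line, by_size, by_shape in zip(lines, tag_by_size(lines), tag_by_shape(lines)):
        if by_size == by_shape:
            tagged.append((line["text"], by_size))
        else:
            anomalies.append((line["text"], by_size, by_shape))
    return tagged, anomalies


# Made-up sample input, not real extraction output.
SAMPLE = [
    {"text": "1. Introduction", "size": 16.0},
    {"text": "Extracting text from PDF files is harder than it looks.", "size": 10.0},
    {"text": "Glyphs are positioned individually on the page.", "size": 10.0},
    {"text": "Figure 1", "size": 10.0},
    {"text": "2. Method", "size": 16.0},
    {"text": "We group glyphs into lines using their baselines.", "size": 10.0},
]

if __name__ == "__main__":
    tagged, anomalies = auto_tag(SAMPLE)
    for text, by_size, by_shape in anomalies:
        print(f"review: {text!r} (size rule: {by_size}, shape rule: {by_shape})")
```

On the sample, the only disagreement is the "Figure 1" caption, which is the kind of case that would prompt adding a caption tag or adjusting a rule.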

4. wrs ◴[14 May 25 04:41 UTC] No.43980893[source]▶

>>43974445 #

But if you believe in your manual heuristics enough to ship them, you must already have a body of tests that you're happy with, right?

Also seems like this is a case where generating synthetic data would be a big help. You don't have to use only real-world documents for training, just examples of the sorts of things real-world documents have in them. Make a vast corpus of semi-random documents in semi-random fonts and settings, printed from Word, Pandoc, LaTeX, etc.
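A rough sketch of the generation half of that idea, with everything here being an assumption made for illustration: the word list, the `heading`/`body` label set, the Markdown output, and the `settings` keys are all hypothetical. It only produces the document source plus ground-truth labels; rendering to PDF (for example by passing the Markdown through Pandoc or LaTeX with the chosen settings) is a separate step left out here.

```python
import random

# Hypothetical vocabulary; real synthetic data would want far more variety.
WORDS = (
    "text glyph page font line column table header footer margin "
    "layout extract parse render document"
).split()


def random_sentence(rng, min_words=5, max_words=12):
    """Build one nonsense sentence from the vocabulary."""
    count = rng.randint(min_words, max_words)
    return " ".join(rng.choice(WORDS) for _ in range(count)).capitalize() + "."


def random_document(seed, sections=3):
    """Return (markdown_source, labels, settings) for one semi-random document.

    labels[i] is the ground-truth tag for the i-th block of the source,
    where blocks are separated by blank lines.
    """
    rng = random.Random(seed)  # seeded, so each document can be regenerated
    blocks, labels = [], []
    for number in range(1, sections + 1):
        title = " ".join(rng.choice(WORDS) for _ in range(rng.randint(1, 3)))
        blocks.append(f"# {number}. {title.title()}")
        labels.append("heading")
        for _ in range(rng.randint(1, 3)):
            sentences = [random_sentence(rng) for _ in range(rng.randint(2, 4))]
            blocks.append(" ".join(sentences))
            labels.append("body")
    # Semi-random render settings to hand to whatever tool produces the PDF.
    settings = {
        "fontsize": rng.choice(["10pt", "11pt", "12pt"]),
        "columns": rng.choice([1, 2]),
    }
    return "\n\n".join(blocks), labels, settings


if __name__ == "__main__":
    source, labels, settings = random_document(seed=1)
    print(settings)
    print(source)
```

Because the labels are known by construction, the text extracted from the rendered PDFs can be scored against them without any manual annotation.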


PDF to Text, a challenging problem