←back to thread

357 points ingve | 1 comments | | HN request time: 0.22s | source
Show context
wrs ◴[] No.43974336[source]
Since these are statistical classification problems, it seems like it would be worth trying some old-school machine learning (not an LLM, just an NN) to see how it compares with these manual heuristics.
replies(1): >>43974445 #
marginalia_nu ◴[] No.43974445[source]
I imagine that would work pretty well given an adequate and representative body of annotated sample data. Though that is also not easy to come by.
replies(2): >>43975438 #>>43980893 #
1. wrs ◴[] No.43980893[source]
But if you believe in your manual heuristics enough to ship them, you must already have a body of tests that you're happy with, right?

Also seems like this is a case where generating synthetic data would be a big help. You don't have to use only real-world documents for training, just examples of the sorts of things real-world documents have in them. Make a vast corpus of semi-random documents in semi-random fonts and settings, printed from Word, Pandoc, LaTeX, etc.