Since these are statistical classification problems, it seems like it would be worth trying some old-school machine learning (not an LLM, just an NN) to see how it compares with these manual heuristics.
replies(1):
The problem of getting a representative body is (surprisingly) much harder than the annotation. I know. I spent quite some time years ago doing this.
Also seems like this is a case where generating synthetic data would be a big help. You don't have to use only real-world documents for training, just examples of the sorts of things real-world documents have in them. Make a vast corpus of semi-random documents in semi-random fonts and settings, printed from Word, Pandoc, LaTeX, etc.