
1303 points serjester | 2 comments
1. anonu No.42964875
Ingesting PDFs accurately is a noble goal which will no doubt be solved as LLMs get better. However, I need to point out that the financial statement example used in the article already has a solution: iXBRL.

Many financial regulators require you to publish heavily marked-up statements with iXBRL. This markup reveals nuances in the numbers that OCRing a post-processed table will not capture.
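To make the point about markup concrete, here is a minimal sketch of pulling tagged numbers out of an Inline XBRL fragment using only the Python standard library. The `SAMPLE` fragment, the `extract_facts` helper, and the concept and context names are made up for illustration; real filings are far larger and carry `format` attributes that govern number parsing, which this sketch sidesteps by assuming comma thousands separators. The `ix:nonFraction` element and its `name`, `contextRef`, `unitRef`, `scale`, and `sign` attributes come from the Inline XBRL 1.1 specification.

```python
import xml.etree.ElementTree as ET

# Inline XBRL 1.1 namespace.
IX_NS = "http://www.xbrl.org/2013/inlineXBRL"

# Hypothetical fragment: the page shows "1,234" and "(56)", but the tags
# say the figures are in millions of USD and that the second one is negative.
SAMPLE = """\
<div xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <table>
    <tr><td>Revenue</td>
        <td><ix:nonFraction name="us-gaap:Revenues" contextRef="FY2023"
             unitRef="USD" scale="6" decimals="-6">1,234</ix:nonFraction></td></tr>
    <tr><td>Net loss</td>
        <td>(<ix:nonFraction name="us-gaap:NetIncomeLoss" contextRef="FY2023"
             unitRef="USD" scale="6" decimals="-6" sign="-">56</ix:nonFraction>)</td></tr>
  </table>
</div>
"""


def extract_facts(xhtml: str) -> list[dict]:
    """Return one dict per ix:nonFraction element (illustrative helper)."""
    root = ET.fromstring(xhtml)
    facts = []
    for el in root.iter(f"{{{IX_NS}}}nonFraction"):
        # Assumption: displayed text is a whole number with comma separators.
        text = "".join(el.itertext()).strip().replace(",", "")
        # scale="6" means the displayed number is in millions.
        value = int(text) * 10 ** int(el.get("scale", "0"))
        # sign="-" flags a negative fact; the parentheses on the page are
        # presentation only and sit outside the tagged value.
        if el.get("sign") == "-":
            value = -value
        facts.append({
            "concept": el.get("name"),
            "context": el.get("contextRef"),
            "unit": el.get("unitRef"),
            "value": value,
        })
    return facts


if __name__ == "__main__":
    for fact in extract_facts(SAMPLE):
        print(fact)
```

An OCR pass over the rendered table would have to infer the scale from a header and the sign from the parentheses; here both are explicit attributes on the number itself.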

Of course, financial documents are a narrow subset of the problem.

Maybe the problem is with PDF as a format: unfortunately, PDFs lose that metadata when they are built from source documents.

I can't help but feel that PDFs could probably be more portable, as their acronym implies.

replies(1): >>42964996 #
2. tomrod No.42964996
Just a callout: even better, this library (still in active development) is blowing every other SEC tool I've found out of the water.

https://github.com/dgunning/edgartools