Ingesting PDFs and why Gemini 2.0 changes everything

1. twelve40 ◴[05 Feb 25 23:11 UTC] No.42956704[source]▶

Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data.

replies(6): >>42956861 #>>42956872 #>>42960357 #>>42961573 #>>42963880 #>>42967188 #

2. sconeguy ◴[05 Feb 25 23:27 UTC] No.42956861[source]▶

>>42956704 (TP) #

What would it have taken to store the plain text in some meta field in the document? Argh, so annoying.

replies(2): >>42957625 #>>42957655 #

3. shermantanktop ◴[05 Feb 25 23:28 UTC] No.42956872[source]▶

>>42956704 (TP) #

is "put this glyph at coordinate (x,y)" really what you'd call "structured"?

replies(2): >>42957640 #>>42959081 #

4. dkjaudyeqooe ◴[06 Feb 25 00:59 UTC] No.42957625[source]▶

>>42956861 #

PDF provides that capability, but editors don't produce it, probably because printing goes through OS drivers that don't support it, or PDF generators that don't support it. Or they do support it but users don't know to check that option, or turn it off because it makes PDFs too large.

replies(1): >>42958861 #

5. dkjaudyeqooe ◴[06 Feb 25 01:01 UTC] No.42957640[source]▶

>>42956872 #

He's calling PDFs unstructured: structured editors -> unstructured PDF -> structured data

6. groby_b ◴[06 Feb 25 01:04 UTC] No.42957655[source]▶

>>42956861 #

PDF supports that just fine. It's just that many PDF publishers choose not to use that.

You can lead a horse to water...

7. user_7832 ◴[06 Feb 25 03:51 UTC] No.42958861{3}[source]▶

>>42957625 #

Do you know what this field/type is called, and if any of the big names (MS/Adobe etc) support creating such PDFs?

replies(2): >>42959942 #>>42960671 #

8. irjustin ◴[06 Feb 25 04:27 UTC] No.42959081[source]▶

>>42956872 #

It's not the structure that allows meaningful understanding.

Something that was clearly a table now becomes a bunch of glyphs physically close to each other vs a group of other glyphs, which when considered as a group is a box visually separated from another group of glyphs but is actually part of a table.
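
To make the point concrete, here is a minimal sketch of what "recovering the table" means once only positioned glyphs are left. Everything in it is an assumption for illustration: the `(x, y, text)` tuple format, the helper names, and the tolerance values are hypothetical and not taken from any PDF library. Real extractors also have to deal with fonts, rotation, merged cells and much more.

```python
# Hypothetical input: one (x, y, text) tuple per glyph, with y growing upward
# as in PDF user space. Nothing here says "row" or "cell"; that has to be guessed.

def place(word, x, y, advance=6.0):
    """Lay out a word as individually positioned glyphs (illustrative helper)."""
    return [(x + i * advance, y, ch) for i, ch in enumerate(word)]

def group_into_rows(glyphs, y_tolerance=2.0):
    """Cluster glyphs whose baselines are within y_tolerance of each other."""
    rows = []
    # Top of the page first (largest y), then left to right.
    for x, y, text in sorted(glyphs, key=lambda g: (-g[1], g[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["glyphs"].append((x, text))
        else:
            rows.append({"y": y, "glyphs": [(x, text)]})
    return rows

def split_into_cells(row_glyphs, gap=15.0):
    """Start a new cell whenever the horizontal jump between glyphs exceeds gap."""
    cells, current, prev_x = [], "", None
    for x, text in sorted(row_glyphs):
        if prev_x is not None and x - prev_x > gap:
            cells.append(current)
            current = ""
        current += text
        prev_x = x
    cells.append(current)
    return cells

def reconstruct_table(glyphs):
    """Guess a table (list of rows, each a list of cell strings) from glyphs."""
    return [split_into_cells(row["glyphs"]) for row in group_into_rows(glyphs)]

# A two-by-two table, emitted in an arbitrary drawing order and with a
# slightly misaligned baseline in the second row.
glyphs = (
    place("12", 150, 686.0)
    + place("Name", 50, 700.0)
    + place("Bolt", 50, 685.5)
    + place("Qty", 150, 700.0)
)
print(reconstruct_table(glyphs))
```

The thresholds are the weak spot: a slightly wider column gap, a tighter line spacing, or a cell that wraps onto two lines, and the guess falls apart, which is why this step tends to need heavier machinery.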

9. bux93 ◴[06 Feb 25 07:19 UTC] No.42959942{4}[source]▶

>>42958861 #

OCR software like ABBYY can spit out something called a "searchable PDF", which has a text layer underneath a picture of a scan. Otherwise, PDF has 'dictionaries' with arbitrary key-value pairs in them. The "Info" dictionary has some specific metadata fields like Author, and a "Font" dictionary embeds fonts, but you're free to use those dictionaries for whatever. There's also a standard to embed 'Dublin Core', rights management and custom metadata called XMP. Files can be embedded. You can also use comments, as PDF keeps PostScript's %-comment syntax. When a PDF gets converted to PDF/A (by archiving software) or flattened/optimized, most of these are likely to be lost.
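
As a rough illustration of the Info dictionary idea, the sketch below writes a tiny one-page PDF by hand, using only the Python standard library, with a custom key next to Author, and then reads it back. The key name `PlainText` and both function names are hypothetical, and the reader is a deliberately naive regex scan rather than a real PDF parser: it only understands the exact layout the writer produces and assumes ASCII values.

```python
import re

def build_pdf_with_info(info: dict[str, str]) -> bytes:
    """Build a minimal single-page PDF whose Info dictionary holds `info`."""
    def pdf_string(s: str) -> str:
        # Literal strings are parenthesised; escape backslashes and parentheses.
        escaped = s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return f"({escaped})"

    info_entries = " ".join(f"/{key} {pdf_string(val)}" for key, val in info.items())
    objects = [
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
        f"<< {info_entries} >>",  # the Info dictionary, object 4
    ]

    out = bytearray(b"%PDF-1.4\n% a line starting with a percent sign is a comment\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object, needed by xref
        out += f"{num} 0 obj\n{body}\nendobj\n".encode("latin-1")

    xref_pos = len(out)
    size = len(objects) + 1
    out += f"xref\n0 {size}\n".encode()
    out += b"0000000000 65535 f \n"  # entry for the reserved object 0
    for off in offsets:
        out += f"{off:010d} 00000 n \n".encode()
    out += (
        f"trailer\n<< /Size {size} /Root 1 0 R /Info {len(objects)} 0 R >>\n"
        f"startxref\n{xref_pos}\n%%EOF\n"
    ).encode()
    return bytes(out)

def read_info_naively(pdf: bytes) -> dict[str, str]:
    """Regex-scan the Info dictionary; only handles files built above."""
    text = pdf.decode("latin-1")
    ref = re.search(r"/Info\s+(\d+)\s+0\s+R", text)
    if ref is None:
        return {}
    obj = re.search(rf"(?m)^{ref.group(1)} 0 obj\n<<(.*?)>>\nendobj", text, re.S)
    if obj is None:
        return {}
    pairs = re.findall(r"/(\w+)\s*\(((?:\\.|[^\\()])*)\)", obj.group(1))
    # Undo the backslash escaping applied by pdf_string.
    return {key: re.sub(r"\\(.)", r"\1", val) for key, val in pairs}

pdf_bytes = build_pdf_with_info({"Author": "example", "PlainText": "Total (USD): 42"})
print(read_info_naively(pdf_bytes))
```

Nothing obliges a viewer or extractor to look at a non-standard key like this, and as noted above, a PDF/A conversion or an optimizer pass may well drop it.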

10. nitwit005 ◴[06 Feb 25 08:39 UTC] No.42960357[source]▶

>>42956704 (TP) #

People kind of dump whatever in pdf files, so I don't think a cleaner file format would do as much as you might think.

Digital fax services will generate pdf files, for example. They're just image data dumped into a pdf. Various scanners will also do so.

11. dkjaudyeqooe ◴[06 Feb 25 09:31 UTC] No.42960671{4}[source]▶

>>42958861 #

I believe it's a "hybrid PDF" but I'm not sure if there's a further standard for merely embedding text.

https://stackoverflow.com/questions/67358370/what-the-standa...

12. surfingdino ◴[06 Feb 25 12:02 UTC] No.42961573[source]▶

>>42956704 (TP) #

In my experience AWS Textract does a pretty good job without using LLMs.

13. shermantanktop ◴[06 Feb 25 16:19 UTC] No.42963880[source]▶

>>42956704 (TP) #

PDFs began as just PostScript commands stored in a file. It’s a genius hack in a way that has become a Frankenstein’s monster.

14. Bluestein ◴[06 Feb 25 22:33 UTC] No.42967188[source]▶

>>42956704 (TP) #

... and calls it "portable", to boot.