PDF to Text, a challenging problem

1. andrethegiant ◴[13 May 25 16:05 UTC] No.43974436[source]▶

Cloudflare’s ai.toMarkdown() function available in Workers AI can handle PDFs pretty easily. Judging from speed alone, it seems they’re parsing the actual content rather than shoving it into an OCR/LLM pipeline.

Shameless plug: I use this under the hood when you prefix any PDF URL with https://pure.md/ to convert to raw text.
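A minimal sketch of the URL-prefixing idea in Python. The only thing taken from the comment is the `https://pure.md/` prefix; the helper names, the User-Agent string, and the assumption that the service answers with a UTF-8 text body and needs no authentication are illustrative guesses, not documented behavior.

```python
from urllib.request import Request, urlopen

PURE_MD_PREFIX = "https://pure.md/"


def pure_md_url(pdf_url: str) -> str:
    """Prefix a PDF URL with the pure.md host, as described in the comment."""
    return PURE_MD_PREFIX + pdf_url


def fetch_text(pdf_url: str, timeout: float = 30.0) -> str:
    """Fetch the converted text (assumes a plain UTF-8 response body)."""
    request = Request(
        pure_md_url(pdf_url),
        headers={"User-Agent": "pdf-to-text-example"},  # hypothetical UA string
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `pure_md_url("https://arxiv.org/pdf/2411.12104")` builds the same kind of URL that appears later in this thread.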

replies(4): >>43974514 #>>43974535 #>>43974602 #>>43975027 #

2. burkaman ◴[13 May 25 16:12 UTC] No.43974514[source]▶

>>43974436 (TP) #

If you're looking for test cases, this is the first thing I tried and the result is very bad: https://pure.md/https://docs.house.gov/meetings/IF/IF00/2025...

replies(2): >>43974625 #>>43974927 #

3. _boffin_ ◴[13 May 25 16:14 UTC] No.43974535[source]▶

>>43974436 (TP) #

You’re aware that PDFs are containers that can hold various formats, which can be interlaced in different ways, such as on top, throughout, or in unexpected and unspecified ways that aren’t “parsable,” right?

I would wager that they’re using OCR/LLM in their pipeline.

replies(1): >>43974640 #

4. cpursley ◴[13 May 25 16:20 UTC] No.43974602[source]▶

>>43974436 (TP) #

How does their function do on complex data tables, charts, and that sort of stuff?

5. andrethegiant ◴[13 May 25 16:22 UTC] No.43974625[source]▶

>>43974514 #

Apart from lacking newlines, how is the result bad? It extracts the text for easy piping into an LLM.

replies(1): >>43974913 #

6. andrethegiant ◴[13 May 25 16:23 UTC] No.43974640[source]▶

>>43974535 #

Could be. But their pricing for the conversion is free, which leads me to believe LLMs are not involved.

7. burkaman ◴[13 May 25 16:46 UTC] No.43974913{3}[source]▶

>>43974625 #

- Most of the titles have incorrectly split words, for example "P ART 2—R EPEAL OF EPA R ULE R ELATING TO M ULTI -P OLLUTANT E MISSION S TANDARDS". I know LLMs are resilient against typos and mistakes like this, but it still seems not ideal.

- The header is parsed in a way that I suspect would mislead an LLM: "BRETT GUTHRIE, KENTUCKY FRANK PALLONE, JR., NEW JERSEY CHAIRMAN RANKING MEMBER ONE HUNDRED NINETEENTH CONGRESS". Guthrie is the chairman and Pallone is the ranking member, but that isn't implied in the text. In this particular case an LLM might already know that from other sources, but in more obscure contexts it will just have to rely on the parsed text.

- It isn't converted into Markdown at all; the structure is completely lost. If you only care about text then I guess that's fine, and in this case an LLM might do an ok job at identifying some of the headers, but in the context of this discussion I think ai.toMarkdown() did a bad job of converting to Markdown and a just ok job of converting to text.
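The split words in the first bullet look like small-caps text where the large initial letter and the rest of the word were extracted as separate tokens. One possible post-processing heuristic, sketched under that assumption, is to glue a lone capital letter back onto the all-caps run that follows it. This is not what ai.toMarkdown() or pure.md does; the function name is hypothetical, and the rule will wrongly join genuine one-letter words such as "A" or "I" in all-caps text.

```python
import re

# A lone capital letter, a space, then an all-caps run of two or more letters.
_SPLIT_CAP = re.compile(r"\b([A-Z]) ([A-Z]{2,})\b")
# A stray space left in front of a hyphen between two capitalized fragments.
_SPACED_HYPHEN = re.compile(r"(?<=[A-Z]) -(?=[A-Z])")


def rejoin_small_caps(text: str) -> str:
    """Heuristically undo 'P ART' style splits from small-caps extraction."""
    joined = _SPLIT_CAP.sub(r"\1\2", text)
    return _SPACED_HYPHEN.sub("-", joined)


title = "P ART 2—R EPEAL OF EPA R ULE R ELATING TO M ULTI -P OLLUTANT E MISSION S TANDARDS"
print(rejoin_small_caps(title))
# PART 2—REPEAL OF EPA RULE RELATING TO MULTI-POLLUTANT EMISSION STANDARDS
```

A more robust fix would use the font size of each glyph run at extraction time rather than patching the text afterwards, but that information is gone once only plain text is available.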

I would have considered this a fairly easy test case, so it would make me hesitant to trust that function for general use if I were trying to solve the challenges described in the submitted article (Identifying headings, Joining consecutive headings, Identifying Paragraphs).

I see that you are trying to minimize tokens for LLM input, so I realize your goals are probably not the same as what I'm talking about.

Edit: Another test case: it seems to crash on any Arxiv PDF. Example: https://pure.md/https://arxiv.org/pdf/2411.12104.

replies(1): >>43976034 #

8. marginalia_nu ◴[13 May 25 16:47 UTC] No.43974927[source]▶

>>43974514 #

That PDF actually has some weird corner cases.

First, it's all the same font size everywhere. It's also got bolded "headings" with spaces that are not bolded. I had to fix my own handling to get it to process well.

This is the search engine's view of the document as of those fixes: https://www.marginalia.nu/junk/congress.html

Still far from perfect...
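To illustrate the bolded-heading corner case described above: when font size cannot separate headings from body text, bold runs are one remaining signal, but unbolded spaces between bold words break a heading into fragments. The sketch below merges such fragments. It is not the search engine's actual code; the `Span` type and `merge_bold_runs` name are hypothetical, and it assumes the extractor yields text spans with a bold flag in reading order.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    bold: bool


def merge_bold_runs(spans: list[Span]) -> list[Span]:
    """Merge bold spans, absorbing whitespace-only gaps that sit between them."""
    merged: list[Span] = []
    for i, span in enumerate(spans):
        prev_is_bold = bool(merged) and merged[-1].bold
        next_is_bold = i + 1 < len(spans) and spans[i + 1].bold
        # An unbolded, whitespace-only span between two bold spans is a gap.
        is_gap = (
            not span.bold
            and span.text.strip() == ""
            and prev_is_bold
            and next_is_bold
        )
        if prev_is_bold and (span.bold or is_gap):
            merged[-1] = Span(merged[-1].text + span.text, True)
        else:
            merged.append(Span(span.text, span.bold))
    return merged


spans = [Span("PART", True), Span(" ", False), Span("2", True), Span(" body text", False)]
print(merge_bold_runs(spans))
# [Span(text='PART 2', bold=True), Span(text=' body text', bold=False)]
```

After merging, each remaining bold span can be treated as a heading candidate and checked with other cues such as line position or length.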

replies(1): >>43975769 #

9. bambax ◴[13 May 25 16:55 UTC] No.43975027[source]▶

>>43974436 (TP) #

It doesn't seem to handle multi-column PDFs well?

10. mdaniel ◴[13 May 25 17:56 UTC] No.43975769{3}[source]▶

>>43974927 #

> That PDF actually has some weird corner cases.

Heh, in my experience with PDFs that's a tautology

11. andrethegiant ◴[13 May 25 18:22 UTC] No.43976034{4}[source]▶

>>43974913 #

> it seems to crash on any Arxiv PDF

Fixed, thanks for reporting :-)