357 points ingve | 1 comments
andrethegiant ◴[] No.43974436[source]
Cloudflare’s ai.toMarkdown() function, available in Workers AI, can handle PDFs pretty easily. Judging from speed alone, it seems they’re parsing the actual content rather than shoving it into OCR or an LLM.

Shameless plug: I use this under the hood when you prefix any PDF URL with https://pure.md/ to convert to raw text.
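
For anyone scripting this, the prefixing described above can be sketched as follows. This is a minimal sketch that only builds the URL: the helper name `pure_md_url` and the example.com address are made up for illustration, and nothing here covers whatever rate limits or auth the service may have.

```python
PURE_MD_PREFIX = "https://pure.md/"


def pure_md_url(pdf_url: str) -> str:
    """Return the pure.md address for a document URL.

    The original URL is appended whole, scheme included, so the result
    looks like https://pure.md/https://... (hypothetical helper, not
    part of any pure.md client library).
    """
    return PURE_MD_PREFIX + pdf_url
```

Fetching the resulting URL with any HTTP client should then give back the converted text, e.g. `pure_md_url("https://example.com/report.pdf")` for a placeholder PDF.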

replies(4): >>43974514 #>>43974535 #>>43974602 #>>43975027 #
burkaman ◴[] No.43974514[source]
If you're looking for test cases, this is the first thing I tried and the result is very bad: https://pure.md/https://docs.house.gov/meetings/IF/IF00/2025...
replies(2): >>43974625 #>>43974927 #
marginalia_nu ◴[] No.43974927[source]
That PDF actually has some weird corner cases.

First, it's all the same font size everywhere, so size can't be used to tell headings from body text. It's also got bolded "headings" where the spaces between the words are not bolded. I had to fix my own handling to get it to process well.
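
To illustrate the bold-with-unbolded-spaces case, here is a rough sketch of one way to handle it. It is not the actual fix in the search engine; the `Span` type and `is_bold_heading` are hypothetical names, and it assumes the extractor already yields per-line text runs with a bold flag.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Span:
    """One run of text on a line, with the bold flag of its font (assumed input shape)."""
    text: str
    bold: bool


def is_bold_heading(spans: list[Span]) -> bool:
    """Treat a line as a heading candidate if every span with visible text is bold.

    Whitespace-only spans are skipped, so an unbolded space between
    bolded words does not break the heading apart.
    """
    visible = [s for s in spans if s.text.strip()]
    return bool(visible) and all(s.bold for s in visible)


def line_text(spans: list[Span]) -> str:
    """Join the spans back into the line's plain text."""
    return "".join(s.text for s in spans)
```

With this, a line like `[Span("SECTION", True), Span(" ", False), Span("1.", True)]` counts as one bold heading, while a line that mixes bold and regular words does not.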

This is the search engine's view of the document as of those fixes: https://www.marginalia.nu/junk/congress.html

Still far from perfect...

replies(1): >>43975769 #
1. mdaniel ◴[] No.43975769[source]
> That PDF actually has some weird corner cases.

Heh, in my experience with PDFs that's a tautology