(www.marginalia.nu)

357 points ingve | 1 comments | 13 May 25 15:01 UTC

andrethegiant ◴[13 May 25 16:05 UTC] No.43974436[source]▶

Cloudflare’s ai.toMarkdown() function available in Workers AI can handle PDFs pretty easily. Judging from speed alone, it seems they’re parsing the actual content rather than shoving it into an OCR/LLM pipeline.

Shameless plug: I use this under the hood when you prefix any PDF URL with https://pure.md/ to convert to raw text.
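To make the prefixing concrete, here is a minimal sketch of how such a URL could be built. The helper name `pure_md_url` and the example.com address are hypothetical; the only thing taken from the comment is that the full document URL, scheme included, goes directly after https://pure.md/. That http:// sources are accepted is an assumption.

```python
PURE_MD_PREFIX = "https://pure.md/"


def pure_md_url(pdf_url: str) -> str:
    """Build a pure.md URL by prefixing the full document URL (hypothetical helper)."""
    # The prefix is followed by the complete URL, scheme and all, so a
    # relative or scheme-less address is rejected here.
    # Assumption: http:// sources work as well as https:// ones.
    if not pdf_url.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    return PURE_MD_PREFIX + pdf_url


# Hypothetical document address, for illustration only.
print(pure_md_url("https://example.com/report.pdf"))
```

The resulting address can then be fetched with any HTTP client to get the text version.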

replies(4): >>43974514 #>>43974535 #>>43974602 #>>43975027 #

burkaman ◴[13 May 25 16:12 UTC] No.43974514[source]▶

>>43974436 #

If you're looking for test cases, this is the first thing I tried and the result is very bad: https://pure.md/https://docs.house.gov/meetings/IF/IF00/2025...

replies(2): >>43974625 #>>43974927 #

marginalia_nu ◴[13 May 25 16:47 UTC] No.43974927[source]▶

>>43974514 #

That PDF actually has some weird corner cases.

First, it's all the same font size everywhere. It's also got bolded "headings" with spaces that are not bolded. I had to fix my own handling to get it to process well.

This is the search engine's view of the document as of those fixes: https://www.marginalia.nu/junk/congress.html

Still far from perfect...
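For readers wondering what the bold-heading corner case looks like in practice, here is a rough sketch of one way to handle it. This is not the search engine's actual code; the `Span` type, the function names, and the rule itself are assumptions made for illustration. The idea: when font size gives no signal, treat a line as a heading if it is bold throughout, after absorbing any whitespace-only non-bold span that sits directly between two bold spans.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A run of text with one style (hypothetical stand-in for extractor output)."""
    text: str
    bold: bool


def merge_bold_runs(spans: list[Span]) -> list[Span]:
    """Mark whitespace-only spans directly between two bold spans as bold,
    then merge adjacent spans that share the same style."""
    patched = []
    for i, span in enumerate(spans):
        bold = span.bold
        if not bold and span.text.strip() == "":
            prev_bold = i > 0 and spans[i - 1].bold
            next_bold = i + 1 < len(spans) and spans[i + 1].bold
            bold = prev_bold and next_bold
        patched.append(Span(span.text, bold))

    merged: list[Span] = []
    for span in patched:
        if merged and merged[-1].bold == span.bold:
            merged[-1] = Span(merged[-1].text + span.text, span.bold)
        else:
            merged.append(span)
    return merged


def is_heading_line(spans: list[Span]) -> bool:
    """A line counts as a heading candidate if it collapses to one bold run."""
    merged = merge_bold_runs(spans)
    return len(merged) == 1 and merged[0].bold and merged[0].text.strip() != ""


# Bold words separated by a non-bold space still read as one heading.
line = [Span("SECTION", True), Span(" ", False), Span("1.", True)]
print(is_heading_line(line))
```

A line of body text that merely contains a bold word does not collapse to a single bold run, so it is left alone.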

replies(1): >>43975769 #

1. mdaniel ◴[13 May 25 17:56 UTC] No.43975769[source]▶

>>43974927 #

> That PDF actually has some weird corner cases.

Heh, in my experience with PDFs that's a tautology

PDF to Text, a challenging problem