PDF to Text, a challenging problem

(www.marginalia.nu)

357 points ingve | 1 comment | 13 May 25 15:01 UTC

dwheeler ◴[13 May 25 16:22 UTC] No.43974621[source]▶

The better solution is to embed, in the PDF, the editable source document. This is easily done by LibreOffice. Embedding it takes very little space in general (because it compresses well), and then you have MUCH better information on what the text is and its meaning. It works just fine with existing PDF readers.

replies(5): >>43974667 #>>43974983 #>>43975217 #>>43975401 #>>43976216 #

kerkeslager ◴[13 May 25 16:51 UTC] No.43974983[source]▶

>>43974621 #

That's true, but it's dependent on the creator of the PDF having aligned incentives with the consumer of the PDF.

In the e-Discovery field, it's commonplace for those providing evidence to dump it into a PDF purely so that it's harder for the opposing side's lawyers to consume. If both sides have lots of money this isn't a barrier, but public defenders, for example, don't have funds to hire someone (me!) to process the PDFs into a readable format, so realistically they end up taking much longer to process the data, which takes a psychological toll on the defendant. And that's if they process the data at all.

The solution is to make it illegal to do this: wiretap data, for example, should be provided in a standardized machine-readable format. There's no ethical reason for simple technical friction to be affecting the outcomes of criminal proceedings.

replies(2): >>43975362 #>>43979486 #

giovannibonetti ◴[13 May 25 17:21 UTC] No.43975362[source]▶

>>43974983 #

I wonder if AI will solve that.

replies(1): >>43975985 #

GaggiX ◴[13 May 25 18:17 UTC] No.43975985[source]▶

>>43975362 #

There are specialized models, but even generic ones like Gemini 2.0 Flash are really good and cheap. You can use them to OCR the document and embed the resulting text inside the PDF, so the original content can be indexed.

replies(1): >>43976339 #

kerkeslager ◴[13 May 25 18:53 UTC] No.43976339[source]▶

>>43975985 #

This fundamentally misunderstands the problem. Effective OCR predates the popularity of ChatGPT, and e-Discovery folks were already using it--AI in the modern sense adds nothing to this. Indexing the resulting text was also already possible--again, AI adds nothing. The problem is that the resultant text lacks structure: being able to sort/filter wiretap data by date/location, for example, isn't inherently possible just because you've obtained the text or indexed it. Off-the-shelf models simply aren't accurate enough to solve this problem, even if you can get around the legal problems of feeding potentially sensitive information into a model. AI models trained on a large enough domain-specific dataset might work, but there are a lot of subdomains--wiretap data, cell phone GPS data, credit card data, email metadata, etc.--which would each require model training.

Fundamentally, the solution to this problem is to not create it in the first place. There's no reason for there to be a structured data -> PDF -> AI -> structured data pipeline when we can just force people providing evidence to provide the structured data.
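
To make the point about structure concrete, here is a minimal sketch of what sort/filter by date/location looks like when the data arrives in a machine-readable form. Everything specific in it is an assumption for illustration: the CSV layout, the column names, the sample rows, and the helper names `load_records` and `calls_at` are all hypothetical and do not describe any real evidence format.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical export: column names and values are invented for illustration.
RAW_CSV = """timestamp,location,caller,callee,duration_s
2024-03-02T14:05:00,Springfield,555-0101,555-0199,62
2024-03-01T09:30:00,Shelbyville,555-0101,555-0142,310
2024-03-02T08:15:00,Springfield,555-0177,555-0101,45
"""


def load_records(text):
    """Parse CSV text into a list of dicts with typed fields."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        row["timestamp"] = datetime.fromisoformat(row["timestamp"])
        row["duration_s"] = int(row["duration_s"])
        records.append(row)
    return records


def calls_at(records, location, day):
    """Return the calls at `location` on calendar date `day`, earliest first."""
    hits = [
        r for r in records
        if r["location"] == location and r["timestamp"].date() == day
    ]
    return sorted(hits, key=lambda r: r["timestamp"])


records = load_records(RAW_CSV)
springfield = calls_at(records, "Springfield", date(2024, 3, 2))
for r in springfield:
    print(r["timestamp"].isoformat(), r["caller"], "->", r["callee"])
```

With typed fields, the filter and the sort are each one line. The same question asked of OCR'd page text would first require recovering which characters are the timestamp and which are the location, and that recovery step is where the accuracy problem lives.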
