PDF to Text, a challenging problem

1. dwheeler ◴[13 May 25 16:22 UTC] No.43974621[source]▶

The better solution is to embed, in the PDF, the editable source document. This is easily done by LibreOffice. Embedding it takes very little space in general (because it compresses well), and then you have MUCH better information on what the text is and its meaning. It works just fine with existing PDF readers.

replies(5): >>43974667 #>>43974983 #>>43975217 #>>43975401 #>>43976216 #

2. layer8 ◴[13 May 25 16:26 UTC] No.43974667[source]▶

>>43974621 (TP) #

That’s true, but it also opens up the vulnerability of the source document being arbitrarily different from the rendered PDF content.

3. kerkeslager ◴[13 May 25 16:51 UTC] No.43974983[source]▶

>>43974621 (TP) #

That's true, but it's dependent on the creator of the PDF having aligned incentives with the consumer of the PDF.

In the e-Discovery field, it's commonplace for those providing evidence to dump it into a PDF purely so that it's harder for the opposing side's lawyers to consume. If both sides have lots of money this isn't a barrier, but for example public defenders don't have funds to hire someone (me!) to process the PDFs into a readable format, so realistically they end up taking much longer to process the data, which takes a psychological toll on the defendant. And that's if they process the data at all.

The solution is to make it illegal to do this: wiretap data, for example, should be provided in a standardized machine-readable format. There's no ethical reason for simple technical friction to be affecting the outcomes of criminal proceedings.

replies(2): >>43975362 #>>43979486 #

4. carabiner ◴[13 May 25 17:09 UTC] No.43975217[source]▶

>>43974621 (TP) #

I bet 90% of the problem space is legacy PDFs. My company has thousands of these. Some are crappy scans. Some have Adobe's OCR embedded, but most have none at all.

replies(1): >>43977501 #

5. giovannibonetti ◴[13 May 25 17:21 UTC] No.43975362[source]▶

>>43974983 #

I wonder if AI will solve that.

replies(1): >>43975985 #

6. lelandfe ◴[13 May 25 17:25 UTC] No.43975401[source]▶

>>43974621 (TP) #

The better solution to a search engine extracting text from existing PDFs is to provide advice on how to author PDFs?

What's the timeline for this solution to pay off?

replies(1): >>43976378 #

7. GaggiX ◴[13 May 25 18:17 UTC] No.43975985{3}[source]▶

>>43975362 #

There are specialized models, but even generic ones like Gemini 2.0 Flash are really good and cheap. You can use them to OCR the pages and embed the resulting text inside the PDF, so the original content can be indexed.

replies(1): >>43976339 #

8. yxhuvud ◴[13 May 25 18:41 UTC] No.43976216[source]▶

>>43974621 (TP) #

Sure, and if you have access to the source document the PDF was generated from, then that is a good thing to do.

But generally speaking, you don't have that control.

9. kerkeslager ◴[13 May 25 18:53 UTC] No.43976339{4}[source]▶

>>43975985 #

This fundamentally misunderstands the problem. Effective OCR predates the popularity of ChatGPT, and e-Discovery folks were already using it; AI in the modern sense adds nothing to this. Indexing the resulting text was also already possible; again, AI adds nothing. The problem is that the resultant text lacks structure: being able to sort/filter wiretap data by date/location, for example, isn't inherently possible just because you've obtained text or indexed it. AI accuracy simply isn't high enough to solve this problem without specialized training, even if you can get around the legal problems of feeding potentially sensitive information into a model. Models trained on a large enough domain-specific dataset might work, but the existing off-the-shelf models are not accurate enough. And there are a lot of subdomains (wiretap data, cell phone GPS data, credit card data, email metadata, etc.), each of which would require its own model training.

Fundamentally, the solution to this problem is to not create it in the first place. There's no reason for there to be a structured data -> PDF -> AI -> structured data pipeline when we can just force people providing evidence to provide the structured data.
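To illustrate the difference between searchable text and structured data, here is a minimal Python sketch. The column names (`timestamp`, `location`, `number`) and the sample rows are invented for illustration and do not reflect any real wiretap export format; the point is only that sorting and filtering become trivial once the fields exist as fields.

```python
import csv
import io
from datetime import datetime

# Hypothetical structured export; column names and rows are made up.
STRUCTURED = """timestamp,location,number
2025-03-02T14:05:00,Chicago,555-0101
2025-01-17T09:30:00,Springfield,555-0199
2025-02-08T22:45:00,Chicago,555-0142
"""

def load_records(text):
    """Parse CSV text into dicts, converting the timestamp to a datetime."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        row["timestamp"] = datetime.fromisoformat(row["timestamp"])
        records.append(row)
    return records

def filter_and_sort(records, location):
    """Keep records for one location, ordered by time."""
    matching = [r for r in records if r["location"] == location]
    return sorted(matching, key=lambda r: r["timestamp"])

records = load_records(STRUCTURED)
chicago = filter_and_sort(records, "Chicago")
for r in chicago:
    print(r["timestamp"].isoformat(), r["number"])
```

With the same rows flattened into a PDF page, the date, location, and number would first have to be recovered from running text before either function could be applied.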

10. chaps ◴[13 May 25 18:58 UTC] No.43976378[source]▶

>>43975401 #

Microsoft is one of the bigger contributors to this. Like -- why does Excel have a feature to export to PDF, but not a feature to do the opposite? That export functionality really feels like it was given to a summer intern who finished it in two weeks and never had to deal with it ever again.

replies(2): >>43978047 #>>43980433 #

11. ◴[13 May 25 20:44 UTC] No.43977501[source]▶

>>43975217 #

12. mattigames ◴[13 May 25 21:40 UTC] No.43978047{3}[source]▶

>>43976378 #

Because then we would have 2 formats: "PDFs generated by Excel" and "real PDFs" with the same extension, and that would be its own can of worms for Microsoft and for everyone else.

replies(1): >>43986958 #

13. lurk2 ◴[14 May 25 00:38 UTC] No.43979486[source]▶

>>43974983 #

> The solution is to make it illegal to do this: wiretap data, for example, should be provided in a standardized machine-readable format. There's no ethical reason for simple technical friction to be affecting the outcomes of criminal proceedings.

I can’t speak to wiretaps specifically, but when it comes to the legal field, this is usually already how it operates. GDPR, for example, makes specific provisions that user data must be provided in an accessible, machine-readable format. Most jurisdictions also aren’t going to look kindly on physical document dumping and will require that documents be provided in a machine-readable format. PDF is the legal industry standard for all outbound files. The consistency of its formatting makes up for the difficulties involved with machine-readability.

There’s not a huge incentive to find an alternative because most firms will just charge a markup on the time a clerk spends reading through and transcribing those PDFs. If cost is a concern, though, most jurisdictions will require the party in possession of the original documents to provide them in a machine-readable format (e.g. providing bank records as Excel spreadsheets rather than as PDFs).

replies(1): >>43979952 #

14. kerkeslager ◴[14 May 25 02:02 UTC] No.43979952{3}[source]▶

>>43979486 #

I'm not sure I understand what you're saying? PDF isn't a machine-readable format for most kinds of data and keeping inherent court costs down is always a concern because it keeps the courts fair to the poor.

replies(1): >>43998316 #

15. bartread ◴[14 May 25 03:24 UTC] No.43980433{3}[source]▶

>>43976378 #

It does have a feature to do the opposite. You can, in theory, extract tabular data from PDFs with Excel (note: only on the Windows version; this function isn’t available in macOS Excel).

In practice I’ve found it to be extremely unreliable, and I suspect this may be because the optional metadata that semantically defines a table as a table is missing from the errant PDF. It’ll still look like a table when rendered, but there’s nothing that defines it as such. It’s just a bunch of graphical and text elements that, when rendered, happen to look like a table.
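As a rough sketch of what "just a bunch of text elements" means for an extractor, the Python below rebuilds rows from positioned fragments by grouping on the y coordinate. The `(x, y, text)` tuples, the sample values, and the `y_tolerance` parameter are assumptions made for illustration; this is not how Excel's importer works, only one common heuristic.

```python
# Hypothetical fragments as (x, y, text); a real extractor would supply these.
FRAGMENTS = [
    (200.0, 700.2, "Amount"),
    (50.0, 700.0, "Date"),
    (50.0, 685.1, "2025-01-17"),
    (200.0, 685.0, "42.00"),
    (200.0, 670.0, "13.50"),
    (50.0, 670.3, "2025-02-08"),
]

def group_into_rows(fragments, y_tolerance=2.0):
    """Cluster fragments with nearby y coordinates into rows, top to bottom."""
    rows = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Order each row left to right and drop the coordinates.
    return [[text for _, text in sorted(cells)] for _, cells in rows]

table = group_into_rows(FRAGMENTS)
for row in table:
    print(row)
```

The heuristic is fragile in exactly the way described: a cell whose text wraps onto a second line lands at a different y coordinate and is treated as a separate row.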

replies(1): >>43997734 #

16. chaps ◴[14 May 25 17:15 UTC] No.43986958{4}[source]▶

>>43978047 #

Hah, no. We would be going from 200,000 formats to 200,001 formats. Begone, shallow xkcd references!

17. chaps ◴[15 May 25 18:16 UTC] No.43997734{4}[source]▶

>>43980433 #

Yeah. The "extremely unreliable" part of that is the stinker. Some of the exports I get through FOIA are thousands and thousands of pages, so the unreliability compounds really quickly. It's frustrating, because there are many things Microsoft could do with PDFs to make that a non-problem. But it's consistently been a naive implementation that doesn't consider newlines.

18. lurk2 ◴[15 May 25 19:14 UTC] No.43998316{4}[source]▶

>>43979952 #

I’m saying that most jurisdictions likely already do require data to be machine-readable, but when you run into PDFs, it isn’t a document dump (which courts don’t look kindly upon), but is instead a product of equal parts convention and motivated laziness.

replies(1): >>44001971 #

19. kerkeslager ◴[16 May 25 04:58 UTC] No.44001971{5}[source]▶

>>43998316 #

You're saying two mutually exclusive things. Either it's required to be machine readable or it's PDF: it can't be both.