
357 points ingve | 1 comments
rad_gruchalski ◴[] No.43974057[source]
So many of these problems have been solved by mozilla pdf.js together with its viewer implementation: https://mozilla.github.io/pdf.js/.
replies(3): >>43974240 #>>43974428 #>>43975184 #
egnehots ◴[] No.43974428[source]
I don't think so. pdf.js is able to render a PDF's content.

That is different from extracting "text". Text in a PDF can be encoded in many ways: in an actual image, in shapes (think segments, quadratic Bézier curves...), or in an XML format (really easy to process).

PDF viewers are able to render text the way a printer would, processing commands that end up as pixels on the screen.

But paragraphs, text layout, columns, and tables are often lost in the process, even though you can see them: so close, yet so far. That is why AI is quite strong at this task.

replies(2): >>43974734 #>>43975239 #
1. rad_gruchalski ◴[] No.43975239[source]
You are wrong. Pdf.js can extract text and has all the facilities required to render and extract formatting. The latest version can also edit PDF files. It’s basically the same engine as the Firefox PDF viewer, which also has a document outline, search, linking, print preview, scaling, a scripting sandbox… It does not simply "render" a file.
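
To make the extraction point concrete, here is a minimal sketch of the kind of post-processing involved, not pdf.js's own code. It assumes text items shaped like the ones pdf.js returns from `page.getTextContent()` (a `str` plus a six-number `transform` whose last two entries are the x and y position), and it assumes the default PDF coordinate space where y grows upward. The `TextItemLike` type, the `groupIntoLines` helper, and the `yTolerance` value are hypothetical names chosen for illustration.

```typescript
// Only the fields of a pdf.js-style text item that this sketch relies on.
interface TextItemLike {
  str: string;
  // Matrix [a, b, c, d, e, f]; e is the x position and f is the y position.
  transform: number[];
}

// Hypothetical helper: bucket items by baseline y, then order each bucket by x.
function groupIntoLines(items: TextItemLike[], yTolerance = 2): string[] {
  const lines: { y: number; items: TextItemLike[] }[] = [];
  for (const item of items) {
    const y = item.transform[5];
    // Reuse an existing line if its baseline is within the tolerance.
    const line = lines.find((l) => Math.abs(l.y - y) <= yTolerance);
    if (line) {
      line.items.push(item);
    } else {
      lines.push({ y, items: [item] });
    }
  }
  // Assumption: y grows upward, so a larger y is higher on the page.
  lines.sort((a, b) => b.y - a.y);
  return lines.map((l) =>
    l.items
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map((i) => i.str)
      .join(" ")
  );
}

// Items deliberately out of reading order, as they can be in a content stream.
const sample: TextItemLike[] = [
  { str: "world", transform: [1, 0, 0, 1, 60, 700] },
  { str: "Second", transform: [1, 0, 0, 1, 10, 680] },
  { str: "Hello", transform: [1, 0, 0, 1, 10, 700.5] },
  { str: "line", transform: [1, 0, 0, 1, 55, 680] },
];

console.log(groupIntoLines(sample)); // [ 'Hello world', 'Second line' ]
```

This only recovers lines. Columns, paragraphs, and tables need further heuristics on top of the same positions, which is where the disagreement in this thread sits.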

Regarding tables, https://www.npmjs.com/package/pdf-table-extractor does a very good job of table interpretation and works on top of pdf.js.

I also didn’t say what works better or worse, nor did I go into whether PDF is good or bad.

I simply said that a ton of problems have been covered by