
1303 points serjester | 2 comments
scottydelta No.42953493
This is what I am trying to figure out how to solve.

My problem statement is:

- Ingest PDFs, summarize them, and extract important information.

- Have some way to overlay the extracted information on the PDF in the UI.

- User can provide feedback on the overlaid info by accepting or rejecting the highlights as useful or not.

- This feedback goes back into the model for reinforcement learning.

Hoping to find something that can make this more manageable.

replies(2): >>42953630 #>>42953907 #
1. cccybernetic No.42953907
Most PDF parsers give you coordinate data (bounding boxes) for extracted text. Use these to draw highlights over your PDF viewer - users can then click the highlights to verify if the extraction was correct.
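To make the highlight step concrete, here's a minimal sketch in plain Python. It assumes word-level boxes in the dict shape pdfplumber's `page.extract_words()` produces (`text`, `x0`, `x1`, `top`, `bottom`); the coordinates below are made up for illustration, not from a real PDF.

```python
# Sketch: merge per-word bounding boxes into one highlight rectangle
# for a matched extraction. Word dicts mimic pdfplumber's
# page.extract_words() output; coordinates are illustrative.

def merge_boxes(word_boxes):
    """Union of word-level boxes -> one highlight rect (x0, top, x1, bottom)."""
    x0 = min(b["x0"] for b in word_boxes)
    top = min(b["top"] for b in word_boxes)
    x1 = max(b["x1"] for b in word_boxes)
    bottom = max(b["bottom"] for b in word_boxes)
    return (x0, top, x1, bottom)

def boxes_for_phrase(words, phrase):
    """Find the consecutive word run whose text equals `phrase`; return its boxes."""
    tokens = phrase.split()
    texts = [w["text"] for w in words]
    for i in range(len(texts) - len(tokens) + 1):
        if texts[i:i + len(tokens)] == tokens:
            return words[i:i + len(tokens)]
    return []  # phrase not found as a contiguous run

words = [
    {"text": "Invoice", "x0": 72,  "x1": 120, "top": 90, "bottom": 102},
    {"text": "total:",  "x0": 124, "x1": 158, "top": 90, "bottom": 102},
    {"text": "$1,240",  "x0": 162, "x1": 204, "top": 90, "bottom": 102},
]
hit = boxes_for_phrase(words, "total: $1,240")
print(merge_boxes(hit))  # → (124, 90, 204, 102)
```

The returned rectangle is what you'd draw (and make clickable) in the PDF viewer overlay; a real version would also group boxes by line so multi-line matches don't produce one huge rectangle.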

The tricky part is maintaining a mapping between your LLM extractions and these coordinates.

One way to do it would be with two LLM passes:

  1. First pass: Extract all important information from the PDF
  2. Second pass: "Hey LLM, find where each extraction appears in these bounded text chunks"
Not the cheapest approach since you're hitting the API twice, but it's straightforward!
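When the extractions are verbatim (or near-verbatim) quotes, a deterministic fallback can often replace that second LLM pass: normalize both sides and do a substring search over the parser's bounded text chunks. The chunk layout and data below are assumptions for illustration, not a specific parser's output.

```python
# Sketch of a cheaper alternative to the second LLM pass: map each
# extraction back to a chunk's bounding box via normalized substring
# search. Chunk data is made up for illustration.

def normalize(s):
    """Lowercase and collapse whitespace so minor rewording still matches."""
    return " ".join(s.lower().split())

def locate(extraction, chunks):
    """Return the bbox of the first chunk containing the extraction, else None."""
    needle = normalize(extraction)
    for chunk in chunks:
        if needle in normalize(chunk["text"]):
            return chunk["bbox"]
    return None  # fall back to the second LLM pass for this one

chunks = [
    {"text": "Payment due within 30 days.", "bbox": (72, 300, 420, 315)},
    {"text": "Total amount: $1,240.00",     "bbox": (72, 330, 420, 345)},
]
print(locate("Total Amount:  $1,240.00", chunks))  # → (72, 330, 420, 345)
```

Anything `locate` can't resolve (paraphrased or synthesized extractions) is the small remainder you'd actually send to the second LLM pass, which cuts the API cost considerably.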
replies(1): >>42954090 #
2. Jimmc414 No.42954090
Here's a PR (not yet merged, for some reason) that seems to be having some success with the bounding boxes:

https://github.com/getomni-ai/zerox/pull/44

Related to

https://github.com/getomni-ai/zerox/issues/7