Ingesting PDFs and why Gemini 2.0 changes everything

(www.sergey.fyi)

lazypenguin ◴[05 Feb 25 19:19 UTC] No.42953665[source]▶

>>42952605 (OP) #

I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate that multi-modal, large context window model in terms of ease-of-use. Ironically this vendor is the best known and most successful vendor for OCR'ing this specific type of PDF but many of our requests failed over to their human-in-the-loop process. Despite it not being their specialization switching to Gemini was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6s on average, accuracy was like 96% of that of the vendor and price was significantly cheaper. For the 4% inaccuracies a lot of them are things like the text "LLC" handwritten would get OCR'd as "IIC" which I would say is somewhat "fair". We probably could improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require some fancy "prompt engineering" to contort out a result.

Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem with weirdly high context window. Multi-modal so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding boxes part)!
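For readers who want to see what "add a file part to a prompt" could look like, here is a minimal sketch that only builds the request body and sends nothing. The field names (`inline_data`, `generationConfig`, `responseMimeType`, `responseSchema`) reflect my understanding of the Gemini REST `generateContent` request and should be checked against the current docs; `INVOICE_SCHEMA` and `build_ocr_request` are hypothetical names, not the commenter's code.

```python
import base64
import json

# Hypothetical target schema; the thread does not show the real one.
INVOICE_SCHEMA = {
    "type": "OBJECT",
    "properties": {
        "company_name": {"type": "STRING"},
        "total": {"type": "NUMBER"},
    },
    "required": ["company_name"],
}

# Prompt text quoted from the comment above.
PROMPT = "OCR this PDF into this format as specified by this json schema"


def build_ocr_request(pdf_bytes: bytes, schema: dict) -> dict:
    """Build a request body with one inline PDF part and one text part."""
    return {
        "contents": [
            {
                "parts": [
                    # The PDF is sent inline as base64 (assumed field names).
                    {
                        "inline_data": {
                            "mime_type": "application/pdf",
                            "data": base64.b64encode(pdf_bytes).decode("ascii"),
                        }
                    },
                    {"text": PROMPT},
                ]
            }
        ],
        # Ask for JSON constrained by the schema (assumed field names).
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": schema,
        },
    }


if __name__ == "__main__":
    body = build_ocr_request(b"%PDF-1.4 placeholder", INVOICE_SCHEMA)
    print(json.dumps(body)[:120])
```

Posting this body to the API (with authentication, retries, and size limits for large files) is left out on purpose.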

replies(33): >>42953680 #>>42953745 #>>42953799 #>>42954088 #>>42954472 #>>42955083 #>>42955470 #>>42955520 #>>42955824 #>>42956650 #>>42956937 #>>42957231 #>>42957551 #>>42957624 #>>42957905 #>>42958152 #>>42958534 #>>42958555 #>>42958869 #>>42959364 #>>42959695 #>>42959887 #>>42960847 #>>42960954 #>>42961030 #>>42961554 #>>42962009 #>>42963981 #>>42964161 #>>42965420 #>>42966080 #>>42989066 #>>43000649 #

1. kbyatnal ◴[06 Feb 25 00:46 UTC] No.42957551[source]▶

>>42953665 #

This is spot on, any legacy vendor focusing on a specific type of PDF is going to get obliterated by LLMs. The problem with using an off-the-shelf provider like this is, you get stuck with their data schema. With an LLM, you have full control over the schema meaning you can parse and extract much more unique data.

The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"

You could improve your accuracy further by adding some chain-of-thought to your prompt btw. e.g. Make each field in your json schema have a `reasoning` field beforehand so the model can CoT how it got to its answer. If you want to take it to the next level, `citations` in our experience also improves performance (and when combined with bounding boxes, is powerful for human-in-the-loop tooling).
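A rough sketch of the `reasoning`-field idea, assuming a plain JSON-schema-like dict: each property gets a `<name>_reasoning` string property inserted just before it. The helper name and the `_reasoning` suffix are illustrative assumptions, and whether a given API emits properties in declaration order is something to verify for the model you use.

```python
def add_reasoning_fields(schema: dict) -> dict:
    """Return a copy of an object schema with a reasoning field before each property."""
    required = set(schema.get("required", []))
    new_props = {}
    new_required = []
    for name, spec in schema.get("properties", {}).items():
        reasoning_name = f"{name}_reasoning"
        # The reasoning field comes first so the model writes it before the value.
        new_props[reasoning_name] = {
            "type": "string",
            "description": f"Briefly explain how the value of '{name}' was found.",
        }
        new_props[name] = spec
        if name in required:
            new_required.extend([reasoning_name, name])
    out = dict(schema)  # shallow copy; the input schema is not modified
    out["properties"] = new_props
    if "required" in schema:
        out["required"] = new_required
    return out


if __name__ == "__main__":
    example = {
        "type": "object",
        "properties": {"company_name": {"type": "string"}},
        "required": ["company_name"],
    }
    print(list(add_reasoning_fields(example)["properties"]))
```

A `citations` field could be added the same way, for example as an array of quoted source snippets per property.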

Disclaimer: I started an LLM doc processing infra company (https://extend.app/)

replies(6): >>42960720 #>>42964598 #>>42971548 #>>42993825 #>>42999533 #>>43081041 #

2. TeMPOraL ◴[06 Feb 25 09:40 UTC] No.42960720[source]▶

>>42957551 (TP) #

> The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"

A smart vendor will shift into that space - they'll use that LLM themselves, and figure out some combination of finetunes, multiple LLMs, classical methods and human verification of random samples, that lets them not only "validate its performance, and deploy it with confidence into prod", but also sell that confidence with an SLA on top of it.
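The "human verification of random samples" part can be sketched as follows: draw a random subset of processed documents for review, then turn the reviewers' pass/fail counts into an accuracy estimate with a Wilson score interval. Function names and the 95% z-value default are illustrative choices, not something the thread specifies.

```python
import math
import random


def sample_for_review(doc_ids, k, seed=None):
    """Pick up to k document ids uniformly at random for human review."""
    ids = list(doc_ids)
    rng = random.Random(seed)  # a seed makes the audit sample reproducible
    return rng.sample(ids, min(k, len(ids)))


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    batch = [f"doc-{i}" for i in range(1000)]
    to_review = sample_for_review(batch, k=100, seed=7)
    # Suppose reviewers mark 96 of the 100 sampled documents as correct.
    low, high = wilson_interval(correct=96, n=100)
    print(len(to_review), round(low, 3), round(high, 3))
```

The lower bound of the interval, rather than the raw sample accuracy, is the safer number to write into an SLA.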

replies(4): >>42961020 #>>42962159 #>>42965369 #>>43072136 #

3. Cumpiler69 ◴[06 Feb 25 10:37 UTC] No.42961020[source]▶

>>42960720 #

>A smart vendor will shift into that space - they'll use that LLM themselves

It's a bit late to start shifting now since it takes time. Ideally they should already have a product on the market.

replies(2): >>42961190 #>>42961828 #

4. TeMPOraL ◴[06 Feb 25 11:01 UTC] No.42961190{3}[source]▶

>>42961020 #

There's still time. The situation in which you can effectively replace your OCR vendor by hitting LLM APIs via a half-assed Python script ChatGPT wrote for you has existed for maybe a few months. People are only beginning to realize LLMs got good enough that this is an option. An OCR vendor that starts working on the shift today should easily be able to develop, tune, test and productize an LLM-based OCR pipeline way before most of their customers realize what's been happening.

But it is a good opportunity for a fast-moving OCR service to steal some customers from their competition. If I were working in this space, I'd be worried about that, and also about the possibility some of the LLM companies realize they could actually break into this market themselves right now, and secure some additional income.

EDIT:

I get the feeling that the main LLM suppliers are purposefully sticking to general-purpose APIs and refraining from competing with anyone on specific services, and that this goes beyond just staying focused. Some of the potential applications, like OCR, could turn into money printers if they moved on them now, and they all could use some more cash to offset what they burn on compute. Is it because they're trying to avoid starting an "us vs. them" war until after they've made everyone else dependent on them?

replies(2): >>42961652 #>>42965475 #

5. ◴[06 Feb 25 12:18 UTC] No.42961652{4}[source]▶

>>42961190 #

6. bayindirh ◴[06 Feb 25 12:47 UTC] No.42961828{3}[source]▶

>>42961020 #

Never underestimate the power of the second mover. Since the development is happening in the open, someone can quickly cobble together the information and cut straight to 90% of the work.

Then your secret sauce will be your fine tunes, etc.

Like it or not AI/LLM will be a commodity, and this bubble will burst. Moats are hard to build when you have at least one open source copy of what you just did.

replies(1): >>42964485 #

7. sitkack ◴[06 Feb 25 13:38 UTC] No.42962159[source]▶

>>42960720 #

Software is dead, if it isn't a prompt now, it will be a prompt in 6 months.

Most of what we think software is today, will just be a UI. But UIs are also dead.

replies(4): >>42962384 #>>42963150 #>>42964153 #>>42965019 #

8. cpursley ◴[06 Feb 25 14:02 UTC] No.42962384{3}[source]▶

>>42962159 #

Software without data moats, vendor lock-in, etc. sure will. All the low-hanging-fruit SaaS is going to get totally obliterated by LLM-built software.

replies(3): >>42964952 #>>42965226 #>>42983480 #

9. SketchySeaBeast ◴[06 Feb 25 15:16 UTC] No.42963150{3}[source]▶

>>42962159 #

I wonder about these takes. Have you never worked in a complex system in a large org before?

OK, sure, we can parse a PDF reliably now, but now we need to act on that data. We need to store it and make sure it ends up with the right people, who need to be notified that the data is available for their review. They then need to make decisions upon that data, possibly requiring input from multiple stakeholders.

All that back and forth needs to be recorded and stored, along with the eventual decision and all the supporting documents, and that whole bundle needs to be made available across multiple systems, which requires a bunch of ETLs and governance.

An LLM with a prompt doesn't replace all that.

replies(1): >>42966251 #

10. victorbjorklund ◴[06 Feb 25 16:46 UTC] No.42964153{3}[source]▶

>>42962159 #

Can you prompt a salesforce replacement for an org with 100 000 employees?

replies(1): >>42966218 #

11. SoftTalker ◴[06 Feb 25 17:26 UTC] No.42964485{4}[source]▶

>>42961828 #

And next year your secret sauce will be worthless because the LLMs are that much better again.

Businesses that are just "today's LLM + our bespoke improvements" won't have legs.

12. panta ◴[06 Feb 25 17:35 UTC] No.42964598[source]▶

>>42957551 (TP) #

How do you handle the privacy of the scanned documents?

replies(2): >>42964693 #>>43069253 #

13. kbyatnal ◴[06 Feb 25 17:48 UTC] No.42964693[source]▶

>>42964598 #

We work with fortune 500s in sensitive industries (healthcare, fintech, etc). Our policies are:

- data is never shared between customers

- data never gets used for training

- we also configure data retention policies to auto-purge after a time period

replies(1): >>42965220 #

14. sitkack ◴[06 Feb 25 18:14 UTC] No.42964952{4}[source]▶

>>42962384 #

Totally agree.

15. ◴[06 Feb 25 18:23 UTC] No.42965019{3}[source]▶

>>42962159 #

16. panta ◴[06 Feb 25 18:41 UTC] No.42965220{3}[source]▶

>>42964693 #

But how to get these guarantees from the upstream vendors? Or do you run the LLMs on premises?

replies(1): >>42965561 #

17. fragmede ◴[06 Feb 25 18:42 UTC] No.42965226{4}[source]▶

>>42962384 #

If I'm an autobody shop or some other well-served niche, how unhappy with them do I have to be to decide to find a replacement, either a competitor of theirs that used an LLM, or bring it in house and go off and find a developer to LLM-acceleratedly make me a better shopmonkey? And there are the integrations. I don't own a low hanging fruit SaaS company, but it seems very sticky, and since the established company already exists, they can just lower prices to meet their competitors.

B2B is different from B2C, so if one vendor has a handful of clients and they won't switch away, there's no obliterating happening.

What's opened up is even lower hanging fruit, on more trees. A SaaS company charging $3/month for the left-handed underwater basket weaver niche now becomes viable as a lifestyle business. The shovels in this could be supabase/similar, since clients can keep access to their data there even if they change frontends.

replies(2): >>42966227 #>>42971785 #

18. wraptile ◴[06 Feb 25 18:59 UTC] No.42965369[source]▶

>>42960720 #

That's what we did with our web scraping saas - with Extraction API¹ we shifted web scraped data parsing to support both predefined models for common objects like products, reviews etc. and direct LLM prompts that we further optimize for flexible extraction.

There's definitely space here to help the customer realize their extraction vision because it's still hard to scale this effectively on your own!

1 - https://scrapfly.io/extraction-api

19. anon84873628 ◴[06 Feb 25 19:11 UTC] No.42965475{4}[source]▶

>>42961190 #

To the point after your edit, I view it like the cloud shift from IaaS to PaaS / SaaS. Start with a neutral infrastructure platform that attracts lots of service providers. Then take your pick of which ones to replicate with a vertically integrated competitor or managed offering once you are too big for anyone to really complain.

20. Karrot_Kream ◴[06 Feb 25 19:20 UTC] No.42965561{4}[source]▶

>>42965220 #

If you're using LLM APIs there are SLAs from the vendors to make sure your inputs are not used as training data and other guarantees. Generally these endpoints cost more to use (the compliance fee essentially) but they solve the problem.

21. mrbungie ◴[06 Feb 25 20:33 UTC] No.42966218{4}[source]▶

>>42964153 #

Yesterday I read an /r/singularity post that was in awe because a screenshot of a lead management platform from OAI at a Japan convention supposedly meant a direct threat to Salesforce. Like, yeah sure buddy.

I would say most accelerationists/AI bulls/etc. don't really understand the true essential complexity in software development. LLMs are being seen as a software development silver bullet, and we know what happens with silver bullets.

replies(1): >>42966233 #

22. sitkack ◴[06 Feb 25 20:35 UTC] No.42966227{5}[source]▶

>>42965226 #

Which means that the current vc-software-ecosystem is the walking dead. The front end webdev is now going to do things that previously took a 10 person startup.

23. sitkack ◴[06 Feb 25 20:35 UTC] No.42966233{5}[source]▶

>>42966218 #

Come back to your comment in 18 months.

replies(1): >>42966381 #

24. sitkack ◴[06 Feb 25 20:38 UTC] No.42966251{4}[source]▶

>>42963150 #

We need to think in terms of light cones, not dog and pony take downs of whatever system you are currently running. See where things are going.

I have worked in large systems, both in code and people, compilers, massive data processing systems, 10k business units.

replies(2): >>42966352 #>>42967995 #

25. collingreen ◴[06 Feb 25 20:49 UTC] No.42966352{5}[source]▶

>>42966251 #

I don't know what light cones or dog and pony mean here but I'm interested in your take - would you care to expand a bit on how the future can reshape that very complicated set of steps and humans described in the parent?

26. collingreen ◴[06 Feb 25 20:51 UTC] No.42966381{6}[source]▶

>>42966233 #

I assume this is a slap intended to imply that ai actually IS a silver bullet answer to the parent's described problem and in just 18 months they will look back and realize how wrong they are.

Is that what you mean and, if so, is there anything in particular you've seen that leads you to see these problems being solved well or on the 18 month timeline? That sounds interesting to look at to me and I'd love to know more.

replies(1): >>42967637 #

27. sitkack ◴[06 Feb 25 23:41 UTC] No.42967637{7}[source]▶

>>42966381 #

It isn't a silver bullet in that it can just "make software" but it is changing the entire dynamic.

You can't do point sampling to figure out where things are going. We have to look at the slope. People see a paper come out, look at the results and say, "this fails for x, y and z. doesn't work"; that is not how scientific research works. This is why Two Minute Papers has the tag line, "hold on to your papers ... two papers down the line ..."

Copy and paste the whole thread into a SOTA model and have meta me explain it.

replies(1): >>42980359 #

28. SketchySeaBeast ◴[07 Feb 25 00:34 UTC] No.42967995{5}[source]▶

>>42966251 #

I think collingreen followed-up better than I ever could, so I'm hoping you can respond to them with more details.

29. montecruiseto ◴[07 Feb 25 11:44 UTC] No.42971548[source]▶

>>42957551 (TP) #

So why should I still use Extend instead of Gemini?

30. cpursley ◴[07 Feb 25 12:22 UTC] No.42971785{5}[source]▶

>>42965226 #

Integrations are part of the data moat I mentioned.

31. ethbr1 ◴[08 Feb 25 04:23 UTC] No.42980359{8}[source]▶

>>42967637 #

That's not why more experienced people are doubting you.

They're doubting you because the non-digital portions of processes change at people/org speed.

Which is to say that changing a core business process is a year political consensus, rearchitecture, and change management effort, because you also have to coordinate all the cascading and interfacing changes.

replies(1): >>42985568 #

32. Vrondi ◴[08 Feb 25 15:19 UTC] No.42983480{4}[source]▶

>>42962384 #

The only thing that will be different for most is that vendor lock-in will be to LLM vendors.

33. sitkack ◴[08 Feb 25 19:38 UTC] No.42985568{9}[source]▶

>>42980359 #

> changing a core business process is a year political consensus, rearchitecture, and change management effort

You are thinking within the existing structures, those structures will evaporate. All along the software supply chain, processes will get upended, not just because of how technical assets will be created, but also how organizations themselves are structured and react and in turn how software is created and consumed.

This is as big as the invention of the corporation, the printing press and the industrial revolution.

I am not here to tutor people on this viewpoint or defend it, I offer it and everyone can do with it what they will.

replies(1): >>42986090 #

34. ethbr1 ◴[08 Feb 25 20:51 UTC] No.42986090{10}[source]▶

>>42985568 #

Ha. Look back on this comment in a few years.

35. MajorData ◴[09 Feb 25 20:50 UTC] No.42993825[source]▶

>>42957551 (TP) #

How did you add bounding boxes, especially if it is a variety of files?

replies(1): >>43069204 #

36. raghavsb ◴[10 Feb 25 12:33 UTC] No.42999533[source]▶

>>42957551 (TP) #

Great, I landed on the reasoning and citations bit through trial and error and the outputs improved for sure.

37. bitdribble ◴[16 Feb 25 16:23 UTC] No.43069204[source]▶

>>42993825 #

In my open source tool http://docrouter.ai I run both OCR and LLM/Gemini, using litellm to support multiple LLMs. The user can configure extraction schema & prompts, and use tags to select which prompt/llm combination runs on which uploaded PDF.

LLM extractions are searched in OCR output, and if matched, the bounding box is displayed based on OCR output.
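The matching step described above can be sketched like this, assuming the OCR engine returns words with pixel coordinates: normalize the LLM-extracted value and the OCR words, look for the value as a contiguous run of words, and return the union of their boxes. `OcrWord` and `find_bbox` are hypothetical names, and this is not taken from the doc-router code base.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class OcrWord:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def _norm(s: str) -> str:
    """Lowercase and drop punctuation so 'LLC,' matches 'llc'."""
    return "".join(ch for ch in s.casefold() if ch.isalnum())


def find_bbox(value: str, words: List[OcrWord]) -> Optional[Tuple[float, float, float, float]]:
    """Return the union box of the first word run matching value, else None."""
    target = [t for t in (_norm(tok) for tok in value.split()) if t]
    if not target:
        return None
    normed = [_norm(w.text) for w in words]
    n = len(target)
    for i in range(len(words) - n + 1):
        if normed[i:i + n] == target:
            run = words[i:i + n]
            return (
                min(w.x0 for w in run),
                min(w.y0 for w in run),
                max(w.x1 for w in run),
                max(w.y1 for w in run),
            )
    return None  # no exact match; the value may be misread or reformatted
```

Exact matching fails when the LLM silently corrects an OCR misread (say "IIC" to "LLC"), so a real pipeline would likely need fuzzy matching as a fallback.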

Demo: app.github.ai (just register an account and try)

GitHub: https://github.com/analytiq-hub/doc-router

Reach out to me at andrei@analytiqhub.com for questions. Am looking for feedback and collaborators.

38. bitdribble ◴[16 Feb 25 16:29 UTC] No.43069253[source]▶

>>42964598 #

docrouter.ai can be installed on-prem. If using the SaaS version, users can collaborate in separate workspaces, modeled on how Databricks supports workspaces. The back-end DB is Mongo, which keeps things simple.

One level of privacy is the workspace level separation in Mongo. But, if there is customer interest, other setups are possible. E.g. the way Databricks handles privacy is by actually giving each account its own back end services - and scoping workspaces within an account.

That is a good possible model.
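One way to picture workspace-level separation in Mongo is a helper that wraps every caller-supplied filter so it can only match documents in the caller's workspace. This is a sketch under assumptions: the `workspace_id` field name and the `scoped_filter` helper are hypothetical, and it only builds the filter dict rather than talking to a database.

```python
from typing import Optional


def scoped_filter(workspace_id: str, user_filter: Optional[dict] = None) -> dict:
    """Combine a caller's filter with a mandatory workspace constraint."""
    if not workspace_id:
        raise ValueError("workspace_id is required")
    base = {"workspace_id": workspace_id}
    if not user_filter:
        return base
    # $and keeps the workspace constraint even if the caller's filter
    # also mentions workspace_id, so it cannot be overridden.
    return {"$and": [base, user_filter]}


if __name__ == "__main__":
    print(scoped_filter("ws-finance", {"tags": "invoice"}))
```

The resulting dict would be passed as the query filter to the driver. Per-account back-end services, as described for Databricks, give stronger isolation than a filter like this.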

39. quantumPilot ◴[16 Feb 25 21:58 UTC] No.43072136[source]▶

>>42960720 #

What's the value for a customer to pay a vendor that is only a wrapper around an LLM when they can leverage LLMs directly? I imagine tools being accessible for certain types of users, but for customers like those described here, you're better off replacing any OCR vendor with your own LLM integration

40. pmarreck ◴[17 Feb 25 17:05 UTC] No.43081041[source]▶

>>42957551 (TP) #

I have some out-of-print books that I want to convert into nice PDFs/EPUBs (like, reference-quality).

1) I don't mind destroying the binding to get the best quality. Any idea how I do so?

2) I have a multipage double-sided scanner (Fujitsu ScanSnap). Would this be sufficient to do the scan portion?

3) Is there anything that determines the font of the book text and reproduces that somehow? And that deals with things like bold and italic and applies that either as markdown output or what have you?

4) How do you de-paginate the raw text to reflow into (say) an EPUB or PDF format that will paginate based on the output device (page size/layout) specification?
