
1303 points serjester | 2 comments
lazypenguin
I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models, Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate how much a multi-modal model with a large context window does for ease of use. Ironically this vendor is the best known and most successful vendor for OCR'ing this specific type of PDF, yet many of our requests still failed over to their human-in-the-loop process. Despite OCR not being Gemini's specialization, switching was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6s on average, accuracy was about 96% of the vendor's, and the price was significantly cheaper. A lot of the 4% inaccuracies are things like the handwritten text "LLC" getting OCR'd as "IIC", which I would say is somewhat "fair". We could probably improve our prompt to clean up this data even further. Our prompt is currently very simple, "OCR this PDF into this format as specified by this JSON schema", and didn't require any fancy "prompt engineering" to contort out a result.

The Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem thanks to the weirdly large context window. Multi-modal, so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding boxes part)!
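For reference, here's a rough sketch of what this kind of call looks like with the google-generativeai Python SDK. The model name, schema fields and file name are illustrative assumptions, not our production setup, and the exact structured-output parameters may differ by SDK version:

    import google.generativeai as genai

    genai.configure(api_key="...")  # assumes an API key from AI Studio

    # Hypothetical schema for the kind of document being ingested
    doc_schema = {
        "type": "object",
        "properties": {
            "company_name": {"type": "string"},
            "total": {"type": "number"},
        },
    }

    model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice
    pdf = genai.upload_file("statement.pdf")           # the file "part"

    response = model.generate_content(
        [pdf, "OCR this PDF into the format specified by this JSON schema."],
        generation_config={
            "response_mime_type": "application/json",
            "response_schema": doc_schema,
        },
    )
    print(response.text)  # a JSON string conforming to the schema

The point is mostly that the whole pipeline is one upload plus one generate call; there is no layout model, template or zoning step to configure.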

makeitdouble
> After trial and error with different models

As a mere occasional customer, I've been scanning 4 to 5 pages of the same document layout every week in Gemini for half a year, and every single week the results were slightly different.

Of note, the docs are bilingual, which could affect the results, but what struck me is the lack of consistency: even with the same model, running it two or three times in a row gives different results.

That's fine for my usage, but it sounds like a nightmare if, every time Google tweaks their model, companies have to readjust their whole process to deal with the discrepancies.

And sticking with the same model for multiple years also sounds like a captive situation where you'd have to pay a premium for Google to keep it available for your use.

iandanforth
At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.
pigscantfly
This isn't really true unfortunately -- mixture of experts routing seems to suffer from batch non-determinism. No one has stated publicly exactly why this is, but you can easily replicate the behavior yourself or find bug reports / discussion with a bit of searching. The outcome and observed behavior of the major closed-weight LLM APIs is that a temperature of zero no longer corresponds to deterministic greedy sampling.
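To make "greedy sampling" concrete, here's a toy version of the temperature knob in plain numpy (nothing to do with any vendor's actual implementation): at temperature zero you just take the argmax, so fixed logits always give the same token.

    import numpy as np

    def sample_token(logits, temperature):
        # Temperature 0 -> greedy decoding: always pick the highest-scoring
        # token, so identical logits give identical output every time.
        if temperature == 0:
            return int(np.argmax(logits))
        # Otherwise scale the logits, softmax, and sample.
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(np.random.choice(len(logits), p=probs))

    logits = np.array([2.0, 1.5, 0.3])
    print(sample_token(logits, 0.0))  # always index 0
    print(sample_token(logits, 1.0))  # varies run to run

So if outputs still differ at temperature 0, the logits themselves must be changing between calls, which is where the batching and MoE routing discussion below comes in.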
brookst
If temperature is zero, and weights are weights, where is the non-deterministic behavior coming from?
petesergeant
The parent is suggesting that temperature only applies at the generation step, but the choice of backend "expert model" that a request is given to (and which then performs the generation) is non-deterministic. Rather than being a single set of weights, there are a few different sets of weights that constitute the "experts" in MoE. I have no idea if that's true, but that's the assertion.
brookst
I don't think that makes sense. Somewhere there has to be an RNG for that to be true. MoE itself doesn't introduce randomness, and the routing to experts is part of the model weights, not (I think) a separate model.
pigscantfly
The samples your input is batched with on the provider's backend vary between calls, and sparse mixture-of-experts routing, when implemented for efficient utilization, induces competition among tokens: expert usage is either encouraged or enforced to balance out across the tokens in the same fixed-size group. I think it's unknown, or at least undisclosed, exactly why sequence non-determinism at zero temperature occurs in these proprietary implementations, but I think this is a good theory.

[1] https://arxiv.org/abs/2308.00951, pg. 4
[2] https://152334h.github.io/blog/non-determinism-in-gpt-4/
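To illustrate the batching effect, here's a toy capacity-limited top-1 router, a generic sketch of the "enforced balance within a fixed-size group" idea above rather than any provider's actual implementation. Each expert only accepts a fixed number of tokens per batch, so a token can get bumped to its second-choice expert purely because of which other tokens it was batched with:

    import numpy as np

    def route_with_capacity(scores, capacity):
        # Greedy top-1 routing with a per-expert capacity limit: each token
        # takes its best-scoring expert that still has room, otherwise it
        # falls through to its next choice. A token's assignment therefore
        # depends on the other tokens sharing its batch.
        n_tokens, n_experts = scores.shape
        load = np.zeros(n_experts, dtype=int)
        choice = np.empty(n_tokens, dtype=int)
        for t in range(n_tokens):
            for e in np.argsort(-scores[t]):  # preferred experts, best first
                if load[e] < capacity:
                    choice[t] = e
                    load[e] += 1
                    break
        return choice

    rng = np.random.default_rng(0)
    token = rng.normal(size=4)  # router scores for "our" token, 4 experts
    batch_a = np.vstack([rng.normal(size=(7, 4)), token])
    batch_b = np.vstack([rng.normal(size=(7, 4)), token])
    print(route_with_capacity(batch_a, capacity=2)[-1],  # expert in batch A
          route_with_capacity(batch_b, capacity=2)[-1])  # may differ in batch B

Different experts mean slightly different logits, so even greedy decoding can diverge once two runs are batched differently.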

kettleballroll
I thought the temperature only affects randomness at the end of the network (when turning embeddings back into words using the softmax). It cannot influence routing, which is inherently influenced by which examples get batched together (i.e., it might depend on other users of the system).