Ingesting PDFs and why Gemini 2.0 changes everything

(www.sergey.fyi)

1303 points serjester | 2 comments | 05 Feb 25 18:05 UTC | HN request time: 0.411s | source

Show context

lazypenguin ◴[05 Feb 25 19:19 UTC] No.42953665[source]▶

I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate that multi-modal, large context window model in terms of ease-of-use. Ironically this vendor is the best known and most successful vendor for OCR'ing this specific type of PDF but many of our requests failed over to their human-in-the-loop process. Despite it not being their specialization switching to Gemini was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6s on average, accuracy was like 96% of that of the vendor and price was significantly cheaper. For the 4% inaccuracies a lot of them are things like the text "LLC" handwritten would get OCR'd as "IIC" which I would say is somewhat "fair". We probably could improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema" and didn't require some fancy "prompt engineering" to contort out a result.

Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem with weirdly high context window. Multi-modal so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding boxes part)!

replies(33): >>42953680 #>>42953745 #>>42953799 #>>42954088 #>>42954472 #>>42955083 #>>42955470 #>>42955520 #>>42955824 #>>42956650 #>>42956937 #>>42957231 #>>42957551 #>>42957624 #>>42957905 #>>42958152 #>>42958534 #>>42958555 #>>42958869 #>>42959364 #>>42959695 #>>42959887 #>>42960847 #>>42960954 #>>42961030 #>>42961554 #>>42962009 #>>42963981 #>>42964161 #>>42965420 #>>42966080 #>>42989066 #>>43000649 #

makeitdouble ◴[05 Feb 25 23:35 UTC] No.42956937[source]▶

>>42953665 #

> After trial and error with different models

As a mere occasional customer I've been scanning 4 to 5 pages of the same document layout every week in gemini for half a year, and every single week the results were slightly different.

To note the docs are bilingual so it could affect the results, but what stroke me is the lack of consistency, and even with the same model, running it two or three times in a row gives different results.

That's fine for my usage, but that sounds like a nightmare if everytime Google tweaks their model, companies have to reajust their whole process to deal with the discrepancies.

And sticking with the same model for multiple years also sound like a captive situation where you'd have to pay premium for Google to keep it available for your use.

replies(4): >>42957385 #>>42957436 #>>42960435 #>>42962678 #

iandanforth ◴[06 Feb 25 00:32 UTC] No.42957436[source]▶

>>42956937 #

At temperature zero, if you're using the same API/model, this really should not be the case. None of the big players update their APIs without some name / version change.

replies(2): >>42958471 #>>42960778 #

pigscantfly ◴[06 Feb 25 02:58 UTC] No.42958471[source]▶

>>42957436 #

This isn't really true unfortunately -- mixture of experts routing seems to suffer from batch non-determinism. No one has stated publicly exactly why this is, but you can easily replicate the behavior yourself or find bug reports / discussion with a bit of searching. The outcome and observed behavior of the major closed-weight LLM APIs is that a temperature of zero no longer corresponds to deterministic greedy sampling.

replies(1): >>42959211 #

brookst ◴[06 Feb 25 04:51 UTC] No.42959211[source]▶

>>42958471 #

If temperature is zero, and weights are weights, where is the non-deterministic behavior coming from?

replies(5): >>42959232 #>>42959979 #>>42960090 #>>42960416 #>>42961722 #

TeMPOraL ◴[06 Feb 25 07:26 UTC] No.42959979[source]▶

>>42959211 #

Here probably routing would be dominating, but in general, unless I missed all the vendors ditching GPUs and switching to ASICs optimized for fixed precision math, floating points are still non-commutative therefore results are non-deterministic wrt. randomness introduced by parallelising the calculations.

replies(2): >>42960601 #>>42960610 #

Dylan16807 ◴[06 Feb 25 09:22 UTC] No.42960601[source]▶

>>42959979 #

Why would the same software on the same GPU architecture use different commutations from run to run?

Also if you're even considering fixed point math, you can use integer accumulators to add up your parallel chunks.

replies(1): >>42960879 #

TeMPOraL ◴[06 Feb 25 10:08 UTC] No.42960879[source]▶

>>42960601 #

Why would the same multithreaded software run on the same CPU (not just architecture - the same physical chip) have its instructions execute in different order from run to run? Performance. Want things deterministic? You have to explicitly keep them in sync yourself. GPUs sport tens of thousands of parallel processors these days, which are themselves complex, and are linked together with more complexity, both hardware and software. They're designed to calculate fast, not to ensure every subprocessor is always in lock step with every other one.

Model inference on GPU is mostly doing a lot of GPU equivalent of parallelized product on (X1, X2, X3, ... Xn), where each X is itself some matrix computed by a parallelized product of other matrices. Unless there's some explicit guarantee somewhere that the reduction step will pause until it gets all results so it can guarantee order, instead of reducing eagerly, each such step is a non-determinism transducer, turning undetermined execution order into floating point errors via commutation.

I'm not a GPU engineer so I don't know for sure, especially about the new cards designed for AI, but since reducing eagerly allows more memory-efficient design and improves throughput, and GPUs until recently were optimized for games (where FP accuracy doesn't matter that much), and I don't recall any vendor making determinism a marketing point recently, I don't believe GPUs suddenly started to guarantee determinism at expense of performance.

replies(1): >>42964003 #

Dylan16807 ◴[06 Feb 25 16:32 UTC] No.42964003[source]▶

>>42960879 #

Each thread on a CPU will go in the same order.

Why would the reduction step of a single neuron be split across multiple threads? That sounds slower and more complex than the naive method. And if you do decide to write code doing that, then just the code that reduces across multiple blocks needs to use integers, so pretty much no extra effort is needed.

Like, is there a nondeterministic-dot-product instruction baked into the GPU at a low level?

replies(1): >>42971002 #

TeMPOraL ◴[07 Feb 25 09:52 UTC] No.42971002[source]▶

>>42964003 #

> Each thread on a CPU will go in the same order.

Not unless you control the underlying scheduler and force deterministic order; knowledge of all the code running isn't sufficient, as some factors affecting threading order are correlated with physical environment. For example, minute temperature gradient differences on the chip between two runs could affect how threads are allocated to CPU cores and order in which they finish.

> Why would the reduction step of a single neuron be split across multiple threads?

Doesn't have to, but can, depending on how many inputs it has. Being able to assume commutativity gives you a lot of flexibility in how you parallelize it, and allows you to minimize overhead (both in throughput and memory requirements).

> Like, is there a nondeterministic-dot-product instruction baked into the GPU at a low level?

No. There's just no dot-product instruction baked into GPU at low level that could handle vectors of arbitrary length. You need to write a loop, and that usually becomes some kind of parallel reduce.

replies(1): >>42971187 #

Dylan16807 ◴[07 Feb 25 10:31 UTC] No.42971187[source]▶

>>42971002 #

> could affect how threads are allocated to CPU cores and order in which they finish

I'm very confused by how you're interpreting the word "each" here.

> Being able to assume commutativity gives you a lot of flexibility in how you parallelize it, and allows you to minimize overhead (both in throughput and memory requirements).

Splitting up a single neuron seems like something that would only increase overhead. Can you please explain how you get "a lot" of flexibility?

> You need to write a loop, and that usually becomes some kind of parallel reduce.

Processing a layer is a loop within a loop.

The outer loop is across neurons and needs to be parallel.

The inner loop processes every weight for a single neuron and making it parallel sounds like extra effort just to increase instruction count and mess up memory locality and make your numbers less consistent.

replies(1): >>42971500 #

1. TeMPOraL ◴[07 Feb 25 11:35 UTC] No.42971500[source]▶

>>42971187 #

I feel like you're imagining a toy network with couple dozen neurons in few layers, done on a CPU. But consider a more typical case of dozens of layers with hundreds (or thousands) of neurons each. That's some thousand numbers to reduce per each neuron.

Then, remember that GPUs are built around thousands of tiny parallel processors, each able to process a bunch (e.g. 16) parallel threads, but then the threads have to run in larger batches (SIMD-like), and there's a complex memory management architecture built-in, over which you only have so much control. Specific numbers of cores, threads, buffer sizes, as well as access patterns, differ between GPU models, and for optimal performance, you have to break down your computation to maximize utilization. Or rather, have the runtime do it for you.

This ain't an an FPGA, you don't get to organize hardware to match your network. If you have a 1000 neurons per hidden layer, then individual neurons likely won't fit on a single CUDA core, so you will have to split them down the middle, at least if you're using full-float math. Speaking of, the precision of the numbers you use is another parameter that adds to the complexity.

On the one hand, you have a bunch of mostly-linear matrix algebra, where you can tune precision. On the other hand, you have a GPU-model-specific number of parallel processors (~thousands), that can fit only so much memory, can run some specific number of SIMD-like threads in parallel, and most of those numbers are powers of two (or a multiple of), so you have also alignment to take into account, on top of memory access patterns.

By default, your network will in no way align to any of that.

It shouldn't be hard to see that assuming commutativity gives you (or rather the CUDA compiler) much more flexibility to parallelize your calculations by splitting it whichever way it likes to maximize utilization.

replies(1): >>42976132 #

2. Dylan16807 ◴[07 Feb 25 18:55 UTC] No.42976132[source]▶

>>42971500 (TP) #

I'm not imagining toy sizes. Quite the opposite. I'm saying that layers are so big that splitting per neuron already gives you a ton of individual calculations to schedule and that's plenty to get full usage out of the hardware.

You can do very wide calculations on a single neuron if you want; throwing an entire SM (64 or 128 CUDA cores) at a single neuron is trivial to do in a deterministic way. And if you have a calculation so big you benefit from splitting it across SMs, doing a deterministic sum at the end will use an unmeasurably small fraction of your runtime.

And I'll remind you that I wasn't even talking about determinism across architectures, just within an architecture, so go ahead and optimize your memory layouts and block sizes to your exact card.

↑