> Crucially, we’ve seen very few instances where specific numerical values are actually misread. This suggests that most of Gemini’s “errors” are superficial formatting choices rather than substantive inaccuracies. We attach examples of these failure cases below [1].
> Beyond table parsing, Gemini consistently delivers near-perfect accuracy across all other facets of PDF-to-markdown conversion.
That seems fairly useful to me, no? Maybe not for mission critical applications, but for a lot of use cases, this seems to be good enough. I'm excited to try these prompts on my own later.
* Generous free tier
* Huge context window
* Lite version feels basically instant
However
* Lite model seems more prone to repeating itself / looping
* Very confusing naming, e.g. {model}-latest worked for 1.5 but now it's {model}-001? The Lite has a date appended, the non-Lite does not. Then there is exp and thinking exp... which has a date. wut?
Also regarding the failure case in the footnote, I think Gemini actually got that right (or at least outperformed Reducto) - the original document seems to have what I call a "3D" table where the third axis is rows within each cell, and having multiple headers is probably the best approximation in Markdown.
But how well does it actually handle that context window? E.g. a lot of models support 200K context, but the LLM can only really work with ~80K or so of it before it starts to get confused.
--
[1]: https://github.com/google-gemini/cookbook/blob/a916686f95f43...
Maybe incremental processing of chunks of the table would have worked, with subsequent stitching, but if Gemini can just process it that would be pretty good.
My problem statement is:
- Ingest PDFs, summarize them, and extract important information.
- Have some way to overlay the extracted information on the pdf in the UI.
- User can provide feedback on the overlaid info by accepting or rejecting the highlights as useful or not.
- This info goes back into the model for reinforcement learning.
Hoping to find something that can make this more manageable.
This is what I have found as well. From what I've read, LLMs do not work well with images for specific details because the image encoders are too lossy. (No idea if this is actually correct.) For now I guess you can use regular OCR to get bounding boxes.
Something like this opens up a lot of use cases.
I'd probably try those first, since otherwise you're depending on the language model to do the right thing automagically.
Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem with weirdly high context window. Multi-modal so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding boxes part)!
I suspect the issue is prompt engineering related.
> Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.
> - Use the top-left coordinate system
> - Values should be percentages of the image width and height (0 to 1)
LLMs have enough trouble with integers (since token-wise integers and text representation of integers are the same), high-precision decimals will be even worse. It might be better to reframe the problem as "this input document is 850 px x 1100 px, return the bounding boxes as integers" then parse and calculate the decimals later.
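As a rough illustration of that reframing (the page size and the example box below are made up), you would then normalize the integer boxes yourself:

```python
# Minimal sketch: ask the model for integer pixel boxes, normalize afterwards.
# Page dimensions and the example box are assumptions for illustration.
PAGE_W, PAGE_H = 850, 1100  # stated in the prompt, e.g. "this document is 850 x 1100 px"

def to_fractions(box_px):
    """Convert an integer pixel box (x0, y0, x1, y1) into 0-1 fractions of the page."""
    x0, y0, x1, y1 = box_px
    return (x0 / PAGE_W, y0 / PAGE_H, x1 / PAGE_W, y1 / PAGE_H)

# e.g. a box the model returned as integers:
print(to_fractions((120, 340, 410, 380)))
```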
Clickbait. It doesn't change "everything". It makes ingestion for RAG much less expensive (and therefore feasible in a lot more scenarios), at the expense of ~7% reduction in accuracy. Accuracy is already rather poor even before this, however, with the top alternative clocking in at 0.9. Gemini 2.0 is 0.84, although the author seems to suggest that the failure modes are mostly around formatting rather than e.g. mis-recognition or hallucinations.
TL;DR: is this exciting? If you do RAG, yes. Does it "change everything" nope. There's still a very long way to go. Protip for model designers: accuracy is always in greater demand than performance. A slow model that solves the problem is invariably better than a fast one that fucks everything up.
Gemini 2.0 is now available to everyone
Ironic, but GPT4o works better for me at longer contexts <128k than Gemini 2.0 flash. And out to 1m is just hopeless, even though you can do it.
But the bounding box problem hits close to home. We've found Unstructured's API gives pretty accurate box coordinates, and with some tweaks you can make them even better. The tricky part is implementing those tweaks without burning a hole in your wallet.
This is giving me hope that it's possible.
I suppose I'll try it again, for the 4th or 5th time.
This time I'm not excited. I'm expecting it to be a letdown.
you'll find that most of the errors here are structural issues with the table or inability to parse some special characters. tables can get crazy!
Is there any code example with a full prompt available from OP, or are there any references (such as similar GitHub repos) for those looking to get started with this topic?
Your insights would be highly appreciated.
>Please provide me strict bounding boxes that encompasses the following text in the attached image? I'm trying to draw a rectangle around the text.
> - Use the top-left coordinate system
>this input document is 1080 x 1236 px. return the bounding boxes as integers
The tricky part is maintaining a mapping between your LLM extractions and these coordinates.
One way to do it would be with two LLM passes:
1. First pass: Extract all important information from the PDF
2. Second pass: "Hey LLM, find where each extraction appears in these bounded text chunks"
Not the cheapest approach since you're hitting the API twice, but it's straightforward!

For this specific use case you can also try edgartools [1], a recently released library that ingests SEC submissions and filings. They don't use OCR but (from what I can tell) directly parse the XBRL documents submitted by companies and stored in EDGAR, if they exist.
Thinking of the OCR vendors who get replaced. Where might they go?
One thing I can think of is that AI could help the space industry take off. But wondering if there are any concrete examples of new jobs being created.
I've tried Adobe Acrobat AI for this and it doesn't work yet. NotebookLM is it. The grounding is the reason it works - you can easily click on anything and it will take you to the source to verify it. My only gripe is that the visual display of the source material is _dogshit ugly_, like exceptionally so. Big blog pink background letters in lines of 24 characters! :) It has trouble displaying PDF columns, but at least it parses them. The ugly will change I'm sure :)
My projects are setup to let me bridge the gaps between the various sources and synthesize something more. It helps to have a goal and organize your sources around that. If you aren't focused, it gets confused. You lay the groundwork in sources and it helps you reason. It works so well I feel _tender_ towards it :) Survey papers provide background then you add specific sources in your area of focus. You can write a profile for how you would like NotebookLM to think - which REALLY helps out.
They are:
* The Stratigrapher - A Lovecraftian short story about the world's first city. All of Seton Lloyd/Faud Safar's work on Eridu. Various sources on Sumerian culture and religion. All of Lovecraft's work and letters. Various sources about opium. Some articles about nonlinear geometries.
* FPGA Accelerated Graph Analytics - An introduction to Verilog. Papers on FPGAs and graph analytics. Papers on Apache Spark architecture. Papers on GraphFrames and a related rant I created about it and graph DBs. A source on Spark-RAPIDS. Papers on subgraph matching, graphlets, network motifs. Papers on random graph models.
* Graph machine learning notebook without a specific goal, which has been less successful. It helps to have a goal for the project. It got confused by how broad my sources were.
I would LOVE to share my projects with you all, but you can only share within a Google Workspaces domain. It will be AWESOME when they open this thing up :)
https://github.com/getomni-ai/zerox/pull/44
Related to
So I can ask Gemini to return chunks of variable size, where each chunk is one complete idea or concept, without arbitrarily chopping a logical semantic segment into multiple chunks.
Quick calculation: Input pricing: image input in 2.0 Flash is $0.0001935 per page. Let's ignore the prompt. Output pricing: let's assume 500 tokens per page, which is $0.0003.
Cost per page: $0.0004935
That means 2,026 pages per dollar. Not 6,000!
Might still be cheaper than many solutions but I don't see where these numbers are coming from.
By the way, image input is much more expensive in Gemini 2.0 even for 2.0 Flash Lite.
Edit: The post says batch pricing, which would be 4k pages based on my calculation. Using batch pricing is pretty different though. Great if feasible but not practical in many contexts.
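For reference, the arithmetic above in a few lines (same per-page figures as quoted; the 50% batch discount is an assumption):

```python
# Rough pages-per-dollar math for Gemini 2.0 Flash, using the figures quoted above.
image_input_cost = 0.0001935          # $ per page image
output_cost = 0.0003                  # $ for ~500 output tokens per page
cost_per_page = image_input_cost + output_cost

print(round(1 / cost_per_page))         # ~2026 pages per dollar at list price
print(round(1 / (cost_per_page / 2)))   # ~4052 pages per dollar if batch pricing is 50% off (assumption)
```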
(a) you have document understanding use cases that you'd like to use gemini for (the more aspirational the better) and/or
(b) there are loss cases for which gemini doesn't work well today,
please feel free to email anirudhbaddepu@google.com and we'd love to help get your use case working & improve quality for our next series of model updates!
As if, when ChatGPT was introduced, Google would just stay still, cross their arms, and say “well, this is based on our research paper but there’s nothing we can do, going to just roll over and wait for billions of dollars to run out, we’re truly doomed”. So unbelievably stupid.
We’ve generally found that Gemini 2.0 is a great model and have tested this (and nearly every VLM) very extensively.
A big part of our research focus is incorporating the best of what new VLMs offer without losing the benefits and reliability of traditional CV models. A simple example of this is we’ve found bounding box based attribution to be a non-negotiable for many of our current customers. Citing the specific region in a document where an answer came from becomes (in our opinion) even MORE important when using large vision models in the loop, as there is a continued risk of hallucination.
Whether that matters in your product is ultimately use case dependent, but the more important challenge for us has been reliability in outputs. RD-TableBench currently uses a single table image on a page, but when testing with real world dense pages we find that VLMs deviate more. Sometimes that involves minor edits (summarizing a sentence but preserving meaning), but sometimes it’s a more serious case such as hallucinating large sets of content.
The more extreme case is that internally we fine tuned a version of Gemini 1.5 along with base Gemini 2.0, specifically for checkbox extraction. We found that even with a broad distribution of checkbox data we couldn’t prevent frequent checkbox hallucination on both the flash (+17% error rate) and pro model (+8% error rate). Our customers in industries like healthcare expect us to get it right, out of the box, deterministically, and our team’s directive is to get as close as we can to that ideal state.
We think that the ideal state involves a combination of the two. The flexibility that VLMs provide, for example with cases like handwriting, is what I think will make it possible to go from 80 or 90 percent accuracy to some number very close to 99%. I should note that the Reducto performance for table extraction is with our pre-VLM table parsing pipeline, and we’ll have more to share in terms of updates there soon. For now, our focus is entirely on the performance frontier (though we do scale costs down with volume). In the longer term as inference becomes more efficient we want to move the needle on cost as well.
Overall though, I’m very excited about the progress here.
--- One small comment on your footnote: the evaluation script with the Needleman-Wunsch algorithm doesn’t actually consider the headers outputted by the models and looks only at the table structure itself.
Qwen2.5 VL was trained on a special HTML format for doing OCR with bounding boxes. [1] The resulting boxes aren't quite as accurate as something like Textract/Surya, but I've found they're much more accurate than Gemini or any other LLM.
OCR has always been “untrustworthy” (as in you cannot expect it to be 100% correct and know you must account for that) and we have long used ML algorithms for the process.
It will be ready for beta testing this week or the next, and I will be looking for beta testers; if interested please contact me!
Regardless of what assumptions you use - it's still an order of magnitude + improvement over anything else.
I get the inertia of the whole world being on PDF. And perhaps we can just eat the cost and let LLMs suffer the burden going forwards. But why not use that LLM coding brain power to create a better overall format?
I mean, do we really see printing things out onto paper something we need to worry about for the next 100 years? It reminds me of the TTY interface at the heart of Linux. There was a time it all made sense, but can we just deprecate it all now?
- A lot of natural chunk boundaries span multiple pages, so you need some 'sliding window' mechanism for the best accuracy.
- Passing the entire document hurts throughput too much due to the quadratic complexity of attention. Outputs are also much worse when you use too much context.
- Bounding boxes can be solved by first generating boxes using traditional OCR / layout recognition, then passing that data to the LLM. The LLM can then link its outputs to the boxes. Unfortunately, getting this reliable required a custom sampler, so proprietary models like Gemini are out of the question.
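To make the last point concrete, here's a minimal sketch of the box-linking idea; the OCR output, box format, and prompt wording are all made up for illustration:

```python
# Sketch: run traditional OCR/layout first, then have the LLM cite box IDs
# instead of inventing coordinates. The OCR results here are hard-coded stand-ins.
ocr_boxes = [
    {"id": 1, "text": "Invoice #4821", "bbox": (50, 40, 320, 70)},
    {"id": 2, "text": "Total due: $1,204.00", "bbox": (50, 910, 400, 940)},
]

def build_prompt(question: str) -> str:
    lines = [f"[{box['id']}] {box['text']}" for box in ocr_boxes]
    return (
        "Here are numbered text regions from a document:\n"
        + "\n".join(lines)
        + f"\n\nAnswer the question and cite the region IDs you used.\nQ: {question}"
    )

# The model's answer cites IDs, which map back to bbox coordinates we already trust.
print(build_prompt("What is the total amount due?"))
```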
Have seen MarkupX as a paid option, but it seems some AI in the loop can greatly speed up exception handling, encode family placement to certain elevations based on building code docs....
I am asking not to be cynical, but because in my limited experience, using LLMs for any task that may operate on offensive or unknown input seems to trigger all sorts of unpredictable moral judgements and drag them into generating not the output I wanted, at all.
If I am asking this black box to give me a JSON output containing keywords for a certain text, if it happens to be offensive, it refuses to do that.
How does one tackle that?
Alternatively, XML document formats and the like do exist. Indeed, HTML was supposed to be a document format. That’s not the problem. The problem is having people and systems actually author documents in that way in an unambiguous fashion, and having a uniform visual presentation for it that would be durable in the long term (decades at least).
PDF as a format persists because it supports virtually every feature under the sun (if authors care to use them), while largely guaranteeing a precisely defined visual presentation, and being one of the most stable formats.
[1] https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
Recent Gemini models actually do extraordinarily well.
https://cloud.google.com/blog/products/ai-machine-learning/t...
I founded a doc processing company [1] and in our experience, a lot of the difficulty w/ deploying document processing into production is when accuracy requirements are high (> 97%). This is because OCR and parsing is only one part of the problem, and real world use cases need to bridge the gap between raw outputs and production-ready data.
This requires things like:
- state-of-the-art parsing powered by VLMs and OCR
- multi-step extraction powered by semantic chunking, bounding boxes, and citations
- processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)
- tooling that lets nontechnical members quickly iterate, review results, and improve accuracy
- evaluation and benchmarking tools
- fine-tuning pipelines that turn reviewed corrections -> custom models
Very excited to test and benchmark Gemini 2.0 in our product, and very excited about the progress here.
I have a scanner, and some OCR processes I run things through. I am close to 85% from my automatic process.
The pain of going from 85% to 99% though is considerable. (and in my case manual) (well Perl helps)
I went to try this AI on one of the short poem manuscripts I have.
I told the prompt I wanted PDF to Markdown, and it says sure, go ahead, give me the PDF. I went to upload it. It spent a long time spinning, then a quick message comes up, something like
"Failed to count tokens"
but it just flashes and goes away.
I guess the PDF is too big? Weird though, it's not a lot of pages.
They say there's no magic prompt, but I'd start with their default, since there is usually some format used during post-training to improve performance on tasks like this.
You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.
You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.
You then feed each box of text to a regular OCR model, which also gives you a confidence score along with each prediction it makes.
You feed each image box into a multimodal model to describe what the image is about.
For tables, use a specialist model that does nothing but extract tables—models like GridFormer that aren't hyped to hell and back.
You then stitch everything together in an XML file because Markdown is for human consumption.
You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.
You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.
You then get chunking with location data and confidence scores for every part of the document to put as metadata into the RAG store.
I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.
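For anyone curious what the stitching step can look like, here's a toy sketch with made-up detection/OCR results (the actual models, schema, and scale are of course not shown):

```python
# Toy sketch: stitch per-region results into flat XML with confidence attributes.
# The region list is a hard-coded stand-in for object-detection + OCR output.
import xml.etree.ElementTree as ET

regions = [
    {"type": "heading",   "text": "Quarterly Report", "det_conf": 0.98, "ocr_conf": 0.95},
    {"type": "paragraph", "text": "Revenue grew 12%.", "det_conf": 0.93, "ocr_conf": 0.91},
    {"type": "table",     "text": "(cells omitted)",   "det_conf": 0.88, "ocr_conf": 0.84},
]

doc = ET.Element("document")
for region in regions:
    el = ET.SubElement(
        doc,
        region["type"],
        detection_confidence=f'{region["det_conf"]:.2f}',
        ocr_confidence=f'{region["ocr_conf"]:.2f}',
    )
    el.text = region["text"]

print(ET.tostring(doc, encoding="unicode"))
```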
Imagine you went to a government office looking for some document from 1930s, like an ancestors marriage or death certificate. You might want to digitize a facsimile of that using a camera or a scanner. You have a lot of options to store that, JPG, PNG, PDF. You have even more options to store the metadata (XML, RDF, TXT, SQLite, etc.). You could even get fancy and zip up an HTML doc alongside a directory of images/resources that stitched them all together. But there isn't really a good standard format to do that.
It is the second part of you post that stands out - the kitchen sink nature of PDFs that make them so terrible. If they were just wrappers for image data, formatted in a way that made printing them easy, I probably wouldn't dislike them.
If so this unlocks a massive workflow for us.
Marketing joke aside, maybe a hybrid approach could serve the vendor well. Best of both worlds if it reaps benefits or even have a look at hugging face for even more specialized aka better LLMs.
- BM25 to eliminate the 0-results-in-source-data problem (minimal sketch after this list)
- Longer term, a peek at Gwern's recent hierarchical embedding article. Got decent early returns even with fixed size chunks
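For the BM25 item above, a minimal sketch using the rank_bm25 package (the corpus and query are placeholders):

```python
# Sketch: lexical BM25 retrieval as a fallback for queries where embeddings return nothing useful.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "Gemini 2.0 Flash pricing and token limits",
    "How to parse PDF tables into Markdown",
    "Bounding box extraction with traditional OCR",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "pdf table parsing".split()
print(bm25.get_scores(query))              # raw BM25 score per document
print(bm25.get_top_n(query, corpus, n=1))  # best lexical match
```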
Llama and DeepSeek are no-brainers; the weights are public.
For others interested in BM25 for the use case above, I found this thread informative.
One cannot possibly say that "Text extracted by a multimodal model cannot hallucinate"?
> accuracy was like 96% of that of the vendor and price was significantly cheaper.
I would like to know how this 96% was tested. If you use a human to do random sample based testing, well how do you adjust the random sample for variations in distribution of errors that vary like a small set of documents could have 90% of the errors and yet they are only 1% of the docs?
They are all probabilistic. You literally get back characters + confidence intervals. So when textract gives you back incorrect characters, is that a hallucination?
I don’t think this is clear at all. A multimodal LLM can and will hallucinate data at arbitrary scale (phrases, sentences, etc.). Since OCR is the part of the system that extracts the “ground truth” out of your source documents, this is an unacceptable risk IMO.
We use it in combination with semantic but sometimes turn off the semantic part to see what happens and are surprised with the robustness of the results.
This would work less well for cross-language or less technical content, however. It's great for acronyms, company or industry specific terms, project names, people, technical phrases, and so on.
The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database
Just to get sick with it, we actually added some recursion to the Gemini step to have it rate how well it extracted, and if it was below a certain rating, to rewrite its own instructions on how to extract the information and then feed it back into itself. We didn't see any improvement in accuracy, but it was still fun to do.
The hallucination in LLM extraction is much more subtle as it will rewrite full sentences sometimes. It is much harder to spot when reading the document and sounds very plausible.
We're currently working on a version where we send the document to two different LLMs, and use a 3rd if they don't match to increase confidence. That way you have the option of trading compute and cost for accuracy.
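Roughly this kind of escalation pattern, as a sketch; the `extract_with` helper and model names are placeholders, not a real API:

```python
# Sketch: extract with two models, escalate to a third only on disagreement.
def extract_with(model_name: str, document: bytes) -> dict:
    """Placeholder for an actual LLM extraction call."""
    raise NotImplementedError

def extract_with_voting(document: bytes) -> dict:
    a = extract_with("model-a", document)
    b = extract_with("model-b", document)
    if a == b:
        return a                                # agreement: accept the cheap path
    c = extract_with("model-c", document)       # tie-breaker pass
    if c == a:
        return a
    if c == b:
        return b
    return {"needs_review": True, "candidates": [a, b, c]}  # flag for a human
```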
I don't claim PDF is a good format. It is inscrutable to me.
Oh god, I wish speech to text engines would colour code the whole thing like a heat map to focus your attention to review where it may have over-enthusiastically guessed at what was said.
You no knot.
We are working on compliance solution (https://fx-lex.com) and RAG just doesn’t cut it for our use case. Legislation cannot be chunked if you want the model to reason well about it.
It’s magical to be able to just throw everything into the model. And the best thing is that we automatically benefit from future model improvements along all performance axes.
Really though I just meant "it's a no-brainer that they are popular here on HN".
Correlating the two (Textract <-> AI) outputs is difficult, but another round of AI is usually good at that. Combined with some text-difference scoring and logic, I can get pretty good full-document understanding of questions and answer locations. I've spent a pretty absurd amount of time on this and as of yet have not launched a product with it, but if anyone is interested I'd love to chat about the pipeline!
However we do very much recommend storing the raw model responses for audit and then at least as vector embeddings to orient the expectation that the data will need to be utilized for vector search and RAG. Kind of like "while we're here why don't we do what you're going to want to do at some point, even if it's not your use-case now..."
One can wonder how much wonkiness of LLMs comes from errors in extracting language from PDFs.
Adobe is the most harmful software development company in existence.
I integrated gemini recently to improve accuracy in certain blocks like tables. (get initial text, then pass to gemini to refine) Marker alone works about as well as gemini alone, but together they benchmark much better.
There should be laws that mandates that financial information be provided in a sensible format: even Office Open XML would be better than this insanity. Then we can redirect all this wasted effort into digging ditches and filling them back in again.
we have a technical blog on this exact phenomena coming out in the next couple days, will attach it here when it’s out!
check us out at https://www.runpulse.com
But on the flip side, layout is often times the biggest determinant of accuracy, and that's something LLMs do a way better job on. It doesn't matter if you have 100% accurate text from a table, but all that text is balled into one big paragraph.
Also the "pick the most plausible" approach is a blessing and a curse. A good example is the handwritten form here [1]. GPT 4o gets the all the email addresses correct because it can reasonably guess these people are all from the same company. Whereas AWS treats them all independently and returns three different emails.
Also I've been hearing good things regarding document retrieval about Gemini 1.5 Pro, 2.0 Flash and gemini-exp-1206 (the new 2.0 Pro?), which is the best Gemini model for data extraction from 100k tokens?
How do they compare against Claude Sonnet 3.5 or the OpenAI models, has anyone done any real world tests?
The table of costs in the blog post. At 500,000 pages per day the hardware fixed cost overcomes the software variable cost at day 240 and from then on you're paying an extra ~$100 per day to keep it running in the cloud. The machine also had to use extremely beefy GPUs to fit all the models it needed to. Compute utilization was between 5 to 10% which means that it's future proof for the next 5 years at the rate at which the data source was growing.
| Model | Pages per Dollar |
|-----------------------------+------------------|
| Gemini 2.0 Flash | ≈ 6,000 |
| Gemini 2.0 Flash Lite | ≈ 12,000* |
| Gemini 1.5 Flash | ≈ 10,000 |
| AWS Textract | ≈ 1,000 |
| Gemini 1.5 Pro | ≈ 700 |
| OpenAI 4o-mini | ≈ 450 |
| LlamaParse | ≈ 300 |
| OpenAI 4o | ≈ 200 |
| Anthropic claude-3-5-sonnet | ≈ 100 |
| Reducto | ≈ 100 |
| Chunkr | ≈ 100 |
There is also the fact that it's _completely_ local, which meant we could throw every proprietary data source that couldn't leave the company at it.

> The longest part of the development timeline was establishing the accuracy rate and the ingestion pipeline, which itself is massively less complex than what your workflow sounds like: PDF -> Storage Bucket -> Gemini -> JSON response -> Database
Each company should build tools which match the skill level of their developers. If you're not comfortable training models locally with all that entails off the shelf solutions allow companies to punch way above their weight class in their industry.
As a mere occasional customer I've been scanning 4 to 5 pages of the same document layout every week in gemini for half a year, and every single week the results were slightly different.
To note, the docs are bilingual, so it could affect the results, but what struck me is the lack of consistency; even with the same model, running it two or three times in a row gives different results.
That's fine for my usage, but that sounds like a nightmare if every time Google tweaks their model, companies have to readjust their whole process to deal with the discrepancies.
And sticking with the same model for multiple years also sounds like a captive situation where you'd have to pay a premium for Google to keep it available for your use.
It's easy if you try
No pdfs below us
Above us only SQL
Imagine all the people livin' for CSV
The number one takeaway we got was to use much larger images than anything that anyone else ever mentioned to get good results. A rule of thumb was that if you print the PNG of the image, it should be easily readable from 2m away.
The actual model is proprietary and stuck in corporate land forever.
The rendering step for reports that humans got to see was a call to pandoc after the sxml was rendered to markdown - look ma we support powerpoint! - but it also allowed us to easily convert to whatever insane markup a given large (or small) language model worked best with on the fly.
Nice article! We're actively benchmarking Gemini 2.0 right now and if the results are as good as implied by this article, heck we'll adapt and improve upon it. Our goal (and in fact the reason our parser works so well) is to always use and stay on top of the latest SOTA models and tech :) - we blend LLM/VLM tech with best-in-class heuristic techniques.
Some quick notes: 1. I'm glad that LlamaParse is mentioned in the article, but it's not mentioned in the performance benchmarks. I'm pretty confident that our most accurate modes are at the top of the table benchmark - our stuff is pretty good.
2. There's a long tail of issues beyond just tables - this includes fonts, headers/footers, ability to recognize charts/images/form fields, and as other posters said, the ability to have fine-grained bounding boxes on the source elements. We've optimized our parser to tackle all of these modes, and we need proper benchmarks for that.
3. DIY'ing your own pipeline to run a VLM at scale to parse docs is surprisingly challenging. You need to orchestrate a robust system that can screenshot a bunch of pages at the right resolution (which can be quite slow), tune the prompts, and make sure you're obeying rate limits + can retry on failure.
How well does llamaparse work on foreign-language documents?
I have pipeline for Arabic-language docs using Azure for OCR and GPT-4o-mini to extract structured information. Would it be worth trying llamaparse to replace part of the pipeline or the whole thing?
Also, the real solution to the problem should have been for the IRS to just pre-fill tax returns with all the accounting data that they obviously already have. But that would require the government to care.
LibreOffice makes this especially easy to do: https://wiki.documentfoundation.org/Faq/Writer/PDF_Hybrid
Gemini 2.0 Flash seems better than 1.5 - https://deepmind.google/technologies/gemini/flash/
The problem then shifts from "can we extract this data from the PDF" to "how do we teach an LLM to extract the data we need, validate its performance, and deploy it with confidence into prod?"
You could improve your accuracy further by adding some chain-of-thought to your prompt btw. e.g. Make each field in your json schema have a `reasoning` field beforehand so the model can CoT how it got to its answer. If you want to take it to the next level, `citations` in our experience also improves performance (and when combined with bounding boxes, is powerful for human-in-the-loop tooling).
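For example, a field-level schema along these lines (the field names are illustrative, not from the post):

```python
# Sketch: put a `reasoning` key before each answer field so the model explains itself
# before committing to a value; `citations` can hold quoted source text or box IDs.
invoice_schema = {
    "type": "object",
    "properties": {
        "total_amount_reasoning": {"type": "string"},
        "total_amount": {"type": "number"},
        "total_amount_citations": {
            "type": "array",
            "items": {"type": "string"},
        },
    },
    "required": ["total_amount_reasoning", "total_amount"],
}
```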
Disclaimer: I started an LLM doc processing infra company (https://extend.app/)
If you don't do that, you get a kitchen sink. If you need to store 1930s death certificates, 10-K filings, your doctor's signup forms, the ARR graph for your startup, and a genealogy chart all in the same format, kitchen sink it is.
If it were "just a wrapper for image data", what exactly would that wrapper add? Semantic information, or a kitchen sink to manage additional info.
You're asking to store complex data without preserving complexity - I don't think that'll work.
Speaking from experience, you need to double check "I" and "l" and "1" "0" and "O" all the time, accuracy seems to depend on the font and some other factors.
I have a util script I use locally to copy some token values out of screenshots from a VMware client (long story), and I have to manually adjust the output 9 times out of 10.
How relevant that is or isn't depends on the use case.
If we had unlimited memory, compute and data we'd use a rank N tensor for an input of length N and call it a day.
Unfortunately N^N grows rather fast and we have to do all sorts of interesting engineering to make ML calculations complete before the heat death of the universe.
At the time, I evaluated multiple SDKs for both OCR and non-OCR PDF conversions, but none matched the accuracy of Adobe Acrobat’s built-in solution. In fact, at one point (don’t laugh), the company resorted to running Adobe Acrobat on a Windows machine with automation tools to handle the conversion. Using Adobe’s cloud service for conversion was not an option due to the proprietary nature of the PDFs. Additionally, its results were inconsistent and often worse compared to the desktop version of Adobe Acrobat!
Given that experience, I see this primarily as an HTML/text conversion challenge. If Gemini 2.0 truly improves upon existing solutions, it would be interesting to see a direct comparison against popular proprietary tools in terms of accuracy.
You are assuming you can match Gemini's performance, Google's engineering resources and costs being constant in to the future.
I’ve been wanting to build a system that ingests PDF reports that reference other types of data like images, CSVs, etc. that can also be ingested, to ultimately build an analytics database from the stack of unsorted data and its metadata, but I have not found any time to do anything like that yet. What kind of tooling do you use to build your data pipelines?
If it's superior (esp. for scans with text flowing around image boxes), and if you do end up packaging it up for brew, know that there's at least one developer who will benefit from your work (for a side-project, but that goes without saying).
Thanks in advance!
One of the standards that has come out of that is EN 16931, also known as ZUGFeRD and Factur-X, which basically involves embedding an XML file with the invoice details inside a PDF/A. It allows the PDF to be used like a regular PDF but it also allows the government procurement platforms to reliably parse the contents without any kind of intelligence.
It seems like a nice solution that would solve a lot of issues with ingesting PDFs for accounting if everyone somehow managed to agree on a standard. Maybe if EN 16931 becomes more broadly available it might start getting used in the private sector too.
I'm not assuming. We already did, 18 months ago with better performance than the current generation of Gemini for our use case.
You're falling into the usual trap of thinking that because big tech spends big money it gets big results. It doesn't. To quote a friend who was a manager at google "If only I could get my team of 100 to be as productive as my first team of three.".
Finally, I must point out that statements in the vein of "Why [product] 2.0 Changes Everything" are more often than not a load of humbug.
The struggle that almost every OCR use case faces is with handwritten documents (doctor prescriptions with bad handwriting). With Gemini 1.5 Flash we've had ~75-80% accuracy (based on random sampling by pharmacists). We're planning to improve this further by fine-tuning Gemini models with medical data.
What could be other alternative services/models for accurate handwriting ocr?
I'm actually somewhat surprised Gemini didn't guess from context that LLC is much more likely?
I guess the OCR subsystem is intentionally conservative? (Though I'm sure you could do a second step on your end, take the output from the conservative OCR pass, and sent it through Gemini and ask it to flag potential OCR problems? I bet that would flag most of them with very few false positives and false negatives.)
Something that was clearly a table now becomes a bunch of glyphs physically close to each other versus another group of glyphs, which, when considered as a group, is a box visually separated from the other group of glyphs but is actually part of a table.
I’m interested to hear more about the validation process here. In my limited experience, I’ve sent the same “document” to multiple LLMs and gotten differing results. But sometimes the “right” answer was in the minority of responses. But over a large sample (same general intent of document, but very different possible formats of the information within), there was no definitive winner. We’re still working on this.
[1] https://arxiv.org/abs/2308.00951 pg. 4 [2] https://152334h.github.io/blog/non-determinism-in-gpt-4/
As for markdown, great. Now how do you encode the metadata about the confidence of the model that the text says what it thinks it says? Because XML has this lovely thing called attributes that lets you keep a provenance record, readable by the LLM, without a second database.
And no, outlawing use the use of AI or increasing liability with its use will have next to nothing to deter its misuse and everyone knows it. My heart goes out to the remaining 15%.
That is, you'd need 5 exa-yottabytes to solve it.
Currently the whole world has around 200 zettabytes of storage.
In short, for the next 120 years MNIST will need mathematical tricks to be solved.
Traditional OCR is more likely to skip characters, or replace them with similar-looking ones, so you often get AL or A1 instead of AI, for example. In other words, traditional spelling mistakes. LLMs can do anything from hallucinating new paragraphs to slightly changing the meaning of a sentence. The text is still grammatically correct, it makes sense in the context, except that it's not what the document actually said.
I once gave it a hand-written list of words and their definitions and asked it to turn that into flashcards (a json array with "word" and "definition"). Traditional OCR struggled with this text, the results were extremely low-quality, badly formatted but still somewhat understandable. The few LLMs I've tried either straight up refused to do it, or gave me the correct list of words, but entirely hallucinated the definitions.
Or maybe the way to add new hallucinations. Nobody really knows. Just trust us bro, this is groundbreaking disruptive technology.
https://media.ccc.de/v/31c3_-_6558_-_de_-_saal_g_-_201412282...
Digital fax services will generate pdf files, for example. They're just image data dumped into a pdf. Various scanners will also do so.
Temperature changes the softmax equation [1], not whether or not you are sampling from the softmax result or choosing the highest probability. IBM's documentation corroborates this, saying you need to set do_sample to True in order for the temperature to have any effect, i.e., T changes how we sample, not if we sample [2].
A similar discussion on the OpenAI forum also claims that the RNG might be in a different state from run to run, although I am less sure about that [3].
[1] https://pelinbalci.com/2023/10/16/Temperature_parameter.html
[2] https://www.ibm.com/think/topics/llm-temperature#:~:text=The...
[3] https://community.openai.com/t/clarifications-on-setting-tem...
On information-dense pages, LLMs often hallucinate half the time, they have trouble understanding empty cells in tables, don't understand checkboxes, etc.
We had to invest heavily into building a state-of-the-art layout understanding model and finally a table structure understanding model for reliability. LLMs will get there, but there is still some way to go.
Where they do well is in VQA type use cases, ask a question, very narrowly scoped, they will work much better than OCR+Layout models, because they are much more generalizable and flexible to use.
Sounds terrifying. How can you be sure that there were no conversion mistakes?
The EU regulations typically include delegated acts, technical standards, implementation standards and guidelines. With Gemini 2.0 we are able to just throw all of this into the model and have it figure it out.
This approach gives way better results than anything we are able to achieve with RAG.
My personal bet is that this is what the future will look like. RAG will remain relevant, but only for extremely large document corpuses.
It's more about the information about the specific problem you are solving having less impact than techniques that target the compute. So in this case, breaking down how to parse a PDF in stages for your domain involves specific expert knowledge of the domain, but training with attention is about efficient use of compute in general, with no domain expertise.
My intuition - not based on any research - is that recall should be a lot better from in context data vs. weights in the model. For our use case, precise recall is paramount.
Also if you're even considering fixed point math, you can use integer accumulators to add up your parallel chunks.
Is our tooling too bad for this?
Reaching reliability with LLM OCR might involve some combination of multiple LLMs (and keeping track of how they change), perhaps mixed with old-school algorithms, and random sample reviews by humans. They can tune this pipeline however they need at their leisure to eke out extra accuracy, and then put written guarantees on top, and still be cheaper for you long-term.
https://stackoverflow.com/questions/67358370/what-the-standa...
A smart vendor will shift into that space - they'll use that LLM themselves, and figure out some combination of finetunes, multiple LLMs, classical methods and human verification of random samples, that lets them not only "validate its performance, and deploy it with confidence into prod", but also sell that confidence with an SLA on top of it.
The write-up and ensuing conversation are really exciting. I think out of everything mentioned here - the clear stand-out point is that document layout analysis (DLA) is the crux of the issue for building practical doc ingestion for RAG.
(Note: DLA is the process of identifying and bounding specific segments of a document - like section headers, tables, formulas, footnotes, captions, etc.)
Strap in - this is going to be a longy.
We see a lot of people and products basically sending complete pages to LVLMs for converting to a machine-readable format, and for chunking. We tried this + it’s a possible configuration on chunkr as well. It has never worked for our customers, or during extensive internal testing across documents from a variety of verticals. Here are SOME of the common problems:
- Most documents are dense. The model will not OCR everything and will miss crucial parts.
- A bunch of hallucinated content that's tough to catch.
- Occasionally it will just refuse to give you anything. We’ve tried a bunch of different prompting techniques and the models return “<image>” or “||..|..” for an ENTIRE PAGE of content.
Despite this - it’s obvious that these ginormous neural nets are great for complex conversions like tables and formulas to HTML/Markdown & LaTeX. They also work great for describing images and converting charts to tables. But that’s the thing - they can only do this if you can pull out these document features individually as cropped images and have the model focus on small snippets of the document rather than the full page.
If you want knobs for speed, quality, and cost, the best approach is to work at a segment level rather than a page level. This is where DLA really shines - the downstream processing options are vast and can be fit to specific needs. You can choose what to process with simple + fast OCR (text-only segments like headers, paragraphs, captions), and what to send to a large model like Gemini (complex segments like tables, formulas, and images) - all while getting juicy bounding boxes for mapping citations. Combine this with solid reading order algos - and you get amazing layout-aware chunking that takes ~10ms.
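A stripped-down sketch of that routing idea; the segment types and the two processing helpers are placeholders:

```python
# Sketch: route layout segments by type -- cheap OCR for plain text segments,
# a large VLM only for the segments that actually need it.
FAST_OCR_TYPES = {"header", "paragraph", "caption", "footnote"}
VLM_TYPES = {"table", "formula", "image", "chart"}

def fast_ocr(segment_image): ...       # placeholder for a traditional OCR call
def vlm_convert(segment_image): ...    # placeholder for a Gemini/VLM call

def process_segment(segment: dict):
    if segment["type"] in VLM_TYPES:
        return vlm_convert(segment["image"])
    return fast_ocr(segment["image"])  # default to the fast path
```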
We made RAG apps ourselves and attempted to index all ~600 million pages of open-access research papers for https://lumina.sh. This is why we built Chunkr - and it needed to be Open Source. You can self-host our solution and process 4 pages per second, scaling up to 11 million pages per month on a single RTX 4090, renting this hardware on Runpod costs just $249/month ($0.34/hour).
A VLM to do DLA sounds awesome. We've played around with this idea but found that VLMs don't come close to models where the architecture is solely geared toward these specific object detection tasks. While it would simplify the pipeline, VLMs are significantly slower and more resource-hungry - they can't match the speed we achieve on consumer hardware with dedicated models. Nevertheless, the numerous advances in the field are very exciting - big if true!
A note on costs:
There are some discrepancies between the API pricing of providers listed in this thread. Assuming 100000 pages + feature parity:
Chunkr API - 200 pages for $1, not 100 pages
AWS Textract - 40 pages for $1, not 1000 pages (No VLMs)
Llama Parse - 13 pages for $1, not 300
A note on RD-Bench:
We’ve been using Gemini 1.5 Pro for tables and other complex segments for a while, so the RD-bench is very outdated. We ran it again on a few hundred samples and got a 0.81 (also includes some notes on the bench itself). To the OP: it would be awesome if you could update your blog post!
https://github.com/lumina-ai-inc/chunkr-table-rdbench/tree/m...
Model inference on GPU is mostly doing a lot of GPU equivalent of parallelized product on (X1, X2, X3, ... Xn), where each X is itself some matrix computed by a parallelized product of other matrices. Unless there's some explicit guarantee somewhere that the reduction step will pause until it gets all results so it can guarantee order, instead of reducing eagerly, each such step is a non-determinism transducer, turning undetermined execution order into floating point errors via commutation.
I'm not a GPU engineer so I don't know for sure, especially about the new cards designed for AI, but since reducing eagerly allows more memory-efficient design and improves throughput, and GPUs until recently were optimized for games (where FP accuracy doesn't matter that much), and I don't recall any vendor making determinism a marketing point recently, I don't believe GPUs suddenly started to guarantee determinism at expense of performance.
Or is there a use case for digital non-text pdfs? Are people really generating image and not text-based PDFs? Or is the primary use case extracting structure, rather than text?
That is impressive. However, if someone needs to read a couple of hundred pages per day, there's no point in setting all that up.
Also, you neglected to mention the cost of setting everything up. The machine cost $20k; but your time, and cost to train yolo8, probably cost more than that. If you want to compare costs (find a point where local implementation such as this is better ROI), you should compare fully loaded costs.
It's a bit late to start shifting now since it takes time. Ideally they should already have a product on the market.
You can recover word-level bounding boxes and confidence scores by using a traditional OCR engine such as AWS Textract and matching the results to Gemini’s output – see https://docless.app for a demo (disclaimer: I am the founder)
For most use cases in financial services, accurate data is very important.
Floating points are fundamentally too bad for this. We use them because they're fast, which usually more than compensates for inaccuracies FP math introduces.
(One, dealing with FP errors is mostly a fixed cost - there's a branch of CS/mathematics specializing in it, producing formally proven recipes for computing specific things in way that minimize or at least give specific bounds on errors. That's work that can be done once, and reused forever. Two, most programmers are oblivious to those issues anyway, and we've learned to live with the bugs :).)
When your parallel map-reduce is just doing matrix additions and multiplications, guaranteeing order of execution comes with serious overhead. For one, you need to have all partial results available together before reducing, so either the reduction step needs to have enough memory to store a copy of all the inputs, or it needs to block the units computing those inputs until all of them finish. Meanwhile, if you drop the order guarantee, then the reduction step just needs one fixed-size accumulator, and every parallel unit computing the inputs is free to go and do something else as soon as it's done.
So the price you pay for deterministic order is either a reduction of throughput or increase in on-chip memory, both of which end up translating to slower and more expensive hardware. The incentives strongly point towards not giving such guarantees if it can be avoided - keep in mind that GPUs have been designed for videogames (and graphics in general), and for this, floating point inaccuracies only matter when they become noticeable to the user.
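The associativity point is easy to see even on a CPU; a two-line demonstration:

```python
# Floating-point addition is not associative, so reduction order changes results.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- the 1.0 is absorbed by -1e16 before the cancellation
```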
Reducto's own model currently outperforms Gemini Flash 2.0 on this benchmark (0.90 vs 0.84). However, as we review the lower-performing examples, most discrepancies turn out to be minor structural variations that would not materially affect an LLM’s understanding of the table.
But it is a good opportunity for a fast-moving OCR service to steal some customers from their competition. If I were working in this space, I'd be worried about that, and also about the possibility some of the LLM companies realize they could actually break into this market themselves right now, and secure some additional income.
EDIT:
I get the feeling that the main LLM suppliers are purposefully sticking to general-purpose APIs and refraining from competing with anyone on specific services, and that this goes beyond just staying focused. Some of potential applications, like OCR, could turn into money printers if they moved on them now, and they all could use some more cash to offset what they burn on compute. Is it because they're trying to avoid starting an "us vs. them" war until after they made everyone else dependent on them?
If you already have a highly optimized pipeline built yesterday, then sure, keep using it.
But if you start dealing with PDF today, just use Gemini. Use the most human readable formats you can find because we know AI will be optimized on understanding that. Don't even think about "stitching XML files" blahblah.
On the other hand, these VLMs are very generic models – yes, they're trained on OCR tasks, but also a dozen of other tasks. As such, they're really good OCR models, but they tend to be not as well calibrated. We use VLMs at work (Qwen2-VL to be specific), and we don't find it hallucinates that often, but we're not dealing with long documents. I would assume that as you're dealing with a larger set of documents, you have a much larger context, which increases the chances of the model getting confused and hallucinating.
"Very few" is way too many. This means it cannot be trusted, especially when it comes to financial data.
The Gemini docs don't reflect that change yet. https://discuss.ai.google.dev/t/safety-settings-2025-update-...
What you describe is obviously better and more robust by a lot, but the LLM only approach is not "wrong". It’s simple, fast, easy to setup and understand, and it works. With less accuracy but it does work. Depending on the constraints, development budget and load it’s a perfectly acceptable solution.
We did this to handle 2000 documents per month and are satisfied with the results. If we need to upgrade to something better in the future we will, but in the mean time, it’s done.
Then your secret sauce will be your fine tunes, etc.
Like it or not AI/LLM will be a commodity, and this bubble will burst. Moats are hard to build when you have at least one open source copy of what you just did.
It's supposed to say 234.1, not 234.4
But there is no GitHub link or details on the implementation. Only model available seems to be one for removing weather effects from images: https://github.com/TaoWangzj/GridFormer
Could you care to expand on how you would use GridFormer for extracting tables from images? Seems like it's not as trivial as using something like Excalibur or Tabula, both which seem more battle-tested.
So if it works, I’d be a fool not to use it.
In real world usage, many tables are badly misaligned. Headers are off. Lines are missing between rows. Some columns and rows are separated by colors. Cells are merged. Some are imported from Excel. There are dotted sub sections, tables inside cells etc. Claude (and now Gemini) can parse complex tables and convert that to meaningful data. Your solution will likely fail, because rules are fuzzy in the same way written language is fuzzy.
Recently someone posted this on HN, it's a good read: https://lukaspetersson.com/blog/2025/bitter-vertical/
> You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect 100% high-fidelity images.
No, not like that, but often as nested JSON or XML. For financial documents, our accuracy was above 99%. There are many ways to do error checking to figure out which ones are likely to have errors.
> This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.
One should refrain from making statements about cost without knowing how and where it'll be used. When processing millions of PDFs, it could be a problem. When processing 1000, one might prefer Gemini/other over spending engineering time. There are many apps where processing a single doc is, say, $10 in revenue. You don't care about OCR costs.
> I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.
The author presented techniques which worked for them. It may not work for you, because there's no one-size-fits-all for these kinds of problems.
This approach should make the LLM deterministic regardless of the temperature chosen.
P.S. Choosing lower and lower temperatures will make the LLM more deterministic but it will never be totally deterministic because there will always be some probability in other tokens. Also it is not possible to use temperature as exactly 0 due to exp(1/T) blowup. Like I mentioned above, you could avoid fiddling with temperature and just decide to always choose token with highest probability for full determinism.
There are probably other, more subtle things that might make the LLM non-deterministic from run to run though. It could be due to some non-determinism in the GPU/CPU hardware. Floating point is very sensitive to ordering.
TL;DR for as much determinism as possible just choose token with highest probability (i.e. dont sample the distribution).
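A small numpy sketch of the difference (the logits are made up):

```python
# Sketch: temperature reshapes the softmax distribution; greedy argmax ignores it entirely.
import numpy as np

logits = np.array([2.0, 1.5, 0.3])   # made-up next-token logits

def softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max()                  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(logits, temperature=1.0))  # relatively flat distribution
print(softmax(logits, temperature=0.1))  # sharply peaked, but still a distribution
print(int(np.argmax(logits)))            # greedy choice: deterministic, temperature never enters
```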
wow, this is so bad. why do it now and introduce complexity and debt if you can do it later when you actually need it? you are just riding the hype wave and trying to get the most out of it, but that's fine.
OK, sure, we can parse a PDF reliably now, but now we need to act on that data. We need to store it, make sure it ends up with the right people who need to be notified that the data is available for their review. They then need to make decisions upon that data, possible requiring input from multiple stakeholders.
All that back and forth needs to be recorded and stored, along with the eventual decision and the all supporting documents and that whole bundle needs to be made available across multiple systems, which requires a bunch of ETLs and governance.
An LLM with a prompt doesn't replace all that.
Despite that, cheaper is better.
Why would the reduction step of a single neuron be split across multiple threads? That sounds slower and more complex than the naive method. And if you do decide to write code doing that, then just the code that reduces across multiple blocks needs to use integers, so pretty much no extra effort is needed.
Like, is there a nondeterministic-dot-product instruction baked into the GPU at a low level?
Marker output will be higher quality than docling output across most doc types, especially with the --use_llm flag. A few specific things we do differently:
- We have hybrid mode with gemini that merges tables across pages, improves quality on forms, etc.
- we run an ordering model, so ordering is better for docs where the PDF order is bad
- OCR is a lot better, we train our own model, surya - https://github.com/VikParuchuri/surya
- References and links
- Better equation conversion (soon including inline)
https://static.foxnews.com/foxnews.com/content/uploads/2023/...
but were often written with typewriters long ago to get nice structured tabular output. Deals with text being split across lines and across pages just fine.
AI founders will learn the bitter lesson
https://news.ycombinator.com/item?id=42672790 - 25 days ago, 263 comments
The HN discussion contains a lot of interesting ideas, thanks for the pointer!
Businesses that are just "today's LLM + our bespoke improvements" won't have legs.
- data is never shared between customers
- data never gets used for training
- we also configure data retention policies to auto-purge after a time period
Many financial regulators require you to publish heavily marked up statements with iXBRL. These markups reveal nuances in the numbers that OCRing a post processed table will not understand.
Of course, financial documents are a narrow subset of the problem.
Maybe the problem is with PDF as a format: Unfortunately PDFs lose that meta information when they are built from source documents.
I can't help but feel that PDFs could probably be more portable as their acronym indicates.
The same reason I don't wait until it snows to buy snowboots. I know my environment, topography, scale, risk-profile, and costs, and can concieve of innumerable use-cases for when they will be necessary, even if it's only May, when snowboots happen to be on sale ;) What's a little closet space and the burden of locking my door when I leave the house in the interim?
B2B is different from B2C, so if one vendor has a handful of clients and they won't switch away, there's no obliterating happening.
What's opened up is even lower hanging fruit, on more trees. A SaaS company charging $3/month for the left-handed underwater basket weaver niche now becomes viable as a lifestyle business. The shovels in this could be supabase/similar, since clients can keep access to their data there even if they change frontends.
PDF is terrible because it has grown over time from a format that was originally made for one purpose into a format that is used for too many purposes. That organic growth has caused PDFs to be very difficult to use for a wide variety of use cases.
That opinion doesn't imply almost anything else that you have claimed I support (and generally do not).
There's definitely space here to help the customer realize their extraction vision because it's still hard to scale this effectively on your own!
In this case, the reason for the misinformation is the lack of communication from the DOGE entity regarding their actions. Mr. Musk wrote via tweet that he had "deleted" the digital services agency 18F, which develops the IRS Direct File program, and also deleted their X account.
https://apnews.com/article/irs-direct-file-musk-18f-6a4dc35a...
If indeed he did cut the agency, it remains to be seen how long the application will be operational.
Good post. VLMs are improving, and Gemini 2.0 definitely changes the doc prep and ingestion pipeline across the board.
What we're finding as we work with enterprise customers:
1. Attribution is super important, and VLMs aren't there yet. Combining them with layout analysis makes for a winning combo.
2. VLMs are great at prompt-based extraction, but if you have document automation and don't know ahead of time where in the tables you'll be searching, or you need to reproduce tables faithfully, then precise table extraction is important.
3. VLMs will continue to get better, but the price points are a result of economies of scale that document parsing vendors don't get. On the flip side, document parsing vendors have deployment models that Gemini can't reach.
Literally whoever has the cheapest compute.
With the speed that AI models are improving these days, it seems like the 'moat' of a better model is only a few months before it is commoditized and goes to the cheapest provider.
I would say most accelerationists/AI bulls/etc. don't really understand the true essential complexity in software development. LLMs are being seen as a silver bullet for software development, and we know what happens with silver bullets.
I have worked on large systems, both in code and in people: compilers, massive data processing systems, 10k business units.
Is that what you mean and, if so, is there anything in particular you've seen that leads you to see these problems being solved well or on the 18 month timeline? That sounds interesting to look at to me and I'd love to know more.
1). I'm incompetent enough to ignore publicly available table benchmarks.
2). I'm incompetent enough to never look at poor quality data.
3). I'm incompetent enough to not create a validation dataset for all models that were available.
Needless to say you're wrong on all three.
My day rate is $400 + taxes per hour if you want to be run through each point and why VLMs like Gemini fail spectacularly and unpredictably when left to their own devices.
You can't do point sampling to figure out where things are going; we have to look at the slope. People see a paper come out, look at the results, and say, "this fails for x, y and z; it doesn't work." That is not how scientific research works. This is why Two Minute Papers has the tagline, "hold on to your papers ... two papers down the line ..."
Copy and paste the whole thread into a SOTA model and have meta me explain it.
P.S. - You can find us here (unsiloed-ai.com) or you can reach out to me on adnan.abbas@unsiloed-ai.com
Not unless you control the underlying scheduler and force deterministic order; knowledge of all the code running isn't sufficient, as some factors affecting threading order are correlated with physical environment. For example, minute temperature gradient differences on the chip between two runs could affect how threads are allocated to CPU cores and order in which they finish.
> Why would the reduction step of a single neuron be split across multiple threads?
Doesn't have to, but can, depending on how many inputs it has. Being able to assume commutativity gives you a lot of flexibility in how you parallelize it, and allows you to minimize overhead (both in throughput and memory requirements).
> Like, is there a nondeterministic-dot-product instruction baked into the GPU at a low level?
No. There's just no dot-product instruction baked into the GPU at a low level that could handle vectors of arbitrary length. You need to write a loop, and that usually becomes some kind of parallel reduce.
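To see why that matters, here is a plain-Python sketch (not GPU code): floating-point addition is not associative, so reducing the same values with a different chunking or order can change the low bits of the result, which is exactly what a parallel reduce with run-to-run varying order does:

```python
import random

random.seed(0)
xs = [random.uniform(-1e6, 1e6) for _ in range(100_000)]

left_to_right = sum(xs)  # one fixed order

# Simulate a parallel reduce: different ordering, chunked partial sums.
shuffled = list(xs)
random.shuffle(shuffled)
partials = [sum(shuffled[i:i + 256]) for i in range(0, len(shuffled), 256)]
reduced = sum(partials)

print(left_to_right == reduced)      # frequently False
print(abs(left_to_right - reduced))  # tiny, but nonzero
```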
I'm very confused by how you're interpreting the word "each" here.
> Being able to assume commutativity gives you a lot of flexibility in how you parallelize it, and allows you to minimize overhead (both in throughput and memory requirements).
Splitting up a single neuron seems like something that would only increase overhead. Can you please explain how you get "a lot" of flexibility?
> You need to write a loop, and that usually becomes some kind of parallel reduce.
Processing a layer is a loop within a loop.
The outer loop is across neurons and needs to be parallel.
The inner loop processes every weight for a single neuron and making it parallel sounds like extra effort just to increase instruction count and mess up memory locality and make your numbers less consistent.
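To make the two loops concrete, a minimal Python sketch of that structure (not GPU code): the outer loop over neurons is embarrassingly parallel, while the inner accumulation here runs in one fixed order, which is precisely what a GPU kernel may not preserve:

```python
import numpy as np

def layer_forward(weights: np.ndarray, inputs: np.ndarray) -> np.ndarray:
    # weights: (num_neurons, num_inputs), inputs: (num_inputs,)
    outputs = np.empty(weights.shape[0])
    for n in range(weights.shape[0]):         # outer loop: one iteration per neuron
        acc = 0.0
        for w, x in zip(weights[n], inputs):  # inner loop: that neuron's weights
            acc += w * x                      # fixed left-to-right order => deterministic
        outputs[n] = acc
    return outputs
```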
Then, remember that GPUs are built around thousands of tiny parallel processors, each able to process a bunch of parallel threads (e.g. 16), but the threads have to run in larger batches (SIMD-like), and there's a complex memory management architecture built in, over which you only have so much control. Specific numbers of cores, threads, and buffer sizes, as well as access patterns, differ between GPU models, and for optimal performance, you have to break down your computation to maximize utilization. Or rather, have the runtime do it for you.
This ain't an FPGA; you don't get to organize hardware to match your network. If you have 1000 neurons per hidden layer, then individual neurons likely won't fit on a single CUDA core, so you will have to split them down the middle, at least if you're using full-float math. Speaking of which, the precision of the numbers you use is another parameter that adds to the complexity.
On the one hand, you have a bunch of mostly-linear matrix algebra, where you can tune precision. On the other hand, you have a GPU-model-specific number of parallel processors (~thousands), that can fit only so much memory, can run some specific number of SIMD-like threads in parallel, and most of those numbers are powers of two (or a multiple of), so you have also alignment to take into account, on top of memory access patterns.
By default, your network will in no way align to any of that.
It shouldn't be hard to see that assuming commutativity gives you (or rather the CUDA compiler) much more flexibility to parallelize your calculations by splitting it whichever way it likes to maximize utilization.
You can do very wide calculations on a single neuron if you want; throwing an entire SM (64 or 128 CUDA cores) at a single neuron is trivial to do in a deterministic way. And if you have a calculation so big you benefit from splitting it across SMs, doing a deterministic sum at the end will use an unmeasurably small fraction of your runtime.
And I'll remind you that I wasn't even talking about determinism across architectures, just within an architecture, so go ahead and optimize your memory layouts and block sizes to your exact card.
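A sketch of that cheap deterministic final step (plain Python standing in for the host-side or single-block code): as long as per-block partial sums are combined in a fixed order, the result is stable no matter which block finished first:

```python
def deterministic_combine(partials_by_block: dict[int, float]) -> float:
    # Combine per-block partial sums by block index, not by completion order.
    total = 0.0
    for block_id in sorted(partials_by_block):
        total += partials_by_block[block_id]
    return total
```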
- Docling: https://ds4sd.github.io/docling/
They're doubting you because the non-digital portions of processes change at people/org speed.
Which is to say that changing a core business process is a year-long effort of political consensus, rearchitecture, and change management, because you also have to coordinate all the cascading and interfacing changes.
What you are describing is similar to how computers used to detect cats. You first extract edges, textures, and gradients. Then you use a sliding window and run a classifier. Then you use NMS to merge the bounding boxes.
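For reference, a compact sketch of the non-maximum suppression step mentioned above, assuming boxes as (x1, y1, x2, y2) tuples with separate scores:

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```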
You are thinking within the existing structures, those structures will evaporate. All along the software supply chain, processes will get upended, not just because of how technical assets will be created, but also how organizations themselves are structured and react and in turn how software is created and consumed.
This is as big as the invention of the corporation, the printing press and the industrial revolution.
I am not here to tutor people on this viewpoint or defend it, I offer it and everyone can do with it what they will.
Not sure why you think buying an entire inference server is a necessity to run these models.
When I sent images of PDF page with extracted text, Gemini mixed headlines with body text, parsed tables incorrectly, and sometimes split tables—placing one part at the top of the page and the rest at the bottom. It also added random numbers (like inserting an “8” for no reason).
When using the Gemini SDK to process full PDFs, Gemini 1.5 could handle them, but Gemini 2.0 only processed the first page. Worse, both versions completely ignored tables.
Among the Gemini models, 1.5 Pro performed the best, reaching about 80% of GPT-4o’s accuracy with image parsing, but it still introduced numerous small errors.
In conclusion, no Gemini model is reliable for PDF-to-Markdown parsing, and beyond the hype, I still need to use GPT-4o.
LLM extractions are searched in OCR output, and if matched, the bounding box is displayed based on OCR output.
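A rough sketch of that matching step (the OCR word/box schema here is hypothetical, not the doc-router API): fuzzy-match the extracted string against sliding windows of OCR words and take the union of the matched words' boxes:

```python
from difflib import SequenceMatcher

def find_extraction_box(extracted, ocr_words, min_ratio=0.85):
    # Each OCR word is assumed to look like {"text": str, "box": (x1, y1, x2, y2)}.
    target = " ".join(extracted.lower().split())
    n = max(1, len(target.split()))
    best_ratio, best_window = 0.0, None
    for start in range(len(ocr_words) - n + 1):
        window = ocr_words[start:start + n]
        candidate = " ".join(w["text"].lower() for w in window)
        ratio = SequenceMatcher(None, target, candidate).ratio()
        if ratio > best_ratio:
            best_ratio, best_window = ratio, window
    if best_window is None or best_ratio < min_ratio:
        return None  # no confident match; skip the highlight
    xs1, ys1, xs2, ys2 = zip(*(w["box"] for w in best_window))
    return (min(xs1), min(ys1), max(xs2), max(ys2))
```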
Demo: app.github.ai (just register an account and try) Github: https://github.com/analytiq-hub/doc-router
Reach out to me at andrei@analytiqhub.com for questions. Am looking for feedback and collaborators.
One level of privacy is the workspace level separation in Mongo. But, if there is customer interest, other setups are possible. E.g. the way Databricks handles privacy is by actually giving each account its own back end services - and scoping workspaces within an account.
That is a good possible model.
1) I don't mind destroying the binding to get the best quality. Any idea how I do so?
2) I have a multipage double-sided scanner (Fujitsu ScanSnap). Would this be sufficient to do the scan portion?
3) Is there anything that determines the font of the book text and reproduces it somehow? And that deals with things like bold and italic and applies them either as Markdown output or what have you?
4) How do you de-paginate the raw text to reflow into (say) an epub or PDF format that will paginate based on the output device (page size/layout) specification?