
1303 points | serjester | 9 comments
lazypenguin No.42953665
I work in fintech, and we replaced an OCR vendor with Gemini for ingesting some PDFs. After trial and error with different models, Gemini won because it was so darn easy to use and it worked with minimal effort. One shouldn't underestimate how much a multi-modal model with a large context window helps ease of use. Ironically, this vendor is the best-known and most successful vendor for OCR'ing this specific type of PDF, but many of our requests failed over to their human-in-the-loop process. Despite it not being their specialization, switching to Gemini was a no-brainer after our testing.

Processing time went from something like 12 minutes on average to 6 seconds on average, accuracy was about 96% of the vendor's, and the price was significantly cheaper. Many of the 4% inaccuracies are things like handwritten "LLC" being OCR'd as "IIC", which I'd call somewhat "fair". We could probably improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this json schema", and it didn't require any fancy "prompt engineering" to contort out a result.

The Gemini developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem thanks to the weirdly large context window. Multi-modal, so it handles a lot of issues for you (PDF image vs. PDF with data), etc. I can recommend it for the use case presented in this blog post (ignoring the bounding-boxes part)!
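The setup described above can be sketched roughly as follows with the `google-generativeai` Python SDK. This is a hedged reconstruction: the model name, schema fields, and function names are illustrative assumptions, not details from the comment.

```python
import json

# Illustrative JSON schema (my own invention, not the commenter's actual one).
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["vendor_name", "total"],
}

# The simple prompt style the commenter describes: schema in, JSON out.
PROMPT = (
    "OCR this PDF into this format as specified by this json schema:\n"
    + json.dumps(INVOICE_SCHEMA)
)

def extract_pdf(pdf_path: str) -> dict:
    """Upload the PDF as a file 'part' and ask Gemini for schema-shaped JSON."""
    import google.generativeai as genai  # pip install google-generativeai
    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
    pdf = genai.upload_file(pdf_path)  # the file "part" mentioned above
    resp = model.generate_content(
        [pdf, PROMPT],
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(resp.text)
```

Requesting `application/json` output and validating the result against the schema afterwards is a cheap guard against the occasional malformed reply.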

panarky No.42953745
This is a big aha moment for me.

If Gemini can do semantic chunking at the same time as extraction, all for so cheap, with nearly perfect accuracy, and without brittle prompt-incantation magic, this is huge.

1. potatoman22 No.42953937
Small point, but is it doing semantic chunking or loading the entire PDF into context? I've heard mixed results on semantic chunking.
2. panarky No.42954142
It loads the entire PDF into context, but it would then be my job to chunk the output for RAG, and just using arbitrary fixed-size blocks, or breaking on sentences or paragraphs, is not ideal.

So I can ask Gemini to return chunks of variable size, where each chunk is one complete idea or concept, without arbitrarily chopping a logical semantic segment into multiple chunks.
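A minimal sketch of that kind of chunking request. The response schema and prompt wording here are my own illustration; the thread doesn't specify an exact format.

```python
import json

# Illustrative schema: an array of variable-size chunks, one idea each.
CHUNK_SCHEMA = {
    "type": "object",
    "properties": {
        "chunks": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "heading": {"type": "string"},
                    "text": {"type": "string"},
                },
                "required": ["text"],
            },
        }
    },
    "required": ["chunks"],
}

CHUNK_PROMPT = (
    "Extract the text of this PDF and split it into chunks, where each "
    "chunk is one complete idea or concept. Never split a logical section "
    "across chunks. Return JSON matching this schema:\n"
    + json.dumps(CHUNK_SCHEMA)
)

def semantic_chunks(response_text: str) -> list[str]:
    """Parse the model's JSON reply into a list of chunk texts for indexing."""
    return [c["text"] for c in json.loads(response_text)["chunks"]]
```

The parsed chunk texts can then be embedded and indexed directly, so chunk boundaries come from the model rather than from a fixed character count.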

3. thelittleone No.42955403
Fixed-size chunking is holding back a bunch of RAG projects on my backlog. I'll be extremely pleased if this semantic chunking solves the issue. Currently we're getting around 78-82% success on fixed-size-chunked RAG, which is far too low: users assume zero results on a RAG search equates to zero results in the source data.
4. refulgentis No.42955675
FWIW, in case you aren't doing these already or have ruled them out:

- BM25 to eliminate the zero-results-in-source-data problem

- Longer term, a peek at Gwern's recent hierarchical-embedding article. Got decent early returns even with fixed-size chunks
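For readers who want to try the BM25 suggestion, here's a toy, stdlib-only sketch of Okapi BM25 scoring (in practice you'd use a library or search engine; the corpus and query here are made up). Because BM25 is purely lexical, any document containing the query terms gets a nonzero score, which is exactly what addresses the zero-results complaint above.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 over whitespace-tokenized docs (toy: no stemming, no stopwords)."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()                      # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)                 # term frequency in this doc
        s = 0.0
        for q in query.lower().split():
            n = df[q]
            idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
            f = tf[q]
            # Saturating tf weight, normalized by document length.
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "Acme LLC quarterly revenue report",
    "employee onboarding checklist",
    "data retention policy for customer records",
]
scores = bm25_scores("acme revenue", docs)
best = docs[scores.index(max(scores))]
```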

5. thelittleone No.42955786
Much appreciated.

For others interested in BM25 for the use case above, I found this thread informative.

https://news.ycombinator.com/item?id=41034297

6. mediaman No.42956001
Agreed, BM25 honestly does an amazing job on its own sometimes, especially if the content is technical.

We use it in combination with semantic search, but sometimes turn off the semantic part to see what happens, and we're often surprised by the robustness of the results.

This works less well for cross-language or less technical content, however. It's great for acronyms, company- or industry-specific terms, project names, people, technical phrases, and so on.
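One common way to combine the lexical and semantic passes described above is reciprocal rank fusion (RRF), which merges ranked lists without having to calibrate the two scoring scales against each other. A toy sketch, with made-up document ids (the thread doesn't say how this team fuses results):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids; higher fused score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank + 1); k damps the top ranks
            # so one retriever can't dominate the fusion.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]      # lexical pass
semantic_ranking = ["doc_a", "doc_b", "doc_d"]  # embedding pass
fused = rrf([bm25_ranking, semantic_ranking])
```

Turning the semantic part off, as the comment describes, is then just dropping one list from the `rankings` argument.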

7. Tostino No.42956543
I wish we had a local model for semantic chunking. I've been wanting one for ages, but haven't had the time to build a dataset and fine-tune for that task =/.
8. jacobr1 No.42957023
Also consider methods that use reasoning to dispatch additional searches based on analysis of the returned data.
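A toy sketch of that idea: when the first search looks weak, a reasoning step reformulates the query and dispatches a follow-up search. Both `search` and `rewrite` are stand-in stubs of my own (in practice `rewrite` would be an LLM call and `search` the RAG retriever); none of these names come from the thread.

```python
def search(query: str) -> list[str]:
    """Stand-in retriever backed by a tiny hard-coded index."""
    index = {
        "llc filings": ["doc-17"],
        "limited liability company filings": ["doc-17", "doc-42"],
    }
    return index.get(query, [])

def rewrite(query: str) -> str:
    """Stand-in for a model-driven query reformulation (expand the acronym)."""
    return query.replace("llc", "limited liability company")

def search_with_retry(query: str) -> list[str]:
    """Dispatch a follow-up search when the first pass returns too little."""
    hits = search(query)
    if len(hits) < 2:                      # weak result: analyze and re-dispatch
        hits = search(rewrite(query)) or hits
    return hits
```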
9. nnurmanov No.42958974
This is my problem as well; do you have lots of documents?