
293 points lapnect | 10 comments
1. Eisenstein ◴[] No.42154707[source]
All it does is send the image to Llama 3.2 Vision and ask it to read the text.

Note that this is just as open to hallucination as any other LLM output: it isn't scanning the pixels for text characters, it is describing the picture, drawing on the images and captions it was trained on to decide what the text says. It may completely make up words, especially ones it can't read.
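For reference, Ollama's generate endpoint accepts base64-encoded images alongside the prompt; a minimal sketch of the kind of request such a tool presumably builds (the prompt text here is made up for illustration, not taken from the tool's source):

```python
import base64
import json

def build_vision_request(image_bytes: bytes, model: str = "llama3.2-vision") -> str:
    """Build the JSON body for an Ollama /api/generate call asking a
    vision model to transcribe an image (a sketch, not the tool's code)."""
    payload = {
        "model": model,
        # Hypothetical prompt; the actual tool may phrase this differently.
        "prompt": "Transcribe all text visible in this image.",
        # Ollama expects images as base64 strings in an "images" list.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return json.dumps(payload)

body = build_vision_request(b"\x89PNG fake image bytes")
```

The model's answer comes back as free-form text, which is exactly why there is no guarantee it corresponds character-for-character to the pixels.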

replies(1): >>42154755 #
2. M4v3R ◴[] No.42154755[source]
This is also true for any other OCR system; we just never called these errors “hallucinations” in this context.
replies(4): >>42154787 #>>42154980 #>>42155011 #>>42155143 #
3. llm_trw ◴[] No.42154787[source]
It really isn't, since those systems are character-based.
4. geysersam ◴[] No.42154980[source]
I gave this tool a picture of a restaurant menu and it made up several additional entries that didn't exist in the picture... What other OCR system would do that?
5. 8n4vidtmkvmk ◴[] No.42155011[source]
OCR tools sometimes make errors, but they don't make things up. There's a difference.
6. noduerme ◴[] No.42155143[source]
No, it's not even close to OCR systems, which are based on analyzing points in a grid for each character stroke and comparing them with known characters. Just for one thing, OCR systems are deterministic. Deterministic. Look it up.
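The grid-comparison idea above can be sketched with a toy example (the 3x3 bitmaps and alphabetical tie-breaking are made up for illustration; real engines use richer features, but the deterministic principle is the same):

```python
# Toy template matcher: pure pixel comparison, no sampling anywhere.
# Glyphs are 3x3 bitmaps of 0/1 pixels.
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
}

def classify(glyph):
    """Return the template character with the most matching pixels."""
    def score(char):
        tpl = TEMPLATES[char]
        return sum(tpl[r][c] == glyph[r][c] for r in range(3) for c in range(3))
    # Iterating templates in sorted order makes tie-breaking fixed,
    # so identical input always yields identical output.
    return max(sorted(TEMPLATES), key=score)

glyph = ((0, 1, 0),
         (0, 1, 0),
         (0, 1, 0))
```

Run `classify(glyph)` twice and you get the same answer twice; there is no distribution to sample from.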
replies(2): >>42155209 #>>42155470 #
7. visarga ◴[] No.42155209{3}[source]
OCR systems use vision models and, as such, can make mistakes. They don't sample, but they do produce a probability distribution over words, just like LLMs.
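The distinction can be sketched concretely: the recognizer's head emits a probability distribution over candidate characters; an OCR pipeline typically decodes it greedily (argmax, deterministic), while an LLM decoder typically samples from it. The scores below are invented for illustration:

```python
import math
import random

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one ambiguous glyph that could be O, 0, or Q.
chars = ["O", "0", "Q"]
logits = [2.0, 1.5, 0.1]
probs = softmax(logits)

# Greedy decoding (typical OCR): always picks the most probable character.
greedy = chars[probs.index(max(probs))]

# Sampling (typical LLM decoding): draws from the distribution, so the
# output can differ from run to run.
def sample(rng):
    return rng.choices(chars, weights=probs, k=1)[0]
```

Both paths start from the same distribution; only the decoding step decides whether the result is reproducible.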
replies(1): >>42155496 #
8. alex_suzuki ◴[] No.42155470{3}[source]
One of my worries for the coming years is that people will forget what deterministic actually means. It terrifies me!
replies(1): >>42193944 #
9. ◴[] No.42155496{4}[source]
10. noduerme ◴[] No.42193944{4}[source]
Not to get real dark and philosophical (but here goes): it took somewhere around 150,000 years for humans to go from spoken language to writing, and almost all of those words were irrational. Getting from there to understanding and encoding what is or isn't provable, what is or isn't logically deterministic, took the last few hundred years.

People who have been steeped in looking at the world through that lens (whether you deal with pure math or need to understand, e.g. by running a casino, what is not deterministic, so as to fold it into your understanding of volatility and risk) can identify very quickly which factors in a scenario are deterministic and which aren't. One could almost say that this ability to discern logic from fuzz is the crowning achievement of science and civilization, and the main adaptation conferred on some humans since speech.

Unfortunately, it is very recent, and it's still an open question whether being able to tell the difference between magic and process is an evolutionary advantage. And yeah, it's scary to imagine a world where people can't; but that was practically the whole world a few centuries ago, and it wouldn't be terribly surprising if humanity regressed to that as people stopped understanding how to make tools and began treating tools like magic again. Sad time to be alive.