Most active commenters

(5)
notsylver(4)
Eisenstein(3)
8n4vidtmkvmk(3)
nutlope(3)
noduerme(3)
cess11(3)
pbhjpbhj(3)

Popular/hot comments

>>42154841 #
>>42155007 #
>>42155156 #
>>42154472 #
>>42154755 #
>>42155767 #

Llama-OCR: Document to Markdown

(llamaocr.com)

1. bbor ◴[16 Nov 24 05:11 UTC] No.42154472[source]▶

Looks awesome! Been doing a lot of OCR recently, and love the addition to the space. The reigning champion in the PDF -> Markdown space (AFAIK) is Facebook's Nougat[1], and I'm excited to hook this up to DSPy and see which works better for philosophy books. This repo links the Zerox[2] project by some startup, which also looks awesome, and certainly more smoothly advertised than Nougat. Would love corrections/advice from any actual experts passing by this comment section :)

That said, I have a few questions if OP/anyone knows the answers:

1. What is Together.ai, and is this model OSS? Their website sells them as a hosting service, and the "Custom Models" page[3] seems to be about custom finetuning, not, like, training new proprietary models in-house. They might have a HuggingFace profile but it's hard to tell if it's them https://huggingface.co/TogetherAI

2. The GitHub says "hosted demo", but the hosting part is just the tiny (clean!) WebGUI, yes? It's implied that this functionality is and will always be available only through API calls?

P.S. The header links are broken on my desktop browser -- no onClick triggered

[1] https://facebookresearch.github.io/nougat/

[2] https://github.com/getomni-ai/zerox

[3] https://www.together.ai/products#custom-models

replies(4): >>42154592 #>>42154679 #>>42154719 #>>42154807 #

2. magicalhippo ◴[16 Nov 24 05:37 UTC] No.42154592[source]▶

>>42154472 #

Yeah was hoping for something I could self-host, both for privacy and cost.

3. LeoPanthera ◴[16 Nov 24 05:55 UTC] No.42154666[source]▶

>>42154410 (OP) #

I wonder what the watts-per-character is of this tool.

replies(1): >>42154732 #

4. gexla ◴[16 Nov 24 05:57 UTC] No.42154677[source]▶

>>42154410 (OP) #

Should this be a "Show HN" post? Seems to just be the front-end and has no association we may make with the name Llama? Maybe together.ai gave them cloud space?

5. gexla ◴[16 Nov 24 05:58 UTC] No.42154679[source]▶

>>42154472 #

My guess is together.ai is at least partially sponsoring the demo.

6. Eisenstein ◴[16 Nov 24 06:03 UTC] No.42154707[source]▶

>>42154410 (OP) #

All it does is send the image to Llama 3.2 Vision and ask for it to read the text.

Note that this is just as open to hallucination as any other LLM output, because what it is doing is not reading the pixels looking for text characters, but describing the picture, which uses the images it trained on and their captions to determine what the text is. It may completely make up words, especially if it can't read them.

replies(1): >>42154755 #

7. jurnalanas ◴[16 Nov 24 06:05 UTC] No.42154719[source]▶

>>42154472 #

the project author is Devrel from Together.ai. This is a fantastic way to advertise a dev tool, though.

8. threatripper ◴[16 Nov 24 06:08 UTC] No.42154732[source]▶

>>42154666 #

Joules per character

replies(2): >>42154834 #>>42156277 #

9. M4v3R ◴[16 Nov 24 06:12 UTC] No.42154755[source]▶

>>42154707 #

This is also true for any other OCR system, we just never called these errors “hallucinations” in this context.

replies(4): >>42154787 #>>42154980 #>>42155011 #>>42155143 #

10. d1sxeyes ◴[16 Nov 24 06:18 UTC] No.42154776[source]▶

>>42154410 (OP) #

Seemed pretty good with handwriting. Didn’t make any mistakes with numbers in the sample I tried.

11. llm_trw ◴[16 Nov 24 06:20 UTC] No.42154787{3}[source]▶

>>42154755 #

It really isn't since those systems are character based.

12. rajansheth ◴[16 Nov 24 06:24 UTC] No.42154807[source]▶

>>42154472 #

together.ai serves 100+ open-source models including multi-modal Llama 3.2 with an OpenAI compatible API

13. sumedh ◴[16 Nov 24 06:26 UTC] No.42154811[source]▶

>>42154410 (OP) #

Site is dead now :(

replies(1): >>42155017 #

14. danielEM ◴[16 Nov 24 06:34 UTC] No.42154834{3}[source]▶

>>42154732 #

I think it is perfectly fine to describe it in Watts per character as you can easily determine how many characters per second you can process.

15. notsylver ◴[16 Nov 24 06:34 UTC] No.42154841[source]▶

>>42154410 (OP) #

I've been doing a lot of OCR recently, mostly digitising text from family photos. Normal OCR models are terrible at it, LLMs do far better. Gemini Flash came out on top from the models I tested and it wasn't even close. It still had enough failures and hallucinations to make it faster to write it in by hand. Annoying considering how close it feels to working.

This seems worse. Sometimes it replies with just the text, sometimes it replies with a full "The image is a scanned document with handwritten text...". I was hoping for some fine tuning or something for it to beat Gemini Flash, it would save me a lot of time. :(

replies(7): >>42154901 #>>42155002 #>>42155087 #>>42155372 #>>42155438 #>>42156428 #>>42156646 #

16. og_kalu ◴[16 Nov 24 06:50 UTC] No.42154901[source]▶

>>42154841 #

>Normal OCR models are terrible at it, LLMs do far better. Gemini Flash came out on top from the models I tested and it wasn't even close.

For Normal models, the state of Open Source OCR is pretty terrible. Unfortunately, the closed options from Microsoft, Google etc are much better. Did you try those ?

Interesting about Flash, what LLMs did you test ?

replies(2): >>42155032 #>>42156731 #

17. geysersam ◴[16 Nov 24 07:08 UTC] No.42154980{3}[source]▶

>>42154755 #

I gave this tool a picture of a restaurant menu and it made up several additional entries that didn't exist in the picture... What other OCR system would do that?

18. anothername12 ◴[16 Nov 24 07:10 UTC] No.42154988[source]▶

>>42154410 (OP) #

We tried this and it was an absolute shit show for us.

replies(1): >>42155868 #

19. 8n4vidtmkvmk ◴[16 Nov 24 07:15 UTC] No.42155002[source]▶

>>42154841 #

That's a bummer. I'm trying to do the exact same thing right now, digitize family photos. Some of mine have German on the back. The last OCR to hit headlines was terrible, was hoping this would be better. ChatGPT 4o has been good though, when I paste individual images into the chat. I haven't tried with the API yet, not sure how much that would cost me to process 6500 photos, many of which are blank but I don't have an easy way to filter them either.

replies(2): >>42155142 #>>42155260 #

20. nutlope ◴[16 Nov 24 07:16 UTC] No.42155007[source]▶

>>42154410 (OP) #

Hi all, I'm the author of llama-ocr. Thank you for sharing & for the kind comments! I built this earlier this week since I wanted a simple API to do OCR – it uses llama 3.2 vision (hosted on together.ai, where i work) to parse images into structured markdown. I also have it available as an npm package.

Planning to add a bunch of other features like the ability to parse PDFs, output a response in JSON, ect... If anyone has any questions, feel free to send them and I'll try to respond!

replies(5): >>42155235 #>>42155376 #>>42155942 #>>42158372 #>>42159434 #

21. 8n4vidtmkvmk ◴[16 Nov 24 07:17 UTC] No.42155011{3}[source]▶

>>42154755 #

OCR tools sometimes make errors, but they don't make things up. There's a difference.

22. nutlope ◴[16 Nov 24 07:20 UTC] No.42155017[source]▶

>>42154811 #

Should be up, please try again!

replies(1): >>42155629 #

23. nutlope ◴[16 Nov 24 07:20 UTC] No.42155019[source]▶

>>42154589 #

Thank you!

24. notsylver ◴[16 Nov 24 07:23 UTC] No.42155032{3}[source]▶

>>42154901 #

I tried open source and closed source OCR models, all were pretty bad. Google vision was probably the best of the "OCR" models, but it liked adding spaces between characters and had other issues I've forgotten. It was bad enough that I wondered if I was using it wrong. By the time I was trying to pass the text to an LLM with the image so it could do "touchups" and fix the mistakes, I gave up and decided to try LLMs for the whole task.

I don't remember the exact models, I more or less just went through the OpenRouter vision model list and tried them all. Gemini Flash performed the best, somehow better than Gemini Pro. GPT-4o/mini was terrible and expensive enough that it would have had to be near perfect to consider it. Pixtral did terribly. That's all I remember, but I tried more than just those. I think Llama 3.2 is the only one I haven't properly tried, but I don't have high hopes for it.

I think even if OCR models were perfect, they couldn't have done some of the things I was using LLMs for. Like extracting structured information at the same time as the plain text - extracting any dates listed in the text into a standard ISO format was nice, as well as grabbing peoples names. Being able to say "Only look at the hand-written text, ignore printed text" and have it work was incredible.

replies(2): >>42155515 #>>42155819 #

25. philips ◴[16 Nov 24 07:34 UTC] No.42155081[source]▶

>>42154410 (OP) #

I have recently used llama3.2-vision to handle some paper bidsheets for a charity auction and it is fairly accurate with some terrible handwriting. I hope to use it for my event next year.

I do find it rather annoying not being able to get it to consistently output a CSV though. ChatGPT and Gemini seem better at doing that but I haven’t tried to automate it.

The scale of my problem is about 100 pages of bidsheets and so some manual cleaning is ok. It is certainly better than burning volunteers time.

https://github.com/philips/paper-bidsheets

replies(2): >>42155583 #>>42164693 #

26. philips ◴[16 Nov 24 07:37 UTC] No.42155087[source]▶

>>42154841 #

Have you tried downscaling the images? I started getting better results with lower resolution images. I was using scans made with mobile phone cameras for this.

convert -density 76 input.pdf output-%d.png

https://github.com/philips/paper-bidsheets

replies(1): >>42155225 #

27. ◴[16 Nov 24 07:41 UTC] No.42155101[source]▶

>>42154410 (OP) #

28. noduerme ◴[16 Nov 24 07:46 UTC] No.42155114[source]▶

>>42154410 (OP) #

Um, I just quickly uploaded an unstructured RTF file to this and apparently broke it... unless it's just realllly slow.

If this is just for converting hand-written documents, maybe put that in the header of the website. Right now it just says "Document to Markdown", which could be interpreted lots of different ways.

29. nash ◴[16 Nov 24 07:49 UTC] No.42155121[source]▶

>>42154410 (OP) #

Holy Hallucinations batman!

Even the example images hallucinates random text

replies(1): >>42155170 #

30. bosie ◴[16 Nov 24 07:55 UTC] No.42155142{3}[source]▶

>>42155002 #

Use a local rubbish model to extract text. If it doesn’t find any on the back, don’t send it to chatgtp?

Terrascan comes to mind

replies(1): >>42159947 #

31. noduerme ◴[16 Nov 24 07:55 UTC] No.42155143{3}[source]▶

>>42154755 #

No, it's not even close to OCR systems, which are based on analyzing points in a grid for each character stroke and comparing them with known characters. Just for one thing, OCR systems are deterministic. Deterministic. Look it up.

replies(2): >>42155209 #>>42155470 #

32. mg ◴[16 Nov 24 07:59 UTC] No.42155156[source]▶

>>42154410 (OP) #

I gave it a sentence, which I created by placing 500 circles via a genetic algorithm to form a sentence. And then drew with an actual physical circle:

https://www.instagram.com/marekgibney/p/BiFNyYBhvGr/

Interestingly, it sees the circles just fine, but not the sentence. It replied with this:

    The image contains no text or other elements
    that can be represented in Markdown. It is a
    visual composition of circles and does not
    convey any information that can be translated
    into Markdown format.

replies(5): >>42155181 #>>42155186 #>>42155206 #>>42155424 #>>42156784 #

33. KeplerBoy ◴[16 Nov 24 08:04 UTC] No.42155170[source]▶

>>42155121 #

Same for me. The receipt headline only says "Trader Joe's" and yet the model insists on adding some information and transcribes "Trader Joe's Receipt". This is like Xeroxgate, but infinitely worse.

Someday this will do great damage in ways we will completely neglect and overlook.

34. echoangle ◴[16 Nov 24 08:06 UTC] No.42155181[source]▶

>>42155156 #

I can’t read anything but the „stop“ either without seeing the solution first

35. DandyDev ◴[16 Nov 24 08:07 UTC] No.42155186[source]▶

>>42155156 #

I can't read this either.

Edit: at a distance it's easier to read

replies(1): >>42155287 #

36. wasyl ◴[16 Nov 24 08:11 UTC] No.42155206[source]▶

>>42155156 #

Why is it interesting? The image does not look like anything, and you need to skew it (by looking at an angle) to see any letters (barely).

37. visarga ◴[16 Nov 24 08:12 UTC] No.42155209{4}[source]▶

>>42155143 #

OCR system use vision models and as such they can make mistakes. They don't sample but they produce a distribution of probability over words like LLMs.

replies(1): >>42155496 #

38. notsylver ◴[16 Nov 24 08:16 UTC] No.42155225{3}[source]▶

>>42155087 #

That's interesting. I downscaled the images to something like 800px but that was mostly to try improve upload times. I wonder if downscaling further and with a better algorithm would help.. I remember using CLIP and found different scaling algorithms helped text readability. Maybe the text is just being butchered when its rescaled.

Though I also tried with the high detail setting which I think would deal with most issues that come from that and it didn't seem to help much

39. Curiositry ◴[16 Nov 24 08:20 UTC] No.42155235[source]▶

>>42155007 #

Option to use a local LLM?

replies(1): >>42155548 #

40. notsylver ◴[16 Nov 24 08:29 UTC] No.42155260{3}[source]▶

>>42155002 #

I found 4o to be one of the worst, but I was using the API. I didn't test it but sometimes it feels like images uploaded through ChatGPT work better than ones through the API. I was using Gemini Flash in the end, it seemed better than 4o and the images are so cheap that I have a hard time believing google is making any money even by bandwidth costs

I also tried preprocessing images before sending them through. I tried cropping it to just the text to see if it helped. Then I tried filtering on top to try brighten the text, somehow that all made it worse. The most success I had was just holding the image in my hand and taking a photo of it, the busy background seemed to help but I have absolutely no idea why.

The main problem was that it would work well for a few dozen images, you'd start to trust it, and then it'd hallucinate or not understand a crossed out word with a correction or wouldn't see text that had faded. I've pretty much given up on the idea. My new plan is to repurpose the website I made for verifying the results into one where you enter the text manually, as well as date/location/favourite status.

41. thih9 ◴[16 Nov 24 08:36 UTC] No.42155287{3}[source]▶

>>42155186 #

If you squint it’s easier too. I wonder if lowering the resolution of the image would make the text visible to ocr.

replies(1): >>42156818 #

42. AmazingTurtle ◴[16 Nov 24 08:46 UTC] No.42155326[source]▶

>>42154410 (OP) #

One can combine apache tika OCR and feed it together with the image into LLM to fix typos.

replies(1): >>42156396 #

43. ◴[16 Nov 24 08:59 UTC] No.42155372[source]▶

>>42154841 #

44. nh2 ◴[16 Nov 24 09:00 UTC] No.42155376[source]▶

>>42155007 #

I put in a bill that has 3 identical line items and it didn't include them as 3 bullet points as usual, but generated a table with a "quantity" column that doesn't exist on the original paper.

Is this amount of larger transformation expected/desirable?

(It also means that the output is sometimes a bullet point list, sometimes a table, making further automatic processing a bit harder.)

replies(1): >>42156858 #

45. Vetch ◴[16 Nov 24 09:13 UTC] No.42155424[source]▶

>>42155156 #

Based on the fact that squinting works, I applied a Gaussian blur to the image. Here's the response I got:

Markdown:

The provided image is a blurred text that reads "STOP THINKING IN CIRCLES." There are no other visible elements such as headers, footers, subtexts, images, or tables.

Markdown Content:

STOP THINKING IN CIRCLES

As the response is not deterministic, I also tried several times with the unprocessed image but it never worked. However, all the low-pass filter effects I applied worked with a high success rate.

https://imgur.com/q7Zd7fa

replies(1): >>42155596 #

46. bboygravity ◴[16 Nov 24 09:18 UTC] No.42155438[source]▶

>>42154841 #

Have you tried Claude?

It's not good at returning the locations of text (yet), but it's insane at OCR as far as I have tested.

47. alex_suzuki ◴[16 Nov 24 09:26 UTC] No.42155470{4}[source]▶

>>42155143 #

One of my worries for the coming years is that people will forget what deterministic actually means. It terrifies me!

replies(1): >>42193944 #

48. ◴[16 Nov 24 09:32 UTC] No.42155496{5}[source]▶

>>42155209 #

49. Eisenstein ◴[16 Nov 24 09:47 UTC] No.42155548{3}[source]▶

>>42155235 #

I made a script which does exactly the same thing but locally using koboldcpp for inference. It downloads MiniCPM-V 2.6 with image projector the first time you run it. If you want to use a different model you can, but you will want to edit the instruct template to match.

* https://github.com/jabberjabberjabber/LLMOCR

replies(1): >>42155615 #

50. mosselman ◴[16 Nov 24 09:59 UTC] No.42155583[source]▶

>>42155081 #

What about using llama3.2-vision to do the OCR bit and then deferring to ChatGPT to do the CSV part?

51. mg ◴[16 Nov 24 10:03 UTC] No.42155596{3}[source]▶

>>42155424 #

I guess blurring it is similar to reducing the resolution or to looking at the image from further away.

It's interesting that the neural net figures out the circles, but not the words. Because the circles are also not so easily apparent from looking closely at the image. It could also be whirly lines.

52. nirav72 ◴[16 Nov 24 10:08 UTC] No.42155615{4}[source]▶

>>42155548 #

MiniCPM-v 2.6 is probably the best self-hosted vision model I have used so far. Not just for OCR, but also image analysis. I have it setup, so my NVR (frigate) sends couple of images upon motion alert from a driveway security camera to Ollama with minicpm-v 2.6. I’m able to get a reasonably accurate description of the vehicle that pulled into the driveway. Including describing the person that exits the vehicle and also the license plate. All sent to my phone.

replies(1): >>42163822 #

53. mkl ◴[16 Nov 24 10:15 UTC] No.42155629{3}[source]▶

>>42155017 #

It let me upload a file, but didn't produce any output.

54. alecco ◴[16 Nov 24 10:54 UTC] No.42155767[source]▶

>>42154410 (OP) #

Is it possible to do this locally with open source software? I have a lot of accounting PDFs to convert but due to privacy concerns it should not run in the cloud.

replies(4): >>42155855 #>>42156382 #>>42156587 #>>42156958 #

55. revskill ◴[16 Nov 24 11:06 UTC] No.42155810[source]▶

>>42154410 (OP) #

Non-English image is slow.

56. dleeftink ◴[16 Nov 24 11:09 UTC] No.42155819{4}[source]▶

>>42155032 #

WordNinja is pretty good as a post-processing step on wrongly split/concatenated words:

[0]: https://github.com/keredson/wordninja

57. criddell ◴[16 Nov 24 11:22 UTC] No.42155855[source]▶

>>42155767 #

Does it have to be open source, or just running locally? The paid version of Acrobat does this well. MacOS has pretty good built-in OCR capabilities and Windows isn’t far behind.

If you have the hardware for it, you can run some LLMs locally. Although for accounting data, I probably wouldn’t trust it.

58. cpursley ◴[16 Nov 24 11:27 UTC] No.42155868[source]▶

>>42154988 #

You could have at least provided some constructive feedback...

59. Szpadel ◴[16 Nov 24 11:55 UTC] No.42155942[source]▶

>>42155007 #

> Need an example image? Try ours. Great idea, I wish more services would have similar feature

60. cheema33 ◴[16 Nov 24 11:56 UTC] No.42155947[source]▶

>>42154410 (OP) #

I uploaded a multi-page PDF and it did not know what to do. This is before I went to the github repo and noticed that it wasn't supported. I think the tool should let the user know when they upload a file that is not supported.

61. sdflhasjd ◴[16 Nov 24 12:22 UTC] No.42156055[source]▶

>>42154410 (OP) #

Here's a bit of a quirk: I uploaded a webcomic as an example, all the dialog was ALL CAPS, but the output was inconsistently either sentence case or title case between panels.

I also tried some real examples a problem I'd like to use OCR with: I've got some old slides that needs digitising, and most of them are labelled, uploading one of these provides the output:

  The image appears to be a photograph of a slide or film frame, possibly from an old camera or projector. The slide is yellowed with age and has a rectangular cutout in the center, which is filled with a dark gray or black material. The cutout is surrounded by a thin border, and there is some text written on the slide in black ink.

  The text reads "Once Upon a Time" and is written in a cursive font. It is located at the bottom of the slide, below the cutout. There is also a small number "1069" written in the same font and color, but it is not clear what this number refers to.

  Overall, the image suggests that the slide is an old photograph or film frame that has been preserved for many years. The yellowing of the slide and the cursive writing suggest that it may be from the early 20th century or earlier.

So aside from unnecessary repetitious description of the slide, (and the "yellowing" is actually just white balance being off, though I can forgive that), the actual written text (not cursive) was "Once Uniquitous." and the number was 106g. It's very clearly a 'g' and not a '9'.

What I think is interesting about this is that it might be a demonstration of biases in models, it focuses too much on the slide being an antique that it hallucinated a completely cliche title. Also, it missed the forest for the trees and that the "black square" was the slide being front-lit so the text could be read, so the transparency wasn't visible.

Additionally, the API itself seems to have file size or resolution limits that are not documented

62. Tepix ◴[16 Nov 24 12:56 UTC] No.42156168[source]▶

>>42154410 (OP) #

So, i uploaded a HN screenshot and it showed some rendered text but where is the Markdown code? A site titles "Document to Markdown" that fails to give me the MarkDown? What am i overlooking?

63. amelius ◴[16 Nov 24 13:07 UTC] No.42156223[source]▶

>>42154410 (OP) #

I tried it on a Walmart receipt. It misread a 9 for a 0.

https://imgur.com/a/ni8zOmb

64. hrpnk ◴[16 Nov 24 13:10 UTC] No.42156231[source]▶

>>42154410 (OP) #

Reading the Llama community license agreement, section "Redistribution and Use" I expected to find 'Built with Llama'. Is this not required?

https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instr... links to the community license.

replies(2): >>42156244 #>>42156435 #

65. kennethwolters ◴[16 Nov 24 13:14 UTC] No.42156244[source]▶

>>42156231 #

Why don't you think that calling the app "Llama-OCR" is good enough?

replies(1): >>42156280 #

66. amelius ◴[16 Nov 24 13:24 UTC] No.42156277{3}[source]▶

>>42154732 #

I'm running this with 60Hz on my HDMI output.

67. sdflhasjd ◴[16 Nov 24 13:25 UTC] No.42156280{3}[source]▶

>>42156244 #

The license is pretty specific, if the API counts as a "service".

  i. If you distribute or make available the Llama Materials (or any derivative works thereof), or a product or service (including another AI model) that contains any of them, you shall (A) provide a copy of this Agreement with any such Llama Materials; and (B) prominently display “Built with Llama” on a related website, user interface, blogpost, about page, or product documentation.

68. constantinum ◴[16 Nov 24 13:27 UTC] No.42156289[source]▶

>>42154410 (OP) #

The problem with using LLMs for OCR is hallucinations. It makes it impossible to use in business use cases such as insurance, banking and health/medical — which demands high accuracy or predictable inaccuracy rate. Not to mention handling scale — processing millions of documents with speed and affordable costs.

For all the test use cases mentioned in this thread, I’d suggest trying LLMwhisperer. A general purpose text Pre-processor/OCR built for LLM consumption. https://pg.llmwhisperer.unstract.com

69. cess11 ◴[16 Nov 24 13:46 UTC] No.42156382[source]▶

>>42155767 #

Either you need to be somewhat tolerant when it comes to misinterpretations and hallucinations, or you'll be proofreading a lot.

A cheap hack is to push the documents through pdftotext from Poppler and if nothing or very little comes out, push them through OCRMyPDF and pipe it to pdftotext. If it's scanned you probably want some flags for deskewing and so on.

To make a bulk load of PDF mostly greppable it's a decent technique, to get every 0 as a 0 you're probably going to proofread every conversion.

70. cess11 ◴[16 Nov 24 13:49 UTC] No.42156396[source]▶

>>42155326 #

While I'm a fan of Tika a lot of people get queasy from Java and XML, they might be better served by their preferred scripting language and https://github.com/ocrmypdf/OCRmyPDF, which has the same OCR engine.

replies(1): >>42163052 #

71. ◴[16 Nov 24 13:53 UTC] No.42156428[source]▶

>>42154841 #

72. ◴[16 Nov 24 13:54 UTC] No.42156435[source]▶

>>42156231 #

73. xenodium ◴[16 Nov 24 13:54 UTC] No.42156437[source]▶

>>42154410 (OP) #

Japanese OCR to structured content works very well via chatgpt API.

https://xenodium.com/images/chatgpt-shell-repo-splits-up/jap...

Other unrelated examples https://lmno.lol/alvaro/chatgpt-shell-repo-splits-up

74. joeyblueee ◴[16 Nov 24 14:04 UTC] No.42156495[source]▶

>>42154410 (OP) #

get this error in console when requesting /ocr, and a 504 status code """ An error occurred with your deployment

FUNCTION_INVOCATION_TIMEOUT """

75. burnt-resistor ◴[16 Nov 24 14:11 UTC] No.42156535[source]▶

>>42154410 (OP) #

I might've broken it as I gave it the Intel developer’s manual combined volumes. }:)

76. Eisenstein ◴[16 Nov 24 14:24 UTC] No.42156587[source]▶

>>42155767 #

I don't recommend using it for anything important unless you very diligently proofread it, but I made one that runs locally that I linked to elsewhere in this post:

* https://news.ycombinator.com/item?id=42155548

77. danvk ◴[16 Nov 24 14:35 UTC] No.42156646[source]▶

>>42154841 #

I've had really good luck recently running OCR over a corpus of images using gpt-4o. The most important thing I realized was that non-fancy data prep is still important, even with fancy LLMs. Cropping my images to just the text (excluding any borders) and increasing the contrast of the image helped enormously. (I wrote about this in 2015 and this post still holds up well with GPT: https://www.danvk.org/2015/01/07/finding-blocks-of-text-in-a...).

I also found that giving GPT at most a few paragraphs at a time worked better than giving it whole pages. Shorter text = less chance to hallucinate.

replies(1): >>42156712 #

78. pbhjpbhj ◴[16 Nov 24 14:52 UTC] No.42156712{3}[source]▶

>>42156646 #

Have you tried doing a verification pass: so giving gpt-4o the output of the first pass, and the image, and asking if they can correct the text (or if they match, or...)?

Just curious whether repetition increases accuracy or of it hurt increases the opportunities for hallucinations?

replies(1): >>42163981 #

79. fros1y ◴[16 Nov 24 14:53 UTC] No.42156715[source]▶

>>42154410 (OP) #

Are there any OCR engines out there that actually recognizes underlines properly? Even the LLMs seem to struggle to model the underline (though they get the text fine).

80. pbhjpbhj ◴[16 Nov 24 14:56 UTC] No.42156731{3}[source]▶

>>42154901 #

The OCR in OneNote is incredible IME. But, I've not tested in a wide range of fonts -- only that I have abysmal handwriting and it will find words that are almost unrecognisable.

81. ggerules ◴[16 Nov 24 15:05 UTC] No.42156784[source]▶

>>42155156 #

Was the original LLM ever trained on original material like this?

Pretty cool use of genetic algorithm! Would love to see the code or at least the reward function.

82. pbhjpbhj ◴[16 Nov 24 15:09 UTC] No.42156818{4}[source]▶

>>42155287 #

I wonder if you could do a composite image, like bracketed images, and so give the model multiple goes, for which it could amalgamate results. So, you could do an exposure bracket, do a focus/blur, maybe a stretch/compression, or an adjustment for font-height as a proportion of the image.

Feed all of the alternatives to the model, tell it they each have the same textual content?

83. zainia ◴[16 Nov 24 15:18 UTC] No.42156858{3}[source]▶

>>42155376 #

Here's the prompt being used, tweaking that might help: https://github.com/Nutlope/llama-ocr/blob/main/src/index.ts#...

84. bugglebeetle ◴[16 Nov 24 15:45 UTC] No.42156958[source]▶

>>42155767 #

Yes, Docling and Marker do very similar things and can be run fully locally.

85. MattDaEskimo ◴[16 Nov 24 16:44 UTC] No.42157312[source]▶

>>42154410 (OP) #

Dreamt of fine design, layers of code, art refined— found wrappers instead.

Nothing to see here folks.

86. rasz ◴[16 Nov 24 17:37 UTC] No.42157649[source]▶

>>42154410 (OP) #

Old scan of Asus P3B-F motherboard schematic from 1997.

- only managed to extract some of the text from Title Block (project name, date etc)

- despite distinct font got all 8/B and 1/I mixed up.

- the actual useful info got turned into

    Tables
    Table 1: [Insert table 1 here]

    Other Elements
    [Insert other elements here]

87. generalizations ◴[16 Nov 24 18:57 UTC] No.42158346[source]▶

>>42154410 (OP) #

How does it handle images? That has seemed to be the major weak point of these doc-to-markdown systems.

88. gcr ◴[16 Nov 24 19:00 UTC] No.42158372[source]▶

>>42155007 #

How accurate is this?

When compared with existing OCR systems, what sorts of mistakes does it make?

89. rch ◴[16 Nov 24 21:18 UTC] No.42159434[source]▶

>>42155007 #

I've had trouble with pulling scientific content out of poster PDFs, mostly because e.g. nougat falls apart with different layouts.

Have you considered that usage yet?

90. 8n4vidtmkvmk ◴[16 Nov 24 22:27 UTC] No.42159947{4}[source]▶

>>42155142 #

"Terrascan" is a vision model? The only hits I'm getting are for a static code analyzer.

replies(1): >>42176149 #

91. sinuhe69 ◴[17 Nov 24 03:44 UTC] No.42161830[source]▶

>>42154410 (OP) #

Very funny. I put in 3 screen captures of a (long) document, and it did relatively well. But when I proof-read it, I realized the AI has made up passages that were not there!

The reason is probably due to the nature of screen capturing, some sentences or paragraphs were cut short. That probably kicked off the “fill in the blank” nature of the LLM and it could not resist to leave these paragraphs stand unfinished :LOL. It even put in a short conclusion paragraph that was not in the original document at all!

replies(1): >>42162577 #

92. abenga ◴[17 Nov 24 07:32 UTC] No.42162577[source]▶

>>42161830 #

It boggles my mind that a technology where "making things up" is even a remote possibility is ever actually considered for use in the real world.

93. AmazingTurtle ◴[17 Nov 24 09:28 UTC] No.42163052{3}[source]▶

>>42156396 #

May I introduce you to `apache/tika:2.9.2.1-full` with a REST API on 9998.

replies(1): >>42163432 #

94. cess11 ◴[17 Nov 24 11:04 UTC] No.42163432{4}[source]▶

>>42163052 #

Not sure what you mean. Are they making Graal-builds you can run standalone now? I only use Tika through Maven at work, might not be up to date on what happens in the project.

95. timmattison ◴[17 Nov 24 12:22 UTC] No.42163822{5}[source]▶

>>42155615 #

I love this. Can you share the source?

96. danvk ◴[17 Nov 24 13:00 UTC] No.42163981{4}[source]▶

>>42156712 #

I have not, but that's a great idea!

97. wriggler ◴[17 Nov 24 15:33 UTC] No.42164693[source]▶

>>42155081 #

I'd love to hear how Handwriting OCR (https://www.handwritingocr.com) compares for your task.

It's not free, but its accuracy for for handwritten documents is the best out there (I am the founder, so am biased, but I'm really excited about where the accuracy is now). It could save you time and for your 100 page project would cost only $12.

replies(1): >>42169902 #

98. KetoManx64 ◴[18 Nov 24 05:16 UTC] No.42169902{3}[source]▶

>>42164693 #

My main qualm with a project like yours is that I have to upload my documents to a third party and trust them with that data. I have a couple thousand pages worth of journal entries from the last decade and I would never upload those to a website to get OCR'd, but with a local Ollama model I have full control of the data and it all stays local.

replies(1): >>42189859 #

99. bosie ◴[18 Nov 24 19:46 UTC] No.42176149{5}[source]▶

>>42159947 #

sorry, i meant "Tesseract"

100. wriggler ◴[20 Nov 24 01:12 UTC] No.42189859{4}[source]▶

>>42169902 #

I understand your concern, and it's a common one. However, we can only give assurances in our privacy policy that your data is used only to perform the OCR, and nothing else. You can delete all data from the server immediately after downloading your results and no trace will be left.

Of course a local solution like Ollama is preferable for privacy reasons but, for now, the OCR performance of available local models is just not very good, especially from handwritten documents. With a couple thousand pages of journal entries, that means a lot of post-processing and editing.

101. noduerme ◴[20 Nov 24 13:54 UTC] No.42193944{5}[source]▶

>>42155470 #

Not to get real dark and philosophical (but here goes) it took somewhere around 150,000 years for humans to go from spoken language to writing. And almost all of those words were irrational. From there to understanding and encoding what is or isn't provable, or is or isn't logically deterministic, took the last few hundred years. And people who have been steeped in looking at the world through that lens (whether you deal with pure math or need to understand, e.g. by running a casino, what is not deterministic, so as to add it to your understanding of volatility and risk) are able to identify which factors in any scenario are deterministic or not very quickly. One could almost say that this ability to discern logic from fuzz is the crowning achievement of science and civilization, and the main adaptation conferred upon some humans since speech. Unfortunately, it is very recent, and it's still an open question as to whether it's an evolutionary advantage to be able to tell the difference between magic and process. And yeah, it's scary to imagine a world where people can't; but that was practically the whole world a few centuries ago, and it wouldn't be terribly surprising if humanity regressed to that as they stopped understanding how to make tools and most people began treating tools like magic again. Sad time to be alive.

↑