
323 points lermontov | 6 comments
mmastrac ◴[] No.41906276[source]
I started a quick transcription here -- not enough time to complete more than half the first column, but some scans and very rough OCR are here if anyone is interested in contributing:

https://github.com/mmastrac/gibbet-hill

Top and bottom halves of the page in the repo here:

https://github.com/mmastrac/gibbet-hill/blob/main/scan-1.png

https://github.com/mmastrac/gibbet-hill/blob/main/scan-2.png

EDIT: If you have access to a multi-modal LLM, the rough transcription + the column scan and the instruction to "OCR this text, keep linebreaks" gives a _very good_ result.

EDIT 2: Rough draft, needs some proofreading and corrections:

https://github.com/mmastrac/gibbet-hill/blob/main/story.md

replies(6): >>41906561 #>>41907098 #>>41907235 #>>41908097 #>>41908454 #>>41918290 #
quuxplusone ◴[] No.41907098[source]
Seems like you don't need an LLM, you just need a human who (1) likes reading Stoker and (2) touch-types. :) I'd volunteer, if I didn't think I'd be duplicating effort at this point.

(I've transcribed various things over the years, including Sonia Greene's Alcestis [1] and Holtzman & Kershenblatt's "Castlequest" source code [2], so I know it doesn't take much except quick fingers and sufficient motivation. :))

[1] https://quuxplusone.github.io/blog/2022/10/22/alcestis/

[2] https://quuxplusone.github.io/blog/2021/03/09/castlequest/

EDIT: ...and as I was writing that, you seem to have finished your transcription. :)

replies(2): >>41907134 #>>41911812 #
mmastrac ◴[] No.41907134[source]
I finished a very rough Tesseract + LLM transcription, but it absolutely needs editing passes.

I've done transcription in the past myself (I did two books for Standard Ebooks, with some from-scratch transcription and lots of editing) and I know the pain. I've always found it easier to fix up OCR than to type the whole thing by hand, because my error rate when transcribing by eye is higher.
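
For anyone who wants to put a number on that error rate, here is a minimal sketch in Python, assuming you have a proofread passage and a rough pass (OCR or hand-typed) as plain strings. The function names `char_error_rate` and `changed_lines` are made up for illustration and are not part of the repo. It uses the standard library's `difflib` rather than a true edit distance, so treat the figure as a rough estimate.

```python
import difflib


def char_error_rate(reference: str, candidate: str) -> float:
    """Estimate the fraction of reference characters the candidate gets wrong.

    Based on difflib.SequenceMatcher opcodes, not a true Levenshtein
    distance, so the result is only an approximation.
    """
    if not reference:
        return 0.0 if not candidate else 1.0
    matcher = difflib.SequenceMatcher(None, reference, candidate, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count each replace/delete/insert span as the longer of its two sides.
            errors += max(i2 - i1, j2 - j1)
    return errors / len(reference)


def changed_lines(reference: str, candidate: str) -> list[str]:
    """Return unified-diff lines showing where the candidate departs from the reference."""
    return list(
        difflib.unified_diff(
            reference.splitlines(),
            candidate.splitlines(),
            fromfile="proofread",
            tofile="rough",
            lineterm="",
        )
    )
```

Running both an OCR pass and a hand-typed pass of the same column through `char_error_rate` against the proofread text would give a like-for-like comparison, and `changed_lines` shows which lines still need attention.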

If you want to tackle the proofing passes, I'm happy to add you to the repo :)

replies(1): >>41908207 #
wahnfrieden ◴[] No.41908207[source]
Use the LiveText API. Much, much better accuracy than Tesseract. You can rent access to it.
replies(1): >>41913194 #
1. CoastalCoder ◴[] No.41913194[source]
Anyone know why the parent comment would be downvoted?

I know nothing about OCR, so maybe it's obvious to others.

replies(2): >>41915337 #>>41916171 #
2. dylan604 ◴[] No.41915337[source]
It reads as a drive-by advertisement. Also, people prefer fully local over renting as a default.
replies(1): >>41916094 #
3. wahnfrieden ◴[] No.41916094[source]
It is fully local and offline. I also don’t work for Apple or own shares, in case you thought I have a financial stake in mentioning their products.
replies(1): >>41917489 #
4. wahnfrieden ◴[] No.41916171[source]
No good reason. Votes come from clueless people too. Don’t trust votes on HN.
5. dylan604 ◴[] No.41917489{3}[source]
I didn't downvote. I answered the other question. Your comment reads like an ad. That tends to get downvoted. This entire conversation about the reasons for the downvotes should itself be downvoted.

Also, if it is fully local and offline, why does it need to be rented?

replies(1): >>41917949 #
6. wahnfrieden ◴[] No.41917949{4}[source]
It doesn't need to be rented if you have an Apple device.