
272 points lermontov | 1 comment
mmastrac ◴[] No.41906276[source]
I started a quick transcription here -- not enough time to complete more than half the first column, but some scans and very rough OCR are here if anyone is interested in contributing:

https://github.com/mmastrac/gibbet-hill

Top and bottom halves of the page in the repo here:

https://github.com/mmastrac/gibbet-hill/blob/main/scan-1.png https://github.com/mmastrac/gibbet-hill/blob/main/scan-2.png

EDIT: If you have access to a multi-modal LLM, giving it the rough transcription plus the column scan, along with the instruction to "OCR this text, keep linebreaks", produces a _very good_ result.

EDIT 2: Rough draft, needs some proofreading and corrections:

https://github.com/mmastrac/gibbet-hill/blob/main/story.md

replies(5): >>41906561 #>>41907098 #>>41907235 #>>41908097 #>>41908454 #
quuxplusone ◴[] No.41907098[source]
Seems like you don't need an LLM, you just need a human who (1) likes reading Stoker and (2) touch-types. :) I'd volunteer, if I didn't think I'd be duplicating effort at this point.

(I've transcribed various things over the years, including Sonia Greene's Alcestis [1] and Holtzman & Kershenblatt's "Castlequest" source code [2], so I know it doesn't take much except quick fingers and sufficient motivation. :))

[1] https://quuxplusone.github.io/blog/2022/10/22/alcestis/

[2] https://quuxplusone.github.io/blog/2021/03/09/castlequest/

EDIT: ...and as I was writing that, you seem to have finished your transcription. :)

replies(2): >>41907134 #>>41911812 #
mmastrac ◴[] No.41907134[source]
I finished a very rough Tesseract + LLM transcription, but it absolutely needs editing passes.
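For anyone who wants to reproduce the rough OCR pass, here is a minimal sketch of driving the Tesseract CLI from Python. It is not necessarily how the draft in the repo was produced: the helper names `build_tesseract_cmd` and `rough_ocr` are made up for illustration, and the choice of `--psm 4` (single column of text) is an assumption that suits a newspaper column but may need tuning.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, lang="eng", psm=4):
    """Return the argv list for a one-shot Tesseract run that prints to stdout.

    Hypothetical helper; psm=4 (single column) is an assumed default.
    """
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing an output file.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def rough_ocr(image_path):
    """Run Tesseract on one image; return None if the binary is not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_cmd(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `rough_ocr("scan-1.png")` on a local copy of the scan would give the raw text to hand to a proofreader or an LLM cleanup pass.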

I've done transcription in the past myself (did two books for Standard Ebooks, with some from-scratch transcription and lots of editing) and I know the pain. I've always found it easier to fix up OCR than to type the whole thing by hand, because my error rate for eyeball transcription turned out to be higher.
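If you want to put a number on that error rate, one rough way is a character error rate: the edit distance between a draft and a proofread reference, divided by the reference length. The sketch below is only an illustration; `edit_distance` and `char_error_rate` are hypothetical helper names, and it assumes you already have a trusted reference text to compare against.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def char_error_rate(draft, reference):
    """Edit distance from draft to reference, as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(draft, reference) / len(reference)
```

Running it once on the OCR draft and once on a hand-typed draft of the same passage gives two comparable numbers. The table is quadratic in text length, so for a full story it is better to compare paragraph by paragraph.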

If you want to tackle the proofing passes, I'm happy to add you to the repo :)

replies(1): >>41908207 #
1. wahnfrieden ◴[] No.41908207[source]
Use the LiveText API. Much, much better accuracy than Tesseract. You can rent access to it.