272 points lermontov | 15 comments
1. mmastrac ◴[] No.41906276[source]
I started a quick transcription here -- not enough time to complete more than half the first column, but some scans and very rough OCR are here if anyone is interested in contributing:

https://github.com/mmastrac/gibbet-hill

Top and bottom halves of the page in the repo here:

https://github.com/mmastrac/gibbet-hill/blob/main/scan-1.png https://github.com/mmastrac/gibbet-hill/blob/main/scan-2.png

EDIT: If you have access to a multi-modal LLM, the rough transcription + the column scan and the instruction to "OCR this text, keep linebreaks" gives a _very good_ result.

EDIT 2: Rough draft, needs some proofreading and corrections:

https://github.com/mmastrac/gibbet-hill/blob/main/story.md

replies(5): >>41906561 #>>41907098 #>>41907235 #>>41908097 #>>41908454 #
2. simonw ◴[] No.41906561[source]
I tried extracting the content with Google Gemini 1.5 Pro 002 via https://aistudio.google.com/ - the first page (scan-2) worked fantastically well, the second page not so much. Here's what I got so far: https://gist.github.com/simonw/ba87f507ef5c11d3335959c055533...
replies(1): >>41906687 #
3. mmastrac ◴[] No.41906687[source]
I cropped the columns out into six files -- it might have an easier time with these:

https://github.com/mmastrac/gibbet-hill/blob/main/col-1-a.pn...

replies(2): >>41907087 #>>41907203 #
4. quuxplusone ◴[] No.41907098[source]
Seems like you don't need an LLM, you just need a human who (1) likes reading Stoker and (2) touch-types. :) I'd volunteer, if I didn't think I'd be duplicating effort at this point.

(I've transcribed various things over the years, including Sonia Greene's Alcestis [1] and Holtzman & Kershenblatt's "Castlequest" source code [2], so I know it doesn't take much except quick fingers and sufficient motivation. :))

[1] https://quuxplusone.github.io/blog/2022/10/22/alcestis/

[2] https://quuxplusone.github.io/blog/2021/03/09/castlequest/

EDIT: ...and as I was writing that, you seem to have finished your transcription. :)

replies(2): >>41907134 #>>41911812 #
5. mmastrac ◴[] No.41907134[source]
I finished a very rough, tesseract + LLM transcription, but it absolutely needs editing passes.

I've done transcription in the past myself (did two books for Standard Ebooks with some from-scratch transcription and lots of editing) and I know the pain. I've always found it easier to fix up OCR than to type the whole thing by hand, because my error rate with eyeball transcription tends to be higher.

If you want to tackle the proofing passes, I'm happy to add you to the repo :)
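One way to put a number on the error-rate comparison above is a character error rate: the edit distance between a trusted reference and a candidate transcription, divided by the reference length. The sketch below is only an illustration, not anything from the repo. The function names and the sample strings are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical sample: a proofread line versus a candidate transcription.
reference = "the quick brown fox"
candidate = "the quiok brown fox"
print(f"CER: {char_error_rate(reference, candidate):.3f}")
```

Running the same measure over a hand-typed page and an OCR-then-corrected page, against the same proofread reference, would show which workflow leaves fewer errors for a given person.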

replies(1): >>41908207 #
6. reaperducer ◴[] No.41907203{3}[source]
…and my wife's Halloween present has been printed.

Tip: Load the pngs into Preview, hit "Auto Levels," and crank up "Sharpness" on each one. Looks pretty good!
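For anyone without Preview, the levels half of that tip can be approximated with a linear contrast stretch. This is a rough sketch under a loud assumption: what Preview's "Auto Levels" actually does is not stated in the thread, so this simply maps the darkest pixel to black and the brightest to white. The `auto_levels` name and the tiny grayscale grid are hypothetical, and sharpening is not covered.

```python
def auto_levels(pixels, low=0, high=255):
    """Linearly stretch grayscale values so they span [low, high].

    `pixels` is a list of rows, each a list of integer intensities.
    Assumption: a plain min/max stretch, not Preview's real algorithm.
    """
    flat = [v for row in pixels for v in row]
    darkest, brightest = min(flat), max(flat)
    if darkest == brightest:
        # A flat image has no contrast to stretch; return a copy unchanged.
        return [row[:] for row in pixels]
    span = brightest - darkest
    return [
        [round(low + (v - darkest) * (high - low) / span) for v in row]
        for row in pixels
    ]


# Hypothetical 2x2 "scan" with washed-out contrast.
faded = [[50, 100], [150, 200]]
print(auto_levels(faded))
```

A real scan would be loaded with an imaging library first; the arithmetic per pixel stays the same.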

7. ◴[] No.41907235[source]
8. 1317 ◴[] No.41908097[source]
You would probably want to get the Project Gutenberg people onto it.
replies(1): >>41909486 #
9. wahnfrieden ◴[] No.41908207{3}[source]
Use LiveText API. Much much better accuracy than Tesseract. You can rent access to it.
10. cxr ◴[] No.41908454[source]
Too late. You have already been scooped by, of course, tumblr:

<https://woodsfae.tumblr.com/post/764918993659330560/gibbet-h...>

replies(2): >>41909099 #>>41909446 #
11. oliyoung ◴[] No.41909099[source]
A battle of a Tumblr user named Woodsfae versus advanced LLM transcribing new goth literature?

That's like bringing a knife to a gun fight, my friend. Never underestimate the power of a committed Tumblr user.

12. drivers99 ◴[] No.41909446[source]
In the scan, where it says "and shortly came to the edge of the Punchbowl and easted my eyes on its beauty", OP changed "easted" to "cast", and the Tumblr one says "easted[sic]" ([sic] is theirs). I wonder if it's supposed to be "feasted".
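Disagreements like this one can be surfaced mechanically by diffing two transcriptions word by word. The following is a minimal sketch using Python's standard `difflib`. The helper name `word_disagreements` is hypothetical, and the second sentence is only an illustration of the "cast" reading described above, not a quote from either transcription.

```python
import difflib


def word_disagreements(a: str, b: str):
    """Return (words_in_a, words_in_b) pairs where two transcriptions differ."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep the differing run from each side for a human to review.
            diffs.append((" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2])))
    return diffs


scan_reading = ("and shortly came to the edge of the Punchbowl "
                "and easted my eyes on its beauty")
draft_reading = ("and shortly came to the edge of the Punchbowl "
                 "and cast my eyes on its beauty")

for left, right in word_disagreements(scan_reading, draft_reading):
    print(f"{left!r} vs {right!r}")
```

The diff only flags where the readings diverge; deciding between "cast", "easted", and "feasted" still needs a person looking at the scan.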
13. mNovak ◴[] No.41909486[source]
I went ahead and made a post over at the PG proofreaders site (pgdp.net) to make them aware.
14. eru ◴[] No.41911812[source]
> Seems like you don't need an LLM, you just need a human who (1) likes reading Stoker and (2) touch-types.

LLMs are increasingly becoming cheaper and more accessible than humans with a baseline of literacy.

replies(1): >>41912668 #
15. notachatbot123 ◴[] No.41912668{3}[source]
They are also nowhere near as good. Not everything has to be solved by cheap* technological processes.

*: If you ignore the environmental costs.