379 points mobeigi | 1 comments
DanielHB ◴[] No.41869510[source]
I want to share a story on a somewhat related topic:

anti web-scraping techniques

The most devious version of this I have ever seen left me baffled, astonished and completely helpless:

This website I was trying to scrape generated a new font (as in a .woff file) on every request. The font had the positions of the letters randomly moved around (for example, the 'J' would be in place of the 'F' character in the .woff, and so on), and the text produced by the website would be encoded to match that specific font.

So every time you loaded the website you got a completely different font with completely different encoded text, but to the user the text would look fine because the font mapped it back to the original characters. If you tried to copy-and-paste the text from the website you would get random garbled text.
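
To make the mechanism concrete, here is a minimal Python sketch of the idea as I understood it. It is not the site's actual code, and it does not build a real .woff file: the per-request font is modelled as a plain dictionary, and all names (`make_request_mapping`, `encode_text`, `render_with_font`) are made up for illustration. I am also assuming only ASCII letters get shuffled.

```python
import random
import string

# Assumption: only ASCII letters are shuffled; everything else passes through.
ALPHABET = string.ascii_letters


def make_request_mapping(rng):
    """Build one random permutation (real char -> stored char) for a request."""
    shuffled = list(ALPHABET)
    rng.shuffle(shuffled)
    return dict(zip(ALPHABET, shuffled))


def encode_text(text, mapping):
    """What the server would emit in the HTML: text scrambled for this font."""
    return "".join(mapping.get(ch, ch) for ch in text)


def render_with_font(encoded, mapping):
    """Stand-in for the generated font: the glyph stored at each scrambled
    position draws the original character, so the reader sees the real text."""
    glyphs = {stored: real for real, stored in mapping.items()}
    return "".join(glyphs.get(ch, ch) for ch in encoded)


if __name__ == "__main__":
    rng = random.Random()  # a fresh permutation per "request"
    mapping = make_request_mapping(rng)
    page_text = encode_text("Hello, scraper", mapping)
    print(page_text)                             # what copy-and-paste yields
    print(render_with_font(page_text, mapping))  # what the user sees
```

The point is that the mapping lives only in the font's glyph shapes, so a scraper that sees just the HTML gets the scrambled string and has no table to invert it with.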

The only way I could think of to scrape that would have been to OCR the .woff font files, but the need for OCR could easily prevent mass-scraping due to sheer processing costs.

replies(7): >>41869674 #>>41869684 #>>41869775 #>>41869796 #>>41869877 #>>41870330 #>>41871277 #
DaiPlusPlus ◴[] No.41869674[source]
> easily prevent mass-scraping due to sheer processing costs.

My 2018 iPad Pro does OCR on images in Safari instantly. People only think OCR is slow because Adobe Acrobat still uses the same single-threaded OCR algo it’s had for decades now; then consider how blazing a GPU-based impl would be…

replies(2): >>41870062 #>>41870521 #
jakjak123 ◴[] No.41870521[source]
It preprocesses your photo library while charging.
replies(1): >>41870659 #
ChadNauseam ◴[] No.41870659[source]
The GP mentioned it working for pictures viewed in Safari.
replies(2): >>41871218 #>>41877658 #
1. jakjak123 ◴[] No.41877658{3}[source]
Yeah, I was thinking more about why it looks like it works so fast when you browse your photo library.