550 points polskibus | 1 comments
jordan801 ◴[] No.19116099[source]
Anyone who has written a few scrapers knows how brutally ineffective this is. Yelp tried to pull the same thing, and it took me about 3 minutes to fix my "for fun" scraper. It's also really not that difficult to write a smart scraper that you tell, "Look for these things in this post. However you find them, do the same for the others." That is ultimately what I made my Yelp scraper do.

If there's a pattern, I will find it, and I will exploit it. <3
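
A minimal sketch of that "find it in one post, replicate it for the others" idea, using only Python's standard library. Everything here is an assumption made for illustration, not the actual Yelp scraper: the function names `learn_path` and `extract`, the sample markup, and the choice to match on the chain of tag names while ignoring class names (since those are what an obfuscator typically randomizes).

```python
from html.parser import HTMLParser

# Elements that never get a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathRecorder(HTMLParser):
    """Collects (path, text) pairs, where path is the tuple of open tag names."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Attributes (including class names) are deliberately ignored.
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((tuple(self.stack), text))


def text_nodes(html):
    """Return every non-empty text node together with its tag path."""
    parser = PathRecorder()
    parser.feed(html)
    parser.close()
    return parser.items


def learn_path(sample_html, known_value):
    """Find the tag path under which a known value appears in one sample post."""
    for path, text in text_nodes(sample_html):
        if text == known_value:
            return path
    return None


def extract(html, path):
    """Apply a learned tag path to another post and return the matching texts."""
    return [text for p, text in text_nodes(html) if p == path]


# Hypothetical posts: same structure, different (randomized) class names.
sample_post = '<div class="x1"><h5 class="a9"><span>Alice</span></h5><p>Hello</p></div>'
other_post = '<div class="q7"><h5 class="zz"><span>Bob</span></h5><p>Hi there</p></div>'

author_path = learn_path(sample_post, "Alice")
authors = extract(other_post, author_path)
```

Here the author's name is located once by its known value, and the learned tag path is then reused on a post whose class names differ. A real page would need a more forgiving match (for example, tolerating extra wrapper elements), which this sketch does not attempt.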

replies(8): >>19116147 #>>19116340 #>>19116656 #>>19116724 #>>19117143 #>>19117402 #>>19117423 #>>19121248 #
eeeeeeeeeeeee ◴[] No.19116147[source]
Yep, seems like a total waste of time. The people scraping will spend the necessary time to get around this (and then distribute that knowledge to the masses) so it seems like a pointless arms race. Facebook employees could better use their time on developing actual features that bring value.
replies(5): >>19116174 #>>19116250 #>>19116413 #>>19116676 #>>19117180 #
taf2 ◴[] No.19116676[source]
They could render the whole thing in canvas, for example.
replies(1): >>19116931 #
nacs ◴[] No.19116931[source]
So you'd block all canvas elements if ads are always a <canvas>.

If they turn all their posts into <canvas> then it'd kill any accessibility features and the ability to copy-paste text and such so I doubt they'd go that far.

Even then, a scraper could run OCR on the canvas image to get the text out of it.

replies(2): >>19117134 #>>19119999 #
1. sqd ◴[] No.19117134{3}[source]
I don't think these HTML pieces are very accessibility-tool friendly.