
550 points polskibus | 3 comments | HN request time: 0s | source
jordan801 ◴[] No.19116099[source]
Anyone who has written a few scrappers knows how brutally ineffective this is. Yelp tried to pull the same thing and it took me about 3 minutes to rectify my "for fun" scraper. It's also really not that difficult to write a smart scraper that you say, "Look for these things in this post. However you find them, replicate it for the others". Which is ultimately what I made my Yelp scraper do.

If there's a pattern, I will find it, and I will exploit it. <3
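The "find it once, replicate it for the others" approach can be sketched with nothing but the standard library: locate the tag that wraps one value you already know, then generalize that wrapper into a pattern matching every sibling. (The markup and the `generalize_from_example` helper are purely illustrative, not Yelp's actual HTML.)

```python
import re

# Hypothetical listing markup with machine-generated class names.
html = '''
<div class="r-a8x3"><span class="p">$4.99</span></div>
<div class="r-a8x3"><span class="p">$12.50</span></div>
<div class="r-a8x3"><span class="p">$0.99</span></div>
'''

def generalize_from_example(page, known_value):
    # Find the exact tag that wraps the one value we already know...
    m = re.search(r'<(\w+)([^>]*)>' + re.escape(known_value) + r'</\1>', page)
    if not m:
        return []
    tag, attrs = m.group(1), m.group(2)
    # ...then reuse that wrapper as a pattern for every other instance.
    pattern = '<' + tag + re.escape(attrs) + '>([^<]+)</' + tag + '>'
    return re.findall(pattern, page)

print(generalize_from_example(html, '$4.99'))  # prints ['$4.99', '$12.50', '$0.99']
```

One known data point is enough to recover the whole set, which is why renaming classes alone buys a site very little.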

replies(8): >>19116147 #>>19116340 #>>19116656 #>>19116724 #>>19117143 #>>19117402 #>>19117423 #>>19121248 #
AndrewKemendo ◴[] No.19116724[source]
That works for a single iteration, but if there are multiple implementations that are randomly chosen when rendered, it's a lot harder.

Pretty easy to build a randomizing span algo that you can't hardcode.
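A toy version of such a randomizing defense might look like the sketch below (every name here, including the `hide` decoy class, is made up for illustration): chop the value at random offsets, wrap each piece in a span with a throwaway class, and sprinkle in decoys that CSS would hide, so the raw HTML never contains the value as one contiguous string.

```python
import random
import string

def rand_cls(n=6):
    # Throwaway class name, different on every render.
    return ''.join(random.choices(string.ascii_lowercase, k=n))

def obfuscate(text):
    # Split the value at random points...
    pieces, i = [], 0
    while i < len(text):
        step = random.randint(1, 3)
        pieces.append(text[i:i + step])
        i += step
    out = []
    for p in pieces:
        # ...wrap each piece in a randomly named span...
        out.append(f'<span class="{rand_cls()}">{p}</span>')
        # ...and occasionally emit a decoy (hidden via CSS elsewhere).
        if random.random() < 0.5:
            out.append(f'<span class="{rand_cls()} hide">{rand_cls(3)}</span>')
    return ''.join(out)

print(obfuscate('$19.99'))
```

Because the chunk boundaries and class names differ on every render, a hardcoded selector breaks immediately; as the replies note, though, the visible text order is still intact.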

replies(2): >>19116788 #>>19117148 #
1. osrec ◴[] No.19116788[source]
Not really. We have scraped many sites successfully that try this randomisation logic. There is always a pattern, which often can be determined via heuristics. It does make things trickier, but not impossible or especially difficult.
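A heuristic of the kind described can be quite short against span-shuffled markup: drop whatever marker hides the decoys, strip the tags, and the visible text order survives. (The sample fragment and the `hide` class are assumptions standing in for whatever a real site uses.)

```python
import re

# Hypothetical obfuscated markup: the price is chopped into randomly
# named spans, with one hidden decoy mixed in.
html = ('<span class="qwjzx">$1</span>'
        '<span class="bbfpa hide">zzz</span>'
        '<span class="morty">9.</span>'
        '<span class="kxo">99</span>')

def derandomize(fragment):
    # Heuristic: discard anything carrying the decoy-hiding class,
    # then drop the tags entirely -- the displayed text remains.
    no_decoys = re.sub(r'<span[^>]*\bhide\b[^>]*>.*?</span>', '', fragment)
    return re.sub(r'<[^>]+>', '', no_decoys)

print(derandomize(html))  # prints $19.99
```

The random class names never need to be understood, only ignored, which is why the randomization mostly just makes things trickier rather than impossible.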
replies(1): >>19117111 #
2. Novashi ◴[] No.19117111[source]
It feels like the point is just to raise the difficulty for script kiddies.

After all, there are always headless browsers and OCR.
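The headless-browser point is that whatever the markup looks like, the rendered, visible text is all a scraper needs. A minimal stdlib stand-in for that idea (no real browser; the `hide` class is a hypothetical marker for CSS-hidden decoys) is a parser that collects only the text a browser would display:

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    # Collect only text a browser would display, skipping anything
    # nested inside an element marked with the (hypothetical) hide class.
    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get('class', '')
        if self.hidden_depth or 'hide' in cls.split():
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)

p = VisibleText()
p.feed('<b><span class="x">$4</span><span class="hide">junk</span>.99</b>')
print(''.join(p.parts))  # prints $4.99
```

A real headless browser additionally runs the page's JavaScript and applies its CSS, so even values assembled client-side end up as plain visible text; OCR on a screenshot is the same idea taken one step further.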

replies(1): >>19117138 #
3. osrec ◴[] No.19117138[source]
Yeah, it's a bit pointless really. If you're going to put data on the open web, you should be prepared for it to be copied.