
550 points polskibus | 5 comments
jordan801 ◴[] No.19116099[source]
Anyone who has written a few scrapers knows how brutally ineffective this is. Yelp tried to pull the same thing, and it took me about 3 minutes to fix my "for fun" scraper. It's also really not that difficult to write a smart scraper that you can tell, "Look for these things in this post. However you find them, replicate it for the others." Which is ultimately what I made my Yelp scraper do.

If there's a pattern, I will find it, and I will exploit it. <3

replies(8): >>19116147 #>>19116340 #>>19116656 #>>19116724 #>>19117143 #>>19117402 #>>19117423 #>>19121248 #
1. AndrewKemendo ◴[] No.19116724[source]
That works for a single iteration, but if there are multiple implementations that are randomly chosen at render time, it's a lot harder.

Pretty easy to build a randomizing span algo that you can't hardcode.
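
A minimal sketch of what such a randomizing span routine might look like, assuming the goal is to hide a label like "Sponsored" among decoy letters that CSS keeps invisible. The function name `obfuscate`, the `decoy_rate` parameter, and the inline `display:none` style are all illustrative assumptions, not anything a particular site is known to use.

```python
import random
import string


def obfuscate(label, rng, decoy_rate=0.5):
    """Wrap each letter of `label` in a span and randomly add hidden decoy spans.

    Hypothetical helper for illustration only. `rng` is a random.Random
    instance, so every render can produce different markup.
    """
    parts = []
    for ch in label:
        parts.append(f"<span>{ch}</span>")
        # After each real letter, maybe emit a decoy the reader never sees.
        if rng.random() < decoy_rate:
            decoy = rng.choice(string.ascii_lowercase)
            parts.append(f'<span style="display:none">{decoy}</span>')
    return "".join(parts)


print(obfuscate("Sponsored", random.Random()))
```

The markup changes on every call, so a scraper cannot match one fixed string, but the real letters still appear in their original order.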

replies(2): >>19116788 #>>19117148 #
2. osrec ◴[] No.19116788[source]
Not really. We have successfully scraped many sites that try this randomisation logic. There is always a pattern, which can often be determined via heuristics. It does make things trickier, but not impossible or especially difficult.
replies(1): >>19117111 #
3. Novashi ◴[] No.19117111[source]
It feels like the point is just to raise the difficulty for script kiddies.

After all, there are always headless browsers and OCR.

replies(1): >>19117138 #
4. osrec ◴[] No.19117138{3}[source]
Yeah, it's a bit pointless really. If you're going to put data on the open web, you should be prepared for it to be copied.
5. lazopm ◴[] No.19117148[source]
I think you can just iterate over the text nodes and see if you stumble upon every letter you're looking for in the right order. It would work for any kind of randomly added text.
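
A rough sketch of that idea, using only Python's standard-library `html.parser`: collect every text node in document order, then check whether the target word is a subsequence of the combined text. The names `TextNodeCollector` and `contains_in_order` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects the contents of every text node, in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def contains_in_order(html, target):
    """Return True if the letters of `target` appear, in order, in the text nodes of `html`."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    chars = iter("".join(collector.chunks))
    # `ch in chars` consumes the iterator up to the match, so order is enforced.
    return all(ch in chars for ch in target)


# Hypothetical obfuscated label with decoy letters mixed in.
obfuscated = (
    "<span>S</span><span>x</span><span>po</span><span>q</span>"
    "<span>nso</span><span>zz</span><span>red</span>"
)
print(contains_in_order(obfuscated, "Sponsored"))
```

A subsequence match will also fire on any long enough passage of ordinary text, so in practice this probably only makes sense when applied to a small element such as a post's label, not to a whole page.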