
550 points polskibus | 3 comments | HN request time: 0s | source
jordan801 ◴[] No.19116099[source]
Anyone who has written a few scrappers knows how brutally ineffective this is. Yelp tried to pull the same thing and it took me about 3 minutes to rectify my "for fun" scraper. It's also really not that difficult to write a smart scraper that you say, "Look for these things in this post. However you find them, replicate it for the others". Which is ultimately what I made my Yelp scraper do.

If there's a pattern, I will find it, and I will exploit it. <3
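The "find it once, replicate it for the others" approach can be sketched with nothing but the standard library: locate the tag that wraps one value you already know, then generalize that wrapper into a pattern matching every sibling. (The markup and the `generalize_from_example` helper are purely illustrative, not Yelp's actual HTML.)

```python
import re

# Hypothetical listing markup with machine-generated class names.
html = '''
<div class="r-a8x3"><span class="p">$4.99</span></div>
<div class="r-a8x3"><span class="p">$12.50</span></div>
<div class="r-a8x3"><span class="p">$0.99</span></div>
'''

def generalize_from_example(page, known_value):
    # Find the exact tag that wraps the one value we already know...
    m = re.search(r'<(\w+)([^>]*)>' + re.escape(known_value) + r'</\1>', page)
    if not m:
        return []
    tag, attrs = m.group(1), m.group(2)
    # ...then reuse that wrapper as a pattern for every other instance.
    pattern = '<' + tag + re.escape(attrs) + '>([^<]+)</' + tag + '>'
    return re.findall(pattern, page)

print(generalize_from_example(html, '$4.99'))  # prints ['$4.99', '$12.50', '$0.99']
```

One known data point is enough to recover the whole set, which is why renaming classes alone buys a site very little.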

replies(8): >>19116147 #>>19116340 #>>19116656 #>>19116724 #>>19117143 #>>19117402 #>>19117423 #>>19121248 #
AndrewKemendo ◴[] No.19116724[source]
That works for a single iteration, but if there are multiple implementations that are randomly chosen when rendered, it's a lot harder.

Pretty easy to build a randomizing span algo that you can't hardcode.
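A toy version of such a randomizing defense might look like the sketch below (every name here, including the `hide` decoy class, is made up for illustration): chop the value at random offsets, wrap each piece in a span with a throwaway class, and sprinkle in decoys that CSS would hide, so the raw HTML never contains the value as one contiguous string.

```python
import random
import string

def rand_cls(n=6):
    # Throwaway class name, different on every render.
    return ''.join(random.choices(string.ascii_lowercase, k=n))

def obfuscate(text):
    # Split the value at random points...
    pieces, i = [], 0
    while i < len(text):
        step = random.randint(1, 3)
        pieces.append(text[i:i + step])
        i += step
    out = []
    for p in pieces:
        # ...wrap each piece in a randomly named span...
        out.append(f'<span class="{rand_cls()}">{p}</span>')
        # ...and occasionally emit a decoy (hidden via CSS elsewhere).
        if random.random() < 0.5:
            out.append(f'<span class="{rand_cls()} hide">{rand_cls(3)}</span>')
    return ''.join(out)

print(obfuscate('$19.99'))
```

Because the chunk boundaries and class names differ on every render, a hardcoded selector breaks immediately; as the replies note, though, the visible text order is still intact.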

replies(2): >>19116788 #>>19117148 #
1. osrec ◴[] No.19116788[source]
Not really. We have scraped many sites successfully that try this randomisation logic. There is always a pattern, which often can be determined via heuristics. It does make things trickier, but not impossible or especially difficult.
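A heuristic of the kind described can be quite short against span-shuffled markup: drop whatever marker hides the decoys, strip the tags, and the visible text order survives. (The sample fragment and the `hide` class are assumptions standing in for whatever a real site uses.)

```python
import re

# Hypothetical obfuscated markup: the price is chopped into randomly
# named spans, with one hidden decoy mixed in.
html = ('<span class="qwjzx">$1</span>'
        '<span class="bbfpa hide">zzz</span>'
        '<span class="morty">9.</span>'
        '<span class="kxo">99</span>')

def derandomize(fragment):
    # Heuristic: discard anything carrying the decoy-hiding class,
    # then drop the tags entirely -- the displayed text remains.
    no_decoys = re.sub(r'<span[^>]*\bhide\b[^>]*>.*?</span>', '', fragment)
    return re.sub(r'<[^>]+>', '', no_decoys)

print(derandomize(html))  # prints $19.99
```

The random class names never need to be understood, only ignored, which is why the randomization mostly just makes things trickier rather than impossible.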
replies(1): >>19117111 #
2. Novashi ◴[] No.19117111[source]
It feels like the point is just to raise the difficulty for script kiddies.

After all, there are always headless browsers and OCR.
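The headless-browser point is that whatever the markup looks like, the rendered, visible text is all a scraper needs. A minimal stdlib stand-in for that idea (no real browser; the `hide` class is a hypothetical marker for CSS-hidden decoys) is a parser that collects only the text a browser would display:

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    # Collect only text a browser would display, skipping anything
    # nested inside an element marked with the (hypothetical) hide class.
    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get('class', '')
        if self.hidden_depth or 'hide' in cls.split():
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)

p = VisibleText()
p.feed('<b><span class="x">$4</span><span class="hide">junk</span>.99</b>')
print(''.join(p.parts))  # prints $4.99
```

A real headless browser additionally runs the page's JavaScript and applies its CSS, so even values assembled client-side end up as plain visible text; OCR on a screenshot is the same idea taken one step further.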

replies(1): >>19117138 #
3. osrec ◴[] No.19117138[source]
Yeah, it's a bit pointless really. If you're going to put data on the open web, you should be prepared for it to be copied.