(twitter.com)

550 points polskibus | 5 comments | 08 Feb 19 16:04 UTC

jordan801 ◴[08 Feb 19 16:53 UTC] No.19116099[source]▶

>>19115460 (OP) #

Anyone who has written a few scrapers knows how brutally ineffective this is. Yelp tried to pull the same thing, and it took me about 3 minutes to fix my "for fun" scraper. It's also really not that difficult to write a smart scraper that you can tell, "Look for these things in this post. However you find them, replicate it for the others." That is ultimately what I made my Yelp scraper do.

If there's a pattern, I will find it, and I will exploit it. <3

replies(8): >>19116147 #>>19116340 #>>19116656 #>>19116724 #>>19117143 #>>19117402 #>>19117423 #>>19121248 #

1. AndrewKemendo ◴[08 Feb 19 17:43 UTC] No.19116724[source]▶

>>19116099 #

That works for a single iteration, but if there are multiple implementations that are randomly chosen when rendered, it's a lot harder.

It's pretty easy to build a randomizing span algorithm that a scraper can't hardcode against.
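
For illustration only, a minimal sketch of what such a randomizing span scheme might look like. Nothing here is Facebook's actual markup or code: the function name `obfuscate`, the inline `display:none` style, and the choice of zero to two decoys per letter are all assumptions made for the example.

```python
import random
import string


def obfuscate(label, rng):
    """Wrap each character of `label` in its own span, with hidden decoys mixed in.

    Hypothetical scheme: `rng` is a random.Random instance, so every render
    can produce different markup for the same visible text.
    """
    parts = []
    for ch in label:
        # Emit zero to two hidden decoy letters before each real letter.
        for _ in range(rng.randint(0, 2)):
            decoy = rng.choice(string.ascii_letters)
            parts.append(f'<span style="display:none">{decoy}</span>')
        # The real letter goes in a plain, visible span.
        parts.append(f"<span>{ch}</span>")
    return "".join(parts)


if __name__ == "__main__":
    # Two renders of the same label usually yield different markup.
    print(obfuscate("Sponsored", random.Random()))
    print(obfuscate("Sponsored", random.Random()))
```

A real deployment would presumably hide the decoys through randomized CSS classes rather than an inline style, which is what makes a hardcoded selector brittle.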

replies(2): >>19116788 #>>19117148 #

2. osrec ◴[08 Feb 19 17:51 UTC] No.19116788[source]▶

>>19116724 (TP) #

Not really. We have successfully scraped many sites that try this randomisation logic. There is always a pattern, which can often be determined via heuristics. It does make things trickier, but not impossible or especially difficult.

replies(1): >>19117111 #

3. Novashi ◴[08 Feb 19 18:21 UTC] No.19117111[source]▶

>>19116788 #

It feels like the point is just to raise the difficulty for script kiddies.

After all, there are always headless browsers and OCR.

replies(1): >>19117138 #

4. osrec ◴[08 Feb 19 18:24 UTC] No.19117138{3}[source]▶

>>19117111 #

Yeah, it's a bit pointless really. If you're going to put data on the open web, you should be prepared for it to be copied.

5. lazopm ◴[08 Feb 19 18:25 UTC] No.19117148[source]▶

>>19116724 (TP) #

I think you can just iterate over the text nodes and see if you stumble upon every letter you're looking for in the right order. That would work for any kind of randomly added text.
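
A rough sketch of that idea, using Python's standard `html.parser` to collect text nodes and then running a subsequence check. The names `TextNodeCollector` and `contains_in_order` are made up for this example, and the sample markup is invented. It ignores CSS visibility entirely, so long stretches of unrelated text can produce false positives.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects the contents of every text node in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def contains_in_order(html, target):
    """Return True if the letters of `target` appear, in order, in the text nodes."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    # A single shared iterator: each `in` test consumes characters up to the
    # match, so later letters can only be found after earlier ones.
    stream = iter("".join(collector.chunks))
    return all(ch in stream for ch in target)


if __name__ == "__main__":
    # Invented sample: the letters of "Sponsored" with decoys in between.
    sample = "<span>S</span><span>x</span><span>po</span><span>q</span><span>nsored</span>"
    print(contains_in_order(sample, "Sponsored"))
```

In practice you would run this per post container rather than over the whole page, so the match stays local to one post.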

Facebook adds 5 divs, 9 spans and 30 CSS classes to every post in the timeline