Facebook adds 5 divs, 9 spans and 30 CSS classes to every post in the timeline

(twitter.com)

jordan801 ◴[08 Feb 19 16:53 UTC] No.19116099[source]▶

>>19115460 (OP) #

Anyone who has written a few scrapers knows how brutally ineffective this is. Yelp tried to pull the same thing and it took me about 3 minutes to rectify my "for fun" scraper. It's also really not that difficult to write a smart scraper that you tell, "Look for these things in this post. However you find them, replicate it for the others". Which is ultimately what I made my Yelp scraper do.

If there's a pattern, I will find it, and I will exploit it. <3
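
A minimal sketch of the "show it one example, replicate for the others" idea, assuming the only stable signal is the nesting of tags (class names are treated as noise). This is not the actual Yelp scraper; `TextPathCollector`, `learn_path`, and `extract_by_path` are hypothetical names, and only Python's standard `html.parser` is used.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextPathCollector(HTMLParser):
    """Records every non-empty text node with the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.found = []   # list of (tag_path, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)   # class names are deliberately ignored

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def text_paths(html):
    collector = TextPathCollector()
    collector.feed(html)
    collector.close()
    return collector.found


def learn_path(html, sample_text):
    """Return the tag path of the first text node containing sample_text."""
    for path, text in text_paths(html):
        if sample_text in text:
            return path
    return None


def extract_by_path(html, path):
    """Return every text node that sits at the same tag path."""
    return [text for p, text in text_paths(html) if p == path]


if __name__ == "__main__":
    page = """
    <div class="x1"><div class="a9"><span class="q">Alice</span><p class="z">Great tacos</p></div></div>
    <div class="x7"><div class="b2"><span class="k">Bob</span><p class="m">Slow service</p></div></div>
    """
    path = learn_path(page, "Great tacos")
    print(path, extract_by_path(page, path))
```

Because the learned path contains no class names, shuffling the classes on every page load does not change what gets extracted. Extra wrapper divs would change the path, at which point the scraper relearns it from one known value.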

replies(8): >>19116147 #>>19116340 #>>19116656 #>>19116724 #>>19117143 #>>19117402 #>>19117423 #>>19121248 #

1. folkhack ◴[08 Feb 19 17:38 UTC] No.19116656[source]▶

>>19116099 #

100% true. Have written PLENTY of scrapers and methods like this are ultimately ineffective.

Even if you absolutely mangled the HTML/selectors/DOM/etc., I feel you could always have it process screenshots of the interface to rip text, figure out how to interact, etc. If it's human readable, it's bot readable imo. (But in years of botting it's never come to this. I've always been able to figure out how to use the existing DOM/selectors to do my work, even with anti-bot measures.)

replies(1): >>19116828 #

2. chucksmash ◴[08 Feb 19 17:54 UTC] No.19116828[source]▶

>>19116656 (TP) #

+1. At a previous employer we fed images of interest from the web into Google's OCR API to see what we could see. In addition to scene descriptions, the API will transcribe any text it detects.

With all the easy to use tools available to programmers today, it would not be terribly hard to use OCR on a screenshot to find the text of interest and derive the scraping code by searching for the OCR'd text in the markup.
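
As a rough sketch of that second step, assuming the OCR output is already available as a string from whatever engine you use: fuzzy-match it against the text nodes in the markup and report where the best match lives. `SelectorCollector` and `locate_ocr_text` are hypothetical names, the 0.8 threshold is an arbitrary assumption, and the "selector" is just a readable ancestor chain rather than a guaranteed-unique CSS selector.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SelectorCollector(HTMLParser):
    """Records each text node with a CSS-like chain of its ancestors."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, "tag.class1.class2") for each open element
        self.nodes = []   # (selector, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, ".".join([tag] + classes)))

    def handle_endtag(self, tag):
        if tag in [t for t, _ in self.stack]:
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            selector = " > ".join(sel for _, sel in self.stack)
            self.nodes.append((selector, text))


def locate_ocr_text(html, ocr_text, min_ratio=0.8):
    """Return (selector, markup_text) for the node closest to the OCR'd text.

    OCR output is noisy ("plzza", "t0wn"), so compare by similarity ratio
    instead of exact equality. Returns (None, None) if nothing is close enough.
    """
    collector = SelectorCollector()
    collector.feed(html)
    collector.close()
    wanted = " ".join(ocr_text.lower().split())
    best_ratio, best_selector, best_text = 0.0, None, None
    for selector, text in collector.nodes:
        ratio = SequenceMatcher(None, wanted, text.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_selector, best_text = ratio, selector, text
    if best_ratio >= min_ratio:
        return best_selector, best_text
    return None, None
```

The returned chain tells you which element to target on this page load; the clean text from the markup replaces the noisy OCR reading.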

If none of your extant parsers can extract the info you want from the page, send it to an OCR pipeline (or, hell, Mechanical Turk) and generate a new one.
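
That fallback loop might look something like this. It is only a sketch: `extract_with_fallback` and `build_parser` are hypothetical names, and `build_parser` stands in for whatever slow path (OCR, Mechanical Turk) produces a working parser for the new layout.

```python
from typing import Callable, List, Optional

# A parser returns the extracted value, or None if it does not fit this page.
Parser = Callable[[str], Optional[str]]


def extract_with_fallback(
    page: str,
    parsers: List[Parser],
    build_parser: Callable[[str], Parser],
) -> Optional[str]:
    """Try every known parser; if all fail, build a new one and remember it."""
    for parser in parsers:
        result = parser(page)
        if result is not None:
            return result
    # Slow path: nothing we have understands this layout.
    new_parser = build_parser(page)
    parsers.append(new_parser)   # later pages with this layout hit the fast path
    return new_parser(page)
```

The slow path runs once per layout change, not once per page, which is why obfuscating the markup only costs the scraper a one-off delay.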

replies(1): >>19117833 #

3. folkhack ◴[08 Feb 19 19:32 UTC] No.19117833[source]▶

>>19116828 #

Yep yep - if the text isn't distorted I can rip it from an image within minutes using pre-built OCR libraries. If the text is distorted there are full-blown API-driven services for solving CAPTCHAs and the like.

replies(1): >>19119468 #

4. TimothyBJacobs ◴[08 Feb 19 22:56 UTC] No.19119468{3}[source]▶

>>19117833 #

It seems like a time span of minutes wouldn't be fast enough for on-the-fly blocking of sponsored posts?

replies(1): >>19119497 #

5. folkhack ◴[08 Feb 19 23:03 UTC] No.19119497{4}[source]▶

>>19119468 #

Oh yea - I guess I had a specific use case in mind when I said that =)

What I meant is that I can hammer out some Node/Python that will grab an image w/text and put it through OCR for character extraction. "Programming" it would take me a handful of minutes.
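
For a sense of scale, here is roughly what that Python could look like. This is a sketch, not the code referred to above: it assumes the Pillow and pytesseract packages plus a local Tesseract install as the default engine, and `extract_text` and its `engine` parameter are hypothetical names. Any other OCR backend can be passed in as `engine`.

```python
from typing import Callable, Optional


def _tesseract_engine(image_path: str) -> str:
    # Assumption: Pillow and pytesseract are installed and the tesseract
    # binary is on PATH. Imported lazily so callers can swap the engine.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def extract_text(
    image_path: str,
    engine: Optional[Callable[[str], str]] = None,
) -> str:
    """Run OCR on an image and return its text with whitespace normalised."""
    raw = (engine or _tesseract_engine)(image_path)
    # OCR output tends to contain stray blank lines and runs of spaces.
    lines = [" ".join(line.split()) for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)
```

Usage would be `extract_text("screenshot.png")`, where the file name is just a placeholder.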

replies(1): >>19123648 #

6. TimothyBJacobs ◴[09 Feb 19 18:05 UTC] No.19123648{5}[source]▶

>>19119497 #

Ahh, that makes sense!
