
550 points polskibus | 6 comments
jordan801 ◴[] No.19116099[source]
Anyone who has written a few scrapers knows how brutally ineffective this is. Yelp tried to pull the same thing and it took me about 3 minutes to rectify my "for fun" scraper. It's also really not that difficult to write a smart scraper that you tell, "Look for these things in this post. However you find them, replicate it for the others". Which is ultimately what I made my Yelp scraper do.

If there's a pattern, I will find it, and I will exploit it. <3

replies(8): >>19116147 #>>19116340 #>>19116656 #>>19116724 #>>19117143 #>>19117402 #>>19117423 #>>19121248 #
1. folkhack ◴[] No.19116656[source]
100% true. Have written PLENTY of scrapers and methods like this are ultimately ineffective.

Even if you absolutely mangled the HTML/selectors/DOM/etc., I feel you could always have it process screenshots of the interfaces to rip text, figure out how to interact, etc. If it's human readable, it's bot readable imo. (But in years of botting it's never come to this. I've always been able to figure out how to use the existing DOM/selectors to do my work, even with anti-bot measures.)

replies(1): >>19116828 #
2. chucksmash ◴[] No.19116828[source]
+1. At a previous employer we fed images of interest from the web into Google's OCR API to see what we could see. In addition to scene descriptions, the API will transcribe any text it detects.

With all the easy-to-use tools available to programmers today, it would not be terribly hard to use OCR on a screenshot to find the text of interest and derive the scraping code by searching for the OCR'd text in the markup.
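
A rough sketch of that idea, under some loud assumptions: the OCR step has already handed back a string (here just the literal "alice"), the match is a plain substring check (real OCR output would likely need fuzzy matching), and the "scraping code" is nothing more than a recorded path of (tag, class) pairs. The names `PathRecorder`, `derive_path` and `extract` are made up for illustration; only the standard library's `html.parser` is used.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathRecorder(HTMLParser):
    """Records the (tag, class) path leading to every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.texts = []   # list of (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append((tuple(self.stack), text))


def text_nodes(html):
    parser = PathRecorder()
    parser.feed(html)
    parser.close()
    return parser.texts


def derive_path(html, known_text):
    """Return the path of the first text node containing known_text, or None."""
    for path, text in text_nodes(html):
        if known_text in text:
            return path
    return None


def extract(html, path):
    """Return the text of every node that sits at exactly this path."""
    return [text for node_path, text in text_nodes(html) if node_path == path]


if __name__ == "__main__":
    page = ('<div class="post"><span class="author">alice</span></div>'
            '<div class="post"><span class="author">bob</span></div>')
    path = derive_path(page, "alice")   # "alice" stands in for OCR output
    print(path)
    print(extract(page, path))
```

Once a path has been derived from one known value, the same path can be replayed against the rest of the page (or other pages with the same layout) without going back to OCR.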

If none of your extant parsers can extract the info you want from the page, send it to an OCR pipeline (or, hell, Mechanical Turk) and generate a new one.
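
In code, that fallback might look something like the sketch below. Everything here is hypothetical: the two regex parsers, the "price" field they look for, and the plain list standing in for whatever queue would feed the OCR pipeline or human workers. The only point is the control flow: try each existing parser in order, and hand the page off when all of them come back empty.

```python
import re


def parse_price_v1(html):
    """Hypothetical parser for an older layout: <span class="price">…</span>."""
    match = re.search(r'<span class="price">([^<]+)</span>', html)
    return match.group(1).strip() if match else None


def parse_price_v2(html):
    """Hypothetical parser for a newer layout: a data-price attribute."""
    match = re.search(r'data-price="([^"]+)"', html)
    return match.group(1) if match else None


def extract_with_fallback(html, parsers, fallback_queue):
    """Try each parser in order; queue the page for OCR/manual review if all fail."""
    for parser in parsers:
        result = parser(html)
        if result is not None:
            return result
    # No existing parser understood this page, so set it aside. Whatever
    # consumes the queue is expected to produce a new parser from it.
    fallback_queue.append(html)
    return None


if __name__ == "__main__":
    needs_new_parser = []
    parsers = [parse_price_v1, parse_price_v2]
    print(extract_with_fallback('<b data-price="9.99"></b>', parsers, needs_new_parser))
    print(extract_with_fallback('<p>layout changed</p>', parsers, needs_new_parser))
    print(len(needs_new_parser))
```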

replies(1): >>19117833 #
3. folkhack ◴[] No.19117833[source]
Yep yep - if the text isn't distorted I can rip it from an image within minutes using pre-built OCR libraries. If the text is distorted there are full-blown API-driven services for solving CAPTCHAs and the like.
replies(1): >>19119468 #
4. TimothyBJacobs ◴[] No.19119468{3}[source]
It seems like a time span of minutes wouldn't be fast enough for on-the-fly blocking of sponsored posts?
replies(1): >>19119497 #
5. folkhack ◴[] No.19119497{4}[source]
Oh yea - I guess I had a specific use case in mind when I said that =)

What I meant is that I can hammer out some Node/Python that will grab an image w/text and put it through OCR for character extraction. "Programming" it would take me a handful of minutes.
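
For a sense of scale, a minimal Python version could look like this. It is a sketch, not what was actually used: it assumes the `pytesseract` and `Pillow` packages plus a local Tesseract install, and the helper names (`fetch_bytes`, `tesseract_ocr`, `image_text`) are invented here. The OCR engine is passed in as a parameter so it can be swapped for another library or service.

```python
import io
import urllib.request


def fetch_bytes(url, timeout=10):
    """Download the raw bytes of an image."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def tesseract_ocr(image_bytes):
    """Run Tesseract on raw image bytes (assumes pytesseract and Pillow)."""
    # Imported lazily so the other helpers work without these packages.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))


def image_text(image_bytes, ocr=tesseract_ocr):
    """Return the OCR'd text with runs of whitespace collapsed to single spaces."""
    return " ".join(ocr(image_bytes).split())


# Usage, with a placeholder URL:
#   text = image_text(fetch_bytes("https://example.com/some-image.png"))
```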

replies(1): >>19123648 #
6. TimothyBJacobs ◴[] No.19123648{5}[source]
Ahh, that makes sense!