550 points polskibus | 34 comments
1. jordan801 ◴[] No.19116099[source]
Anyone who has written a few scrapers knows how brutally ineffective this is. Yelp tried to pull the same thing and it took me about 3 minutes to rectify my "for fun" scraper. It's also really not that difficult to write a smart scraper that you tell, "Look for these things in this post. However you find them, replicate it for the others." That is ultimately what I made my Yelp scraper do.

If there's a pattern, I will find it, and I will exploit it. <3
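A minimal sketch of the "find it once, replicate it for the others" idea, assuming each post is well-formed markup with the same nesting and only the class names change. The functions `learn_path` and `apply_path` and the sample markup are hypothetical illustrations, not the actual Yelp scraper.

```python
import xml.etree.ElementTree as ET


def learn_path(example, known_value):
    """Return the child-index path to the element whose text equals known_value."""
    def walk(node, path):
        if (node.text or "").strip() == known_value:
            return path
        for i, child in enumerate(node):
            found = walk(child, path + [i])
            if found is not None:
                return found
        return None

    return walk(ET.fromstring(example), [])


def apply_path(post, path):
    """Follow a learned child-index path in another post and return its text."""
    node = ET.fromstring(post)
    for i in path:
        node = node[i]
    return (node.text or "").strip()


# Hypothetical posts: same structure, different (randomized) class names.
example = '<div class="a1"><span class="q7">Joe\'s Diner</span><div><b class="z3">4.5</b></div></div>'
other = '<div class="k9"><span class="m2">Thai Palace</span><div><b class="p0">3.8</b></div></div>'

rating_path = learn_path(example, "4.5")  # position of the rating in the example
print(apply_path(other, rating_path))     # prints 3.8
```

The path is purely positional, so renaming classes does nothing to it. It would break if the nesting itself were randomized, at which point you would relearn the path from a fresh example.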

replies(8): >>19116147 #>>19116340 #>>19116656 #>>19116724 #>>19117143 #>>19117402 #>>19117423 #>>19121248 #
2. eeeeeeeeeeeee ◴[] No.19116147[source]
Yep, seems like a total waste of time. The people scraping will spend the necessary time to get around this (and then distribute that knowledge to the masses) so it seems like a pointless arms race. Facebook employees could better use their time on developing actual features that bring value.
replies(5): >>19116174 #>>19116250 #>>19116413 #>>19116676 #>>19117180 #
3. pastor_elm ◴[] No.19116174[source]
Value doesn't keep the lights on.
replies(1): >>19116211 #
4. IntelMiner ◴[] No.19116211{3}[source]
Uh. Yes it does?

You can't make money off something without any value to people. People have to "want" the thing.

5. matz1 ◴[] No.19116250[source]
How is this a waste of time / pointless? It does bring value for FB: it reduces the loss of ad money.
replies(1): >>19116407 #
6. nradov ◴[] No.19116340[source]
This would make an interesting blog post, if you have time to write up the design.
7. eeeeeeeeeeeee ◴[] No.19116407{3}[source]
Because they are fighting the people that absolutely do not want to see advertisements whatsoever (most normal people I know do not have ad-blockers), which already makes them the least likely to be attracted by advertisements. If anything, those brands are souring their reputation any time they are seen by those kinds of people in a medium that those consumers find obnoxious.

You can make a similar argument that the RIAA/MPAA going after piracy is a waste of time. Again, focus on delivering value to actual customers.

If Facebook spent more time on making a friendly ecosystem / community I would be more open to signing back up again. Instead, it seems they are hyper-focused on advertisements at the expense of everything else.

replies(2): >>19117030 #>>19117034 #
8. SketchySeaBeast ◴[] No.19116413[source]
> Facebook employees could better use their time on developing actual features that bring value.

The problem here of course is that circumventing the ad blocking is the most direct way Facebook can find value.

9. folkhack ◴[] No.19116656[source]
100% true. Have written PLENTY of scrapers and methods like this are ultimately ineffective.

Even if you absolutely mangled the HTML/selectors/DOM/etc., I feel you could always have it process screenshots of the interfaces to rip text, figure out how to interact, etc. If it's human readable, it's bot readable imo. (But in years of botting it's never come to this. I've always been able to figure out how to use the existing DOM/selectors to do my work even with anti-bot measures.)

replies(1): >>19116828 #
10. taf2 ◴[] No.19116676[source]
They could render the whole thing in canvas for example
replies(1): >>19116931 #
11. AndrewKemendo ◴[] No.19116724[source]
That works for a single iteration, but if there are multiple implementations that are randomly chosen when rendered it's a lot harder.

Pretty easy to build a randomizing span algo that you can't hardcode.

replies(2): >>19116788 #>>19117148 #
12. osrec ◴[] No.19116788[source]
Not really. We have scraped many sites successfully that try this randomisation logic. There is always a pattern, which often can be determined via heuristics. It does make things trickier, but not impossible or especially difficult.
replies(1): >>19117111 #
13. chucksmash ◴[] No.19116828[source]
+1. At a previous employer we fed images of interest from the web into Google's OCR API to see what we could see. In addition to scene descriptions, the API will transcribe any text it detects.

With all the easy to use tools available to programmers today, it would not be terribly hard to use OCR on a screenshot to find the text of interest and derive the scraping code by searching for the OCR'd text in the markup.

If none of your extant parsers can extract the info you want from the page, send it to OCR pipeline (or, hell, Mechanical Turk) and generate a new one.
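A rough sketch of the "search for the OCR'd text in the markup" step, assuming the OCR result is already available as a plain string (no OCR service is called here). Since OCR output tends to be noisy, it picks the closest text node with `difflib` instead of an exact match. The `locate` helper and the sample page are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextPaths(HTMLParser):
    """Collects (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Drop everything down to the most recent matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((" > ".join(self.stack), text))


def locate(html, ocr_text):
    """Return the (tag path, text) pair closest to the OCR'd text, or None."""
    parser = TextPaths()
    parser.feed(html)
    parser.close()

    def score(item):
        return SequenceMatcher(None, item[1].lower(), ocr_text.lower()).ratio()

    return max(parser.found, key=score, default=None)


# Hypothetical page and a deliberately garbled OCR reading of it.
page = "<div><h2>Joe's Diner</h2><p><span>Open until 10 pm</span></p></div>"
print(locate(page, "0pen until l0 pm"))  # prints ('div > p > span', 'Open until 10 pm')
```

The returned tag path is the raw material for a new parser: once you know where the text lives, you can generate a selector for it and stop paying for OCR on every page.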

replies(1): >>19117833 #
14. nacs ◴[] No.19116931{3}[source]
So you'd block all canvas elements if ads are always a <canvas>.

If they turn all their posts into <canvas> then it'd kill any accessibility features and the ability to copy-paste text and such so I doubt they'd go that far.

Even then, a scraper could run OCR on the canvas image to get the text out of it.

replies(2): >>19117134 #>>19119999 #
15. rchaud ◴[] No.19117030{4}[source]
"Delivering value to customers"? It's a website-cum-message board. There is no further value to deliver via software features; all of that work is going towards meeting the needs of advertisers.

And advertisers are asking them to show what they are doing to combat adblocking. FB isn't doing this to target customers least likely to convert, they're doing it to check a box for their ad sales team.

replies(1): >>19117717 #
16. recursive ◴[] No.19117034{4}[source]
> which already makes them the least likely to be attracted by advertisements. If anything, those brands are souring their reputation any time they are seen by those kinds of people in a medium that those consumers find obnoxious.

Facebook thinks otherwise. Between you and them, I suspect they are more likely to have some evidence or trials to support their position.

17. Novashi ◴[] No.19117111{3}[source]
It feels like the point is just to raise the difficulty for script kiddies.

After all, there are always headless browsers and OCR.

replies(1): >>19117138 #
18. sqd ◴[] No.19117134{4}[source]
I don't think these HTML pieces are very accessibility-tool friendly.
19. osrec ◴[] No.19117138{4}[source]
Yeah, it's a bit pointless really. If you're going to put data on the open web, you should be prepared for it to be copied.
20. danillonunes ◴[] No.19117143[source]
Ad blockers are not as powerful as a scraper; they're basically a glorified CSS selector engine. This may be easy to bypass for a dedicated extension targeting Facebook, such as Social Fixer, but for a regular ad blocker it may be harder.
21. lazopm ◴[] No.19117148[source]
I think you can just iterate over the text nodes and see if you stumble upon every letter you're looking for in the right order, it would work for any kind of randomly added text.
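Something like this sketch, assuming the label is split across spans with junk letters mixed in. The markup and the `hides_word` name are made up for illustration, not Facebook's real DOM.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


def hides_word(html, word):
    """Return True if the letters of `word` appear, in order, across the text nodes."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    letters = iter("".join(collector.nodes))
    # `ch in letters` consumes the iterator, so the order of letters is enforced.
    return all(ch in letters for ch in word)


# Hypothetical obfuscated label: junk letters x, q are interleaved with the real ones.
sample = "<span>S</span><span>xp</span><span>on</span><span>qs</span><span>ored</span>"
print(hides_word(sample, "Sponsored"))               # prints True
print(hides_word("<p>Hello world</p>", "Sponsored"))  # prints False
```

One caveat: this is a plain subsequence test, so a long enough block of ordinary text will also contain the letters in order. You would want to run it only on short label-sized nodes, and it does not check whether the junk spans are visually hidden.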
22. ◴[] No.19117180[source]
23. singron ◴[] No.19117402[source]
It actually seems pretty effective in this case. This uBlock Origin issue has been open for half a year and none of the default lists block the ads: https://github.com/uBlockOrigin/uAssets/issues/3367
replies(1): >>19118064 #
24. MarsAscendant ◴[] No.19117423[source]
> it took me about 3 minutes to rectify my "for fun" scraper.

Did you mean to say "rectify" as in "fix/adjust"? It sounds like you might have meant "reify" – as in, "create" – but I don't know whether you had the scraper before that.

25. otabdeveloper2 ◴[] No.19117717{5}[source]
Newsflash: Facebook's customers are advertisers, not the honeypot victims stuck to their site.
replies(1): >>19118286 #
26. folkhack ◴[] No.19117833{3}[source]
Yep yep - if the text isn't distorted I can rip it from an image within minutes using pre-built OCR libraries. If the text is distorted there's full-blown API-driven services for solving CAPTCHAs and the like.
replies(1): >>19119468 #
27. robryan ◴[] No.19118064[source]
Extensions like uBlock having public block lists makes it even easier for Facebook to write something that breaks the filter.
replies(1): >>19118386 #
28. tomatocracy ◴[] No.19118286{6}[source]
But there’s a trade-off. Fewer, better-placed ads mean fewer people likely to block them and more people likely to actually look at them instead of skimming past. It’s also likely that the people who were blocking ads are less likely to click on them and are thus (for some advertisers) lower-value impressions anyway.

Revenue is (simplistically) the product of impressions and value per impression. It’s not therefore immediately obvious that moves like this actually do increase revenue for them, especially over the long term since one potential side effect of doing this is giving more ammunition to the ‘delete Facebook’ crowd.
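To make the arithmetic concrete, here is a toy calculation. Every number in it is invented purely for illustration and says nothing about Facebook's real figures; the `revenue` helper is hypothetical.

```python
def revenue(impressions, cpm_dollars):
    """Revenue in dollars: impressions times value per impression (CPM / 1000)."""
    return impressions * cpm_dollars / 1000


# All figures below are hypothetical.
gained = revenue(200_000, 4)  # forced impressions on ad-block users, assumed low value
lost = revenue(100_000, 10)   # regular impressions lost if some users leave over it
net = gained - lost

print(gained, lost, net)  # prints 800.0 1000.0 -200.0
```

With these made-up inputs the move loses money overall, and with slightly different ones it gains. The sign depends entirely on how much the recovered impressions are worth versus how many regular ones are lost.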

29. IshKebab ◴[] No.19118386{3}[source]
Yeah and Facebook has far more resources than uBlock to fight the cat-and-mouse. So it is effective.
30. TimothyBJacobs ◴[] No.19119468{4}[source]
It seems like a time span of minutes wouldn't be fast enough for on-the-fly blocking of sponsored posts?
replies(1): >>19119497 #
31. folkhack ◴[] No.19119497{5}[source]
Oh yea - I guess I had a specific use case in mind when I said that =)

What I meant is that I can hammer out some Node/Python that will grab an image w/text and put it through OCR for character extraction. "Programming" it would take me a handful of minutes.

replies(1): >>19123648 #
32. taf2 ◴[] No.19119999{4}[source]
And then you’d switch to an img tag
33. cauk ◴[] No.19121248[source]
Hey there. Regarding the semi-automatic “look for these things in the post, and however you find them replicate for others”: I’m new to scrapers. Do you have a good resource you could link on this? Thanks!
34. TimothyBJacobs ◴[] No.19123648{6}[source]
Ahh, that makes sense!