Facebook adds 5 divs, 9 spans and 30 CSS classes to every post in the timeline

(twitter.com)

1. jordan801 ◴[08 Feb 19 16:53 UTC] No.19116099[source]▶

>>19115460 (OP) #

Anyone who has written a few scrapers knows how brutally ineffective this is. Yelp tried to pull the same thing and it took me about 3 minutes to rectify my "for fun" scraper. It's also really not that difficult to write a smart scraper that you tell, "Look for these things in this post. However you find them, replicate it for the others." Which is ultimately what I made my Yelp scraper do.

If there's a pattern, I will find it, and I will exploit it. <3
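
A minimal sketch of the "learn from one example" idea described above. It assumes nothing about the commenter's actual Yelp scraper and uses only Python's standard `html.parser`. The names `PathIndexer`, `learn_paths`, and `extract`, and the sample markup, are hypothetical. You hand it one post plus the values you expect to find, it records the tag path (and occurrence number) of each value, and then it replays those locations on other posts.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not be pushed on the stack.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class PathIndexer(HTMLParser):
    """Record every non-empty text node together with its tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.items = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def index(html):
    parser = PathIndexer()
    parser.feed(html)
    parser.close()
    return parser.items


def learn_paths(example_html, wanted):
    """Map each field name to (path, nth occurrence) found in the example post."""
    seen, learned = {}, {}
    for path, text in index(example_html):
        n = seen.get(path, 0)
        seen[path] = n + 1
        for field, value in wanted.items():
            if text == value and field not in learned:
                learned[field] = (path, n)
    return learned


def extract(html, learned):
    """Replay the learned locations on another post with the same layout."""
    seen, found = {}, {}
    for path, text in index(html):
        n = seen.get(path, 0)
        seen[path] = n + 1
        for field, loc in learned.items():
            if loc == (path, n):
                found[field] = text
    return found


# Hypothetical markup: one labelled example, one unlabelled post.
example = "<div><h2>Blue Diner</h2><p><span>4.5</span><span>120 reviews</span></p></div>"
other = "<div><h2>Taco Hut</h2><p><span>3.9</span><span>48 reviews</span></p></div>"

rules = learn_paths(example, {"name": "Blue Diner", "rating": "4.5"})
print(extract(other, rules))
```

When the site reshuffles its markup, you re-run `learn_paths` on one fresh example instead of rewriting selectors by hand. Markup that varies per post would need a fuzzier matcher than exact tag paths.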

replies(8): >>19116147 #>>19116340 #>>19116656 #>>19116724 #>>19117143 #>>19117402 #>>19117423 #>>19121248 #

2. eeeeeeeeeeeee ◴[08 Feb 19 16:56 UTC] No.19116147[source]▶

>>19116099 (TP) #

Yep, seems like a total waste of time. The people scraping will spend the necessary time to get around this (and then distribute that knowledge to the masses) so it seems like a pointless arms race. Facebook employees could better use their time on developing actual features that bring value.

replies(5): >>19116174 #>>19116250 #>>19116413 #>>19116676 #>>19117180 #

3. pastor_elm ◴[08 Feb 19 16:58 UTC] No.19116174[source]▶

>>19116147 #

Value doesn't keep the lights on.

replies(1): >>19116211 #

4. IntelMiner ◴[08 Feb 19 17:01 UTC] No.19116211{3}[source]▶

>>19116174 #

Uh. Yes it does?

You can't make money off something without any value to people. People have to "want" the thing

5. matz1 ◴[08 Feb 19 17:04 UTC] No.19116250[source]▶

>>19116147 #

How is this a waste of time / pointless? They do bring value for FB: they reduce the loss of ad money.

replies(1): >>19116407 #

6. nradov ◴[08 Feb 19 17:11 UTC] No.19116340[source]▶

>>19116099 (TP) #

This would make an interesting blog post, if you have time to write up the design.

7. eeeeeeeeeeeee ◴[08 Feb 19 17:17 UTC] No.19116407{3}[source]▶

>>19116250 #

Because they are fighting the people that absolutely do not want to see advertisements whatsoever (most normal people I know do not have ad-blockers), which already makes them the least likely to be attracted by advertisements. If anything, those brands are souring their reputation anytime they are seen by those kind of people in a medium that those consumers find obnoxious.

You can make a similar argument that the RIAA/MPAA going after piracy is a waste of time. Again, focus on delivering value to actual customers.

If Facebook spent more time on making a friendly ecosystem / community I would be more open to signing back up again. Instead, it seems they are hyper-focused on advertisements at the expense of everything else.

replies(2): >>19117030 #>>19117034 #

8. SketchySeaBeast ◴[08 Feb 19 17:18 UTC] No.19116413[source]▶

>>19116147 #

> Facebook employees could better use their time on developing actual features that bring value.

The problem here of course is that circumventing the ad blocking is the most direct way Facebook can find value.

9. folkhack ◴[08 Feb 19 17:38 UTC] No.19116656[source]▶

>>19116099 (TP) #

100% true. Have written PLENTY of scrapers and methods like this are ultimately ineffective.

Even if you absolutely mangled the HTML/selectors/DOM/etc., I feel you could always have it process screenshots of the interfaces to rip text/figure out how to interact etc. If it's human readable, it's bot readable imo. (But in years of botting it's never come to this. I've always been able to figure out how to use the existing DOM/selectors to do my work, even with anti-bot measures.)

replies(1): >>19116828 #

10. taf2 ◴[08 Feb 19 17:39 UTC] No.19116676[source]▶

>>19116147 #

They could render the whole thing in a canvas, for example.

replies(1): >>19116931 #

11. AndrewKemendo ◴[08 Feb 19 17:43 UTC] No.19116724[source]▶

>>19116099 (TP) #

That works for a single iteration, but if there are multiple implementations that are randomly chosen when rendered it's a lot harder.

Pretty easy to build a randomizing span algo that you can't hardcode.

replies(2): >>19116788 #>>19117148 #

12. osrec ◴[08 Feb 19 17:51 UTC] No.19116788[source]▶

>>19116724 #

Not really. We have scraped many sites successfully that try this randomisation logic. There is always a pattern, which often can be determined via heuristics. It does make things trickier, but not impossible or especially difficult.

replies(1): >>19117111 #

13. chucksmash ◴[08 Feb 19 17:54 UTC] No.19116828[source]▶

>>19116656 #

+1. At a previous employer we fed images of interest from the web into Google's OCR API to see what we could see. In addition to scene descriptions, the API will transcribe any text it detects.

With all the easy to use tools available to programmers today, it would not be terribly hard to use OCR on a screenshot to find the text of interest and derive the scraping code by searching for the OCR'd text in the markup.

If none of your extant parsers can extract the info you want from the page, send it to OCR pipeline (or, hell, Mechanical Turk) and generate a new one.

replies(1): >>19117833 #

14. nacs ◴[08 Feb 19 18:02 UTC] No.19116931{3}[source]▶

>>19116676 #

So you'd block all canvas elements if ads are always a <canvas>.

If they turn all their posts into <canvas> then it'd kill any accessibility features and the ability to copy-paste text and such so I doubt they'd go that far.

Even then, a scraper could run OCR on the canvas image to get the text out of it.

replies(2): >>19117134 #>>19119999 #

15. rchaud ◴[08 Feb 19 18:13 UTC] No.19117030{4}[source]▶

>>19116407 #

"Delivering value to customers"? It's a website-cum-message board. There is no further value to deliver via software features; all of that work is going towards meeting the needs of advertisers.

And advertisers are asking them to show what they are doing to combat adblocking. FB isn't doing this to target customers least likely to convert, they're doing it to check a box for their ad sales team.

replies(1): >>19117717 #

16. recursive ◴[08 Feb 19 18:14 UTC] No.19117034{4}[source]▶

>>19116407 #

> which already makes them the least likely to be attracted by advertisements. If anything, those brands are souring their reputation anytime they are seen by those kind of people in a medium that those consumers find obnoxious.

Facebook thinks otherwise. Between you and them, I suspect they are more likely to have some evidence or trials to support their position.

17. Novashi ◴[08 Feb 19 18:21 UTC] No.19117111{3}[source]▶

>>19116788 #

It feels like the point is just to raise the difficulty for script kiddies.

After all, there are always headless browsers and OCR.

replies(1): >>19117138 #

18. sqd ◴[08 Feb 19 18:23 UTC] No.19117134{4}[source]▶

>>19116931 #

I don't think these HTML pieces are very accessibility-tool friendly.

19. osrec ◴[08 Feb 19 18:24 UTC] No.19117138{4}[source]▶

>>19117111 #

Yeah, it's a bit pointless really. If you're going to put data on the open web, you should be prepared for it to be copied.

20. danillonunes ◴[08 Feb 19 18:25 UTC] No.19117143[source]▶

>>19116099 (TP) #

Ad blockers are not as powerful as a scraper; they're basically a glorified CSS selector engine. This may be easy to bypass for a dedicated extension targeting Facebook, such as Social Fixer, but for a regular ad blocker it may be harder.

21. lazopm ◴[08 Feb 19 18:25 UTC] No.19117148[source]▶

>>19116724 #

I think you can just iterate over the text nodes and see if you stumble upon every letter you're looking for in the right order. That would work for any kind of randomly added text.
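
The idea above amounts to a subsequence test over the characters of a post's text nodes. A rough sketch, assuming the obfuscation works by scattering the label's letters across extra elements and mixing in decoy letters. The markup and the names `TextCollector`, `text_chars`, and `contains_in_order` are made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the contents of all text nodes, ignoring the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_chars(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def contains_in_order(haystack, needle):
    """True if every character of needle appears in haystack, in order."""
    it = iter(haystack)
    # `ch in it` advances the iterator, so each match must come after the last.
    return all(ch in it for ch in needle)


# Hypothetical header: the label is split up and padded with decoy "S" spans.
obfuscated = (
    "<span>S</span><span>p</span><span class='x'>S</span><span>on</span>"
    "<span class='x'>S</span><span>so</span><span class='x'>S</span><span>red</span>"
)
plain = "<span>Yesterday at 5pm</span>"

print(contains_in_order(text_chars(obfuscated), "Sponsored"))
print(contains_in_order(text_chars(plain), "Sponsored"))
```

A plain subsequence test will also match ordinary prose that happens to contain the letters in order, so in practice it would have to be restricted to a short region such as the post header.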

22. ◴[08 Feb 19 18:28 UTC] No.19117180[source]▶

>>19116147 #

23. singron ◴[08 Feb 19 18:48 UTC] No.19117402[source]▶

>>19116099 (TP) #

It actually seems pretty effective in this case. This uBlock Origin issue has been open for half a year and none of the default lists block the ads: https://github.com/uBlockOrigin/uAssets/issues/3367

replies(1): >>19118064 #

24. MarsAscendant ◴[08 Feb 19 18:50 UTC] No.19117423[source]▶

>>19116099 (TP) #

> it took me about 3 minutes to rectify my "for fun" scraper.

Did you mean to say "rectify" as in "fix/adjust"? It sounds like you might have meant "reify" (as in, "create"), but I don't know whether you had the scraper before that.

25. otabdeveloper2 ◴[08 Feb 19 19:21 UTC] No.19117717{5}[source]▶

>>19117030 #

Newsflash: Facebook's customers are advertisers, not the honeypot victims stuck to their site.

replies(1): >>19118286 #

26. folkhack ◴[08 Feb 19 19:32 UTC] No.19117833{3}[source]▶

>>19116828 #

Yep yep - if the text isn't distorted I can rip it from an image within minutes using pre-built OCR libraries. If the text is distorted there are full-blown API-driven services for solving CAPTCHAs and the like.

replies(1): >>19119468 #

27. robryan ◴[08 Feb 19 19:58 UTC] No.19118064[source]▶

>>19117402 #

Extensions like uBlock having public block lists makes it even easier for Facebook to write something that breaks the filter.

replies(1): >>19118386 #

28. tomatocracy ◴[08 Feb 19 20:22 UTC] No.19118286{6}[source]▶

>>19117717 #

But there’s a trade-off. Fewer, better-placed ads mean fewer people likely to block them and more people likely to actually look at them instead of skimming past. It’s also likely that the people who were blocking ads are less likely to click on them and thus (for some advertisers) are lower-value impressions anyway.

Revenue is (simplistically) the product of impressions and value per impression. It’s not therefore immediately obvious that moves like this actually do increase revenue for them, especially over the long term since one potential side effect of doing this is giving more ammunition to the ‘delete Facebook’ crowd.
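
To make the arithmetic above concrete, here is a toy calculation. Every number in it is invented purely for illustration and says nothing about Facebook's real figures. The function name `revenue_micro_usd` is hypothetical, and values are kept as integer micro-dollars to avoid floating-point noise.

```python
def revenue_micro_usd(segments):
    """Sum impressions * value over (impressions, micro-USD per impression) pairs."""
    return sum(impressions * value for impressions, value in segments)


# All numbers below are invented for illustration only.
# Before: 900,000 impressions from non-blocking users at $0.01 each.
before = revenue_micro_usd([(900_000, 10_000)])

# After circumventing blockers: 100,000 extra impressions from former blockers,
# assumed to be worth only $0.002 each, while 3% of the original impressions
# are assumed lost because some users get annoyed and leave.
after = revenue_micro_usd([(873_000, 10_000), (100_000, 2_000)])

print(before / 1_000_000, after / 1_000_000)
```

Under these made-up assumptions the extra impressions do not cover the loss. Different assumptions flip the result, which is why the effect on revenue is not obvious.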

29. IshKebab ◴[08 Feb 19 20:35 UTC] No.19118386{3}[source]▶

>>19118064 #

Yeah, and Facebook has far more resources than uBlock to fight the cat-and-mouse game. So it is effective.

30. TimothyBJacobs ◴[08 Feb 19 22:56 UTC] No.19119468{4}[source]▶

>>19117833 #

It seems like a time span of minutes wouldn't be fast enough for on-the-fly blocking of sponsored posts?

replies(1): >>19119497 #

31. folkhack ◴[08 Feb 19 23:03 UTC] No.19119497{5}[source]▶

>>19119468 #

Oh yea - I guess I had a specific use case in mind when I said that =)

What I meant is that I can hammer out some Node/Python that will grab an image w/text and put it through OCR for character extraction. "Programming" it would take me a handful of minutes.

replies(1): >>19123648 #

32. taf2 ◴[09 Feb 19 00:56 UTC] No.19119999{4}[source]▶

>>19116931 #

And then you’d switch to an img tag

33. cauk ◴[09 Feb 19 07:55 UTC] No.19121248[source]▶

>>19116099 (TP) #

Hey there. Regarding the semi-automatic “look for these things in the post, and however you find them, replicate for others” approach: I’m new to scrapers. Do you have a good resource you could link on this? Thanks!

34. TimothyBJacobs ◴[09 Feb 19 18:05 UTC] No.19123648{6}[source]▶

>>19119497 #

Ahh, that makes sense!
