
707 points namukang | 13 comments
1. menthe ◴[] No.29261972[source]
As a web scraper, I'll say that because he is hooking into the browser like a debugger / remotely controlled browser, just as Puppeteer would, he is instantly detected by the Cloudflare, PerimeterX, and Datadome bot-management solutions, and will get consistently banned on page reload by literally any site that cares about bots.

He'd be better off running some JavaScript on the page instead (à la Tampermonkey, though it can be done really nicely with some server-served TypeScript) to scrape the pages stealthily and perform actions.

replies(4): >>29262248 #>>29262765 #>>29262768 #>>29263957 #
2. dkthehuman ◴[] No.29262248[source]
This is simply not accurate, and you can easily test the claim. Just try running Browserflow on the sites you're thinking of and you can see for yourself whether it's instantly banned or not.
replies(1): >>29262689 #
3. hmcm55 ◴[] No.29262689[source]
Can confirm, as someone who spent 2 years building software to beat recaptchas/bot management. I literally told DK that there was no way that Browserflow could solve the problems I spent years fighting against. I was wrong... it was humbling.
4. bdcravens ◴[] No.29262765[source]
Run it against https://bot.sannysoft.com/ to see how it stacks up

Most anti-Puppeteer tech analyzes the state of various browser JavaScript objects, and if you run Puppeteer in headful mode with plugins like https://www.npmjs.com/package/puppeteer-extra-plugin-stealth you'll bypass most detection.
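As a rough illustration of what those in-page checks look for, here is a hypothetical scoring function (not any vendor's actual logic) probing a few of the `navigator` properties that headless Puppeteer gets wrong by default — the same kind of probes pages like bot.sannysoft.com run:

```javascript
// Hypothetical detector sketch. `nav` and `win` are plain objects standing in
// for the browser's `navigator` and `window` so this runs in Node.
function looksAutomated(nav, win) {
  const signals = [];
  if (nav.webdriver) signals.push("navigator.webdriver is true");
  if (!nav.plugins || nav.plugins.length === 0) signals.push("no plugins");
  if (!nav.languages || nav.languages.length === 0) signals.push("no languages");
  if (!win.chrome) signals.push("window.chrome missing");
  return { automated: signals.length > 0, signals };
}

// Vanilla headless Puppeteer trips several probes at once:
const headless = looksAutomated(
  { webdriver: true, plugins: [], languages: [] },
  {}
);

// A stealth-patched headful profile passes this (naive) check:
const patched = looksAutomated(
  { webdriver: undefined, plugins: [{}, {}], languages: ["en-US", "en"] },
  { chrome: { runtime: {} } }
);
console.log(headless.automated, patched.automated); // → true false
```

The stealth plugin's job is essentially to patch each of these signals (and many subtler ones) back to what a normal Chrome profile reports.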

replies(2): >>29263653 #>>29264183 #
5. colordrops ◴[] No.29262768[source]
How exactly do these services detect Puppeteer?
replies(1): >>29263823 #
6. shaicoleman ◴[] No.29263653[source]
Disabling headless mode and adding the following command-line option: --disable-blink-features=AutomationControlled

is enough to pass all the tests above with cuprite (Ruby), without needing any extra plugins
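The same idea carries over to Puppeteer (the comment above uses cuprite/Ruby); a sketch of the equivalent launch options, assuming a standard `puppeteer.launch` call:

```javascript
// Sketch: equivalent Puppeteer launch options for the cuprite tip above.
// --disable-blink-features=AutomationControlled stops Blink from marking the
// session as automated, which is what flips navigator.webdriver to true.
const launchOptions = {
  headless: false, // "disabling headless" from the comment above
  args: ["--disable-blink-features=AutomationControlled"],
};

// Usage (requires puppeteer to be installed):
//   const browser = await require("puppeteer").launch(launchOptions);
console.log(launchOptions.args.join(" "));
```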

7. shaicoleman ◴[] No.29263823[source]
They run JS tests such as the one linked in the peer comment: https://bot.sannysoft.com/
replies(1): >>29264160 #
8. Siira ◴[] No.29263957[source]
Can you provide any guides on this? How will the server run the JS on their page automatically?
replies(1): >>29264238 #
9. menthe ◴[] No.29264160{3}[source]
Not only that: enterprise bot-management protections will run behavioral identification (e.g. how your mouse moves -> AI -> bot yes/no), TCP stack fingerprinting, other device signals when available (e.g. gyroscope), TLS ClientHello fingerprinting (e.g. see https://github.com/salesforce/ja3), etc. Lots of very unique info in the Scraping Enthusiasts Discord, where lots of pro scrapers hang out.
replies(1): >>29278397 #
10. menthe ◴[] No.29264183[source]
This is not true: run playwright/puppeteer with puppeteer-stealth + headful + plugins + fonts + OpenGL fingerprinting workarounds and you'll still be caught 100% of the time by PerimeterX and Datadome if the site's sensitivity is set to anything but “Low”.

Talk with berstend (author of puppeteer-extra/stealth), join their Discord, or read some of his GitHub comments, and you will quickly get confirmation that none of those methods are good enough in 2021 (even with residential proxies or CGNAT 5G proxies).

11. menthe ◴[] No.29264238[source]
The easiest approach is to use an extension like Tampermonkey, which can load (and reload) “scripts” from a web server. There are a few project templates on GitHub with TypeScript + Webpack (e.g. https://github.com/xiaomingTang/template-ts-tampermonkey). You can automate with any of your favorite TypeScript libs, from the comfort of your IDE, with hot reload included. Pretty nifty, and projects can quickly get pretty big that way! I usually have one “script” with broad permissions (e.g. all sites) and some form of router at the root of the code that branches to the different sites to evaluate.
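A minimal sketch of that setup, assuming the details above: the installed userscript is just a loader whose @require pulls the real bundle from a local dev server (the URL, port, and script name are made-up examples), so rebuilding the TypeScript bundle updates the script without reinstalling it, and a small router branches per site:

```javascript
// Loader userscript sketch: all real logic lives in bundle.js, served by a
// local dev server (localhost:8080 is a hypothetical example).
const loaderUserscript = `
// ==UserScript==
// @name         scraper-router
// @match        *://*/*
// @grant        none
// @require      http://localhost:8080/bundle.js
// ==/UserScript==
`.trim();

// Inside bundle.js, a tiny router branches on the current site, e.g.:
function route(hostname, handlers) {
  const handler = handlers[hostname] || (() => "no handler");
  return handler();
}

console.log(route("example.com", { "example.com": () => "scrape example" }));
```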
replies(1): >>29264865 #
12. Siira ◴[] No.29264865{3}[source]
Thanks!

From what I understand, this is only useful for doing scrapes manually by launching the target URL in a GUI Chrome instance? Or can this somehow work on a headless server? (I don't understand how one can automate this.)

13. zdware ◴[] No.29278397{4}[source]
I was on a project that used Google's reCAPTCHA Enterprise v3 (passive mode, with all that "AI" jazz) and it was hot garbage. We tested against it using a simple Selenium script, and even though `navigator.webdriver` was true, it still returned 9/10 "likely a human".