707 points namukang | 3 comments
menthe ◴[] No.29261972[source]
As a web scraper, I'll say that because he is hooking into the browser like a debugger or a remotely controlled browser, just as Puppeteer would, he is instantly detected by the bot management solutions from Cloudflare, PerimeterX, and Datadome, and will consistently get banned on page reload by literally any site that cares about bots.

He'd be better off running some JavaScript on the page instead (à la Tampermonkey, though it can be done really nicely with some server-served TypeScript) to scrape the pages stealthily and perform actions.
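
For illustration only, here is a minimal sketch of what such an in-page script might look like. Everything named here (`scrapeLinks`, `DocumentLike`, `fakeDocument`) is hypothetical and not taken from any project in this thread. The structural types exist only so the snippet type-checks and runs outside a browser; in an actual userscript you would presumably pass the page's own `document`.

```typescript
// Minimal structural types so the sketch type-checks without DOM typings.
// Assumption: in a real userscript the browser's `document` fills this role.
interface ElementLike {
  textContent: string | null;
  getAttribute(name: string): string | null;
}

interface DocumentLike {
  querySelectorAll(selector: string): ArrayLike<ElementLike>;
}

interface ScrapedLink {
  text: string;
  href: string;
}

// Hypothetical helper: collect the text and href of every link on the page.
function scrapeLinks(doc: DocumentLike): ScrapedLink[] {
  const anchors = doc.querySelectorAll("a[href]");
  const links: ScrapedLink[] = [];
  for (let i = 0; i < anchors.length; i++) {
    const anchor = anchors[i];
    if (!anchor) continue;
    const href = anchor.getAttribute("href");
    if (href === null) continue; // skip anything without a usable href
    links.push({ text: (anchor.textContent ?? "").trim(), href });
  }
  return links;
}

// Stand-in page for trying the helper outside a browser.
const fakeDocument: DocumentLike = {
  querySelectorAll: () => [
    { textContent: "  Home ", getAttribute: () => "/" },
    { textContent: null, getAttribute: () => null },
  ],
};

const scrapedLinks = scrapeLinks(fakeDocument);
```

Because the code runs inside the page like any other script, no debugger or remote-control channel is attached to the browser, which is the point being made above.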

replies(4): >>29262248 #>>29262765 #>>29262768 #>>29263957 #
1. bdcravens ◴[] No.29262765[source]
Run it against https://bot.sannysoft.com/ to see how it stacks up

Most anti-Puppeteer tech analyzes the state of various browser JavaScript objects; if you run Puppeteer in headful mode with plugins like https://www.npmjs.com/package/puppeteer-extra-plugin-stealth, you'll bypass most detection.
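
To make "analyzes the state of various browser JavaScript objects" a bit more concrete, here is a rough sketch of the kind of check a detection script could run. It is not how any vendor named in this thread actually works. The `BrowserSnapshot` type and `automationSignals` function are hypothetical, and the four signals are just commonly cited ones (`navigator.webdriver` being true, a `HeadlessChrome` user agent, an empty plugin list, an empty language list).

```typescript
// Hypothetical snapshot of properties a detection script might read.
interface BrowserSnapshot {
  webdriver: boolean | undefined; // navigator.webdriver
  userAgent: string;              // navigator.userAgent
  pluginCount: number;            // navigator.plugins.length
  languages: readonly string[];   // navigator.languages
}

// Returns the names of signals that look automated; an empty list means
// nothing obvious was found (which does not prove a human is present).
function automationSignals(s: BrowserSnapshot): string[] {
  const signals: string[] = [];
  if (s.webdriver === true) signals.push("webdriver");
  if (s.userAgent.indexOf("HeadlessChrome") !== -1) signals.push("headless-user-agent");
  if (s.pluginCount === 0) signals.push("no-plugins");
  if (s.languages.length === 0) signals.push("no-languages");
  return signals;
}

// Illustrative values only, not captured from a real browser.
const headlessLooking: BrowserSnapshot = {
  webdriver: true,
  userAgent: "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/96.0.0.0",
  pluginCount: 0,
  languages: ["en-US"],
};

const ordinaryLooking: BrowserSnapshot = {
  webdriver: false,
  userAgent: "Mozilla/5.0 (X11; Linux x86_64) Chrome/96.0.0.0",
  pluginCount: 3,
  languages: ["en-US", "en"],
};

const headlessSignals = automationSignals(headlessLooking);
const ordinarySignals = automationSignals(ordinaryLooking);
```

Stealth plugins aim to make an automated browser report values closer to the second snapshot than the first.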

replies(2): >>29263653 #>>29264183 #
2. shaicoleman ◴[] No.29263653[source]
Disabling headless and adding the command-line option --disable-blink-features=AutomationControlled is enough to pass all the tests above with cuprite (Ruby), without needing any extra plugins.
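
The comment above is about cuprite in Ruby. As a hedged TypeScript equivalent, the same two settings could be expressed as Puppeteer-style launch options. The `LaunchOptionsSketch` type and `headfulLaunchOptions` helper are hypothetical names; only the `headless` and `args` fields and the flag itself come from the tools discussed here.

```typescript
// Mirrors only the two Puppeteer launch options used in this sketch.
interface LaunchOptionsSketch {
  headless: boolean;
  args: string[];
}

// Hypothetical helper: headful mode plus the flag quoted above.
function headfulLaunchOptions(extraArgs: string[] = []): LaunchOptionsSketch {
  return {
    headless: false,
    args: ["--disable-blink-features=AutomationControlled"].concat(extraArgs),
  };
}

// Assumed usage with Puppeteer (not executed here, needs the `puppeteer` package):
//   const browser = await puppeteer.launch(headfulLaunchOptions());
const launchOptions = headfulLaunchOptions(["--window-size=1280,800"]);
```

Whether this alone passes the sannysoft tests under Puppeteer, as reported for cuprite, is something to verify rather than assume.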

3. menthe ◴[] No.29264183[source]
This is not true. Run Playwright/Puppeteer with puppeteer-stealth + headful + plugins + fonts + OpenGL fingerprinting workarounds and you'll still be caught 100% of the time by PerimeterX and Datadome if the site's sensitivity is set to anything but “Low”.

Talk with berstend (author of puppeteer-extra/stealth), join the project's Discord, or read some of his GitHub comments, and you will quickly get confirmation that none of those methods are good enough in 2021 (even with residential proxies or CGNAT 5G proxies).