
707 points by namukang | 2 comments
menthe No.29261972
As a web scraper, I'll say that because he is hooking into the browser like a debugger / remote-controlled browser, just as Puppeteer would, he is instantly detected by bot-management solutions such as Cloudflare, PerimeterX, and Datadome, and will get consistently banned on page reload by literally any site that cares about bots.

He'd be better off running some JavaScript on the page instead (à la Tampermonkey, though it can be done really nicely with some server-served TypeScript) to scrape pages and perform actions stealthily.

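The approach above can be sketched as a Tampermonkey-style script: extraction runs in the page's own JS context, so it carries the real browser's fingerprint instead of a debugger connection. The selectors and the collection endpoint below are hypothetical; the extraction step is kept as a pure function over anything exposing `querySelectorAll`, so it works on the real `document` or on a test stub.

```javascript
// Pure extraction step (hypothetical '.product' markup), separated from I/O
// so it can run against the real document or a stub in tests.
function extractItems(root) {
  return Array.from(root.querySelectorAll('.product')).map((el) => ({
    title: el.querySelector('.title').textContent.trim(),
    price: el.querySelector('.price').textContent.trim(),
  }));
}

// In the userscript itself, ship results back to your own server, e.g.:
// fetch('https://example.com/collect', {        // hypothetical endpoint
//   method: 'POST',
//   body: JSON.stringify(extractItems(document)),
// });
```

Because the script runs as ordinary page JavaScript, there is no CDP/WebDriver session for the bot-management vendor to fingerprint.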
colordrops No.29262768
How exactly do these services detect Puppeteer?
shaicoleman No.29263823
They run JS tests such as the one linked in the peer comment: https://bot.sannysoft.com/
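The kind of checks a page like bot.sannysoft.com runs can be sketched as a handful of probes against `navigator` and `window`. The property names below (`navigator.webdriver`, `navigator.plugins`, `window.chrome`, etc.) are real browser APIs; the scoring is illustrative, not any vendor's actual logic.

```javascript
// Each probe inspects a property that headless/automated browsers tend to
// get wrong. Takes navigator-like and window-like objects so it is testable.
function botSignals(nav, win) {
  return {
    webdriver: nav.webdriver === true,                       // set by WebDriver automation
    noPlugins: (nav.plugins ? nav.plugins.length : 0) === 0, // headless Chrome ships none
    noLanguages: !nav.languages || nav.languages.length === 0,
    headlessUA: /HeadlessChrome/.test(nav.userAgent || ''),
    noChromeObject: win.chrome === undefined,                // present in real desktop Chrome
  };
}

function looksLikeBot(nav, win) {
  return Object.values(botSignals(nav, win)).some(Boolean);
}
```

Stealth plugins for Puppeteer work by patching exactly these kinds of properties before page scripts can read them.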
menthe No.29264160
Not only that - enterprise bot-management protections will run behavioral identification (e.g. how your mouse moves → AI model → bot yes/no), TCP stack fingerprinting, fingerprinting of other device sensors where available (e.g. gyroscope), TLS ClientHello fingerprinting (e.g. JA3, see https://github.com/salesforce/ja3), etc. Lots of very unique info in the Scraping Enthusiasts Discord, where lots of pro scrapers hang out.
zdware No.29278397
I was on a project that used Google's reCAPTCHA Enterprise v3 (passive mode, with all that "AI" jazz) and it was hot garbage. We tested against it using a simple Selenium script, and even though `navigator.webdriver` was true, it still rated us "likely a human" 9 times out of 10.