707 points namukang | 1 comments
menthe ◴[] No.29261972[source]
As a web scraper, I'll say that because he is hooking into the browser like a debugger / remotely controlled browser, just like Puppeteer would, he is instantly detected by the Cloudflare, PerimeterX, and Datadome bot management solutions, and will get consistently banned on page reload on literally any site that cares about bots.

He'd be better off running some JavaScript on the page instead (à la Tampermonkey, but it can be done really nicely with some server-served TypeScript) to scrape the pages stealthily and perform actions.

replies(4): >>29262248 #>>29262765 #>>29262768 #>>29263957 #
Siira ◴[] No.29263957[source]
Can you provide any guides on this? How will the server run the JS on their page automatically?
replies(1): >>29264238 #
menthe ◴[] No.29264238[source]
The easiest approach is to use an extension like Tampermonkey, which can load (and reload) “scripts” from a web server. There are a few project templates on GitHub with TypeScript + webpack (e.g. https://github.com/xiaomingTang/template-ts-tampermonkey). You can automate with any of your favorite TypeScript libs, from the comfort of your IDE, with hot reload included. Pretty nifty, and projects can quickly get pretty big that way! I usually have one “script” that has broad permissions (e.g. all sites) with some form of router at the root of the code that branches to the different sites to evaluate.
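
To make the "one script, router at the root" idea concrete, here is a rough sketch of what such a router could look like. It's not taken from the linked template; the site names, the `Route` type, and the `route` function are all made up for illustration, and the handlers just return a string where real ones would scrape the page.

```typescript
// ==UserScript==
// @name   site-router-sketch
// @match  *://*/*
// @grant  none
// ==/UserScript==
// The @match above gives the script broad permissions (all sites);
// the router below decides which site-specific code actually runs.

// Hypothetical handler shape: takes the page URL, returns a label.
// A real handler would read the DOM and perform actions instead.
type Handler = (url: URL) => string;

interface Route {
  hostPattern: RegExp; // tested against the hostname only
  handler: Handler;
}

// Hypothetical sites, purely for illustration.
const routes: Route[] = [
  { hostPattern: /(^|\.)example\.com$/, handler: (url) => `example-com:${url.pathname}` },
  { hostPattern: /(^|\.)example\.org$/, handler: (url) => `example-org:${url.pathname}` },
];

// Pick the first route whose pattern matches the hostname.
// Returns null when no site-specific handler is registered.
function route(href: string, table: Route[] = routes): string | null {
  const url = new URL(href);
  const match = table.find((r) => r.hostPattern.test(url.hostname));
  return match ? match.handler(url) : null;
}

// Entry point: only runs where a page location exists (i.e. in the browser).
const pageLocation = (globalThis as unknown as { location?: { href: string } }).location;
if (pageLocation) {
  console.log(route(pageLocation.href));
}
```

Keeping `route` a plain function of the URL means the branching can be tested outside the browser, and adding a site is just another entry in `routes`.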
replies(1): >>29264865 #
1. Siira ◴[] No.29264865[source]
Thanks!

From what I understand, this is only useful for doing scrapes manually by launching the target URL in a GUI Chrome instance? Or can this somehow work on a headless server? (I don't understand how one can automate this.)