
255 points ColinWright | 1 comment
renegat0x0 No.45777261
Most web scrapers, even the illegal ones, are run for business: they scrape Amazon or other shops. So yes, most unwanted traffic comes from big tech or from bad actors probing for vulnerabilities.

I know a thing or two about web scraping.

Some sites return a 404 status code as protection, hoping you will skip them, so my crawler hammers through several faster crawling methods in turn (e.g. curl_cffi).
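
For illustration, a rough sketch of that kind of fallback chain: plain requests first, then curl_cffi impersonating a browser. The function name, ordering, and timeouts are made up here, not taken from crawler-buddy.

    import requests
    from curl_cffi import requests as curl_requests  # pip install curl_cffi

    def fetch_with_fallback(url):
        # First attempt: plain requests (fast, but easy for sites to fingerprint).
        try:
            resp = requests.get(url, timeout=(5, 20))
            if resp.status_code != 404:
                return resp
        except requests.RequestException:
            pass
        # Fallback: curl_cffi impersonating a real browser's TLS/HTTP fingerprint,
        # for sites that answer generic clients with a protective 404.
        return curl_requests.get(url, impersonate="chrome", timeout=20)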

Zip bombs are not a problem for me either. Reading the Content-Length header is enough to skip a page/file, and I enforce a byte limit in case the response turns out to be too big for me. For the remaining cases a read timeout is enough.
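
Something along these lines, assuming requests with streaming and a made-up 5 MB cap:

    import requests

    MAX_BYTES = 5 * 1024 * 1024  # illustrative cap

    def fetch_limited(url):
        with requests.get(url, stream=True, timeout=(5, 20)) as resp:
            # Reject early if the server declares a body larger than the cap.
            declared = resp.headers.get("Content-Length")
            if declared and int(declared) > MAX_BYTES:
                return None
            # Content-Length can be missing or lie, so also cap while reading.
            body = b""
            for chunk in resp.iter_content(chunk_size=64 * 1024):
                body += chunk
                if len(body) > MAX_BYTES:
                    return None
            return body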

Oh, and did you know that the requests timeout is not really a timeout for reading the whole page? It resets every time data arrives, so a server can spoon-feed you bytes one after another and the timeout will never fire.
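
A sketch of one way around that, enforcing a wall-clock deadline while streaming (the numbers are arbitrary):

    import time
    import requests

    def fetch_with_deadline(url, total_seconds=30):
        # requests' read timeout only covers the gap between received bytes,
        # so a server trickling one byte at a time never trips it.
        # Enforce a total wall-clock budget instead.
        deadline = time.monotonic() + total_seconds
        body = b""
        with requests.get(url, stream=True, timeout=(5, 10)) as resp:
            for chunk in resp.iter_content(chunk_size=8192):
                body += chunk
                if time.monotonic() > deadline:
                    raise TimeoutError("response exceeded total time budget")
        return body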

That is why I created my own crawling system to mitigate these problems and to have one consistent means of running Selenium.

https://github.com/rumca-js/crawler-buddy

It is based on this library:

https://github.com/rumca-js/webtoolkit

replies(4): >>45777917 #>>45778938 #>>45779662 #>>45781282 #
Mars008 No.45778938
Looks like it's time for in-browser scrapers. From the server's side they will be indistinguishable from real users. With an AI driver they can even pass the human-verification tests.
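
Roughly this, driving a real browser so the traffic carries a genuine browser fingerprint. A minimal Selenium sketch only; note that headless mode can itself still be detected by some sites.

    from selenium import webdriver

    def fetch_in_browser(url):
        # A real Chrome instance sends genuine TLS/header fingerprints and
        # executes JavaScript, unlike a bare HTTP library.
        options = webdriver.ChromeOptions()
        options.add_argument("--headless=new")
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            return driver.page_source  # HTML after client-side rendering
        finally:
            driver.quit()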
replies(3): >>45779480 #>>45779901 #>>45780069 #
bartread No.45780069
Not a new idea. For years now, on the occasions I’ve needed to scrape, I’ve used a set of ViolentMonkey scripts. I’ve even considered creating an extension, but have never really needed it enough to do the extra work.

But this is why lots of sites implement captchas and other mechanisms to detect, frustrate, or trap automated activity - because plenty of bots run in browsers too.