←back to thread

255 points ColinWright | 1 comments | | HN request time: 0.314s | source
Show context
renegat0x0 ◴[] No.45777261[source]
Most web scrapers, even if illegal, are for... business. So they scrape amazon, or shops. So yeah. Most unwanted traffic is from big tech, or bad actors trying to sniff vulnerabilities.

I know a thing or two about web scraping.

There are sometimes status codes 404 for protection, so that you skip this site, so my crawler tries, as a hammer, several of faster crawling methods (curlcffi).

Zip bombs are also not for me. Reading header content length is enough to not read the page/file. I provide byte limit to check if response is not too big for me. For other cases reading timeout is enough.

Oh, and did you know that requests timeout is not really timeout a timeout for page read? So server can spoonfeed you bytes, one after another, and there will be no timeout.

That is why I created my own crawling system to mitigate these problems, and have one consistent mean of running selenium.

https://github.com/rumca-js/crawler-buddy

Based on library

https://github.com/rumca-js/webtoolkit

replies(4): >>45777917 #>>45778938 #>>45779662 #>>45781282 #
1. 1vuio0pswjnm7 ◴[] No.45781282[source]
Is there a difference between "scraping" and "crawling"