
255 points ColinWright | 7 comments | | HN request time: 0.924s | source | bottom
renegat0x0 ◴[] No.45777261[source]
Most web scrapers, even the illegal ones, exist for... business. They scrape Amazon, or shops. So yeah, most unwanted traffic comes from big tech, or from bad actors trying to sniff out vulnerabilities.

I know a thing or two about web scraping.

Some sites return a 404 status code as protection, hoping you will skip them, so my crawler falls back, hammer-style, through several faster crawling methods (curl_cffi).
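A minimal sketch of such a fallback, assuming the third-party `requests` and `curl_cffi` packages (the function name and the single-step fallback are illustrative, not the actual crawler-buddy code):

```python
import requests

def fetch_with_fallback(url):
    """Try a plain HTTP client first; on a (possibly fake) 404,
    retry with a browser-impersonating client."""
    resp = requests.get(url, timeout=10)
    if resp.status_code == 404:
        # Some sites serve fake 404s to non-browser clients; curl_cffi
        # can impersonate a real browser's TLS fingerprint.
        from curl_cffi import requests as curl_requests  # third-party
        resp = curl_requests.get(url, impersonate="chrome", timeout=10)
    return resp
```

A real crawler would chain more methods (plain client, impersonating client, headless browser) in increasing order of cost.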

Zip bombs are also not a problem for me. Reading the Content-Length header is enough to decide not to read the page/file, and I enforce a byte limit to check that the response is not too big for me. For other cases a read timeout is enough.
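A sketch of that two-layer defense (the byte budget and function name are assumptions for illustration): check the declared Content-Length first, then cap the bytes actually read, since the header can lie.

```python
import requests

MAX_BYTES = 5_000_000  # assumed per-response budget

def fetch_limited(url, max_bytes=MAX_BYTES):
    """Fetch a URL but refuse oversized (or zip-bomb-style) responses."""
    with requests.get(url, stream=True, timeout=10) as resp:
        # First line of defense: the declared Content-Length header.
        declared = resp.headers.get("Content-Length")
        if declared is not None and int(declared) > max_bytes:
            return None  # declared too big: skip without reading the body
        # Second line: cap the bytes actually read, since the header
        # can be missing or dishonest.
        body = b""
        for chunk in resp.iter_content(chunk_size=65536):
            body += chunk
            if len(body) > max_bytes:
                return None  # server sent more than the budget allows
        return body
```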

Oh, and did you know that the requests timeout is not really a timeout for reading the whole page? It only bounds the connect time and the gap between received bytes, so a server can spoonfeed you bytes, one after another, and the timeout will never fire.
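The usual workaround is to stream the response and enforce a wall-clock deadline yourself; a sketch (function name and limits are assumptions):

```python
import time
import requests

def fetch_with_deadline(url, total_seconds=30, chunk_size=8192):
    """requests' timeout bounds the connect time and the gap between
    bytes, not the whole transfer; enforce a total deadline ourselves."""
    deadline = time.monotonic() + total_seconds
    body = b""
    with requests.get(url, stream=True, timeout=(5, 5)) as resp:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            body += chunk
            if time.monotonic() > deadline:
                # A spoonfeeding server keeps the per-read timeout happy,
                # but cannot get past this wall-clock check.
                raise TimeoutError(f"read exceeded {total_seconds}s total")
    return body
```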

That is why I created my own crawling system to mitigate these problems, and to have one consistent means of running Selenium.

https://github.com/rumca-js/crawler-buddy

Based on the library:

https://github.com/rumca-js/webtoolkit

replies(4): >>45777917 #>>45778938 #>>45779662 #>>45781282 #
1. Mars008 ◴[] No.45778938[source]
Looks like it's time for in-browser scrapers. They will be indistinguishable from real users on the server side. With an AI driver they can even pass human-verification tests.
replies(3): >>45779480 #>>45779901 #>>45780069 #
2. overfeed ◴[] No.45779480[source]
> Looks like it's time for in-browser scrapers.

If scrapers were as well-behaved as humans, website operators wouldn't bother to block them[1]. It's the abuse that motivates the animus and the action. As the fine article spelled out, scrapers are greedy in many ways, one of which is trying to slurp down as many URLs as possible without wasting bytes. Not enough people know about Common Crawl, or know how to write multithreaded scrapers that keep high utilization across domains without suffocating any single one. If your scraper is just a URL FIFO or stack in a loop, you're DoSing one domain at a time.

1. The most successful scrapers avoid standing out in any way
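The alternative to a FIFO-in-a-loop can be sketched as a frontier that rotates across domains with a per-domain cooldown (the class name, delay, and data structures here are illustrative assumptions, not any particular crawler's implementation):

```python
import heapq
import time
from urllib.parse import urlparse

class PoliteFrontier:
    """Sketch of a URL frontier that interleaves domains, so no single
    host is hammered, instead of draining one domain at a time."""

    def __init__(self, per_domain_delay=2.0):
        self.delay = per_domain_delay
        self.queues = {}  # domain -> pending URLs for that domain
        self.ready = []   # min-heap of (next_allowed_time, domain)

    def add(self, url):
        domain = urlparse(url).netloc
        if domain not in self.queues:
            self.queues[domain] = []
            heapq.heappush(self.ready, (0.0, domain))  # eligible now
        self.queues[domain].append(url)

    def next_url(self):
        """Next URL whose domain's cooldown has expired, else None."""
        if not self.ready:
            return None
        next_time, domain = self.ready[0]
        if time.monotonic() < next_time:
            return None  # even the soonest domain is still cooling down
        heapq.heappop(self.ready)
        url = self.queues[domain].pop(0)
        if self.queues[domain]:
            # Re-schedule the domain after its politeness delay.
            heapq.heappush(self.ready,
                           (time.monotonic() + self.delay, domain))
        else:
            del self.queues[domain]
        return url
```

Worker threads would call `next_url()` and sleep briefly when it returns None; utilization stays high as long as there are many domains in flight, while each individual host sees at most one request per delay window.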

replies(1): >>45779556 #
3. Mars008 ◴[] No.45779556[source]
The question is: who runs them? There are only a few big companies, like MS, Google, OpenAI, and Anthropic. But from the posts here it looks like hordes of buggy scrapers are run by enthusiasts.
replies(2): >>45780092 #>>45780978 #
4. eur0pa ◴[] No.45779901[source]
you mean OpenAI Atlas?
5. bartread ◴[] No.45780069[source]
Not a new idea. For years now, on the occasions I’ve needed to scrape, I’ve used a set of ViolentMonkey scripts. I’ve even considered creating an extension, but have never really needed it enough to do the extra work.

But this is why lots of sites implement captchas and other mechanisms to detect, frustrate, or trap automated activity - because plenty of bots run in browsers too.

6. iamacyborg ◴[] No.45780092{3}[source]
Lots of “data” companies out there that want to sell you scraped data sets.
7. luckylion ◴[] No.45780978{3}[source]
Ad companies (even the small ones), "Brand Protection" companies, IP lawyers looking for images that were used without a license, brand-marketing companies, and, where it matters, also your competitors, etc. etc.