
597 points classichasclass | 8 comments
Show context
bob1029 ◴[] No.45011628[source]
I think a lot of really smart people are letting themselves get taken for a ride by the web scraping thing. Unless the bot activity is legitimately hammering your site and causing issues (not saying this isn't happening in some cases), then this mostly amounts to an ideological game of capture the flag. The difference being that you'll never find their flag. The only thing you win by playing is lost time.

The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.

replies(7): >>45011652 #>>45011830 #>>45011850 #>>45012424 #>>45012462 #>>45015038 #>>45015451 #
phito ◴[] No.45011652[source]
My friend has a small public gitea instance, only used by him and a few friends. He's getting thousands of requests an hour from bots. I'm sorry, but even if it doesn't impact his service, at the very least it feels like harassment.
replies(7): >>45011694 #>>45011816 #>>45011999 #>>45013533 #>>45013955 #>>45014807 #>>45025114 #
1. bob1029 ◴[] No.45011816[source]
Thousands of requests per hour? So, something like 1-3 per second?

If this is actually impacting perceived QoS then I think a gitea bug report would be justified. Clearly there's been some kind of a performance regression.

Just looking at the logs seems to be an infohazard for many people. I don't see why you'd want to inspect the septic tanks of the internet unless absolutely necessary.

replies(5): >>45014694 #>>45014705 #>>45015142 #>>45016540 #>>45019745 #
2. zeta0134 ◴[] No.45014694[source]
One of the most common issues we helped customers solve when I worked in web hosting was low disk alerts, usually because the log rotation had failed. Often the content of those logs was exactly this sort of nonsense and had spiked recently due to a scraper. The sheer size of the logs can absolutely be a problem on a smaller server, which is more and more common now that the inexpensive server is often a VM or a container.
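As a rough back-of-the-envelope illustration of that point (my numbers, not the poster's): even modest scraper traffic adds up in an access log once rotation stops working. The 250-bytes-per-line average below is an assumption.

    # Sketch: approximate access-log growth for a given request rate.
    # The 250 bytes/line average is an assumed, illustrative figure.
    def log_growth_gib_per_day(requests_per_second: float, bytes_per_line: int = 250) -> float:
        return requests_per_second * bytes_per_line * 86_400 / (1024 ** 3)

    for rps in (1, 10, 100):
        print(f"{rps:>4} req/s -> {log_growth_gib_per_day(rps):.2f} GiB/day")
    # ~0.02, ~0.20, and ~2.01 GiB/day respectively: enough to fill a small
    # VM's disk within weeks once logrotate silently fails.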
3. tedivm ◴[] No.45014705[source]
Depending on what they're actually pulling down this can get pretty expensive. Bandwidth isn't free.
4. dkiebd ◴[] No.45015142[source]
I love the snark here. I work at a hosting company and the only customers who have issues with crawlers are those who have stupidly slow webpages. It’s hard to have any sympathy for them.
replies(1): >>45018800 #
5. p3rls ◴[] No.45016540[source]
I usually get 10 a second hitting the same content pages 10 times an hour. Is that not what you guys are getting from Googlebot?
6. egypturnash ◴[] No.45018800[source]
Isn't it part of your job to help them fix that?
replies(1): >>45019092 #
7. 0x457 ◴[] No.45019092{3}[source]
How? They are hosting company, not a webshop.
8. hinkley ◴[] No.45019745[source]
We were only getting 60% of our traffic from bots at my last place because we throttled a bunch of sketchy bots to around 50 simultaneous requests, which was on the order of 100 requests per second. Our customers were paying for SEO, so the bot traffic was a substantial cost of doing business. But as someone tasked with decreasing cluster size, I was forever jealous of the large portion of the cluster that wasn't being seen by humans.
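For what it's worth, here is a minimal sketch of that kind of per-bot concurrency cap in a threaded Python service. The limit of 50 comes from the comment above; the bot names and everything else are illustrative assumptions, and in practice this is often done at the proxy layer (nginx, HAProxy) rather than in application code.

    import threading

    SKETCHY_BOTS = ("mj12bot", "ahrefsbot", "semrushbot")  # illustrative list
    MAX_IN_FLIGHT_PER_BOT = 50  # the ~50 simultaneous requests mentioned above

    _slots = {bot: threading.BoundedSemaphore(MAX_IN_FLIGHT_PER_BOT) for bot in SKETCHY_BOTS}

    def classify(user_agent: str):
        ua = user_agent.lower()
        return next((bot for bot in SKETCHY_BOTS if bot in ua), None)

    def handle_request(user_agent: str, serve):
        """Serve normally, but make throttled bots wait for a free slot."""
        bot = classify(user_agent)
        if bot is None:
            return serve()    # humans and unthrottled traffic pass through
        with _slots[bot]:     # blocks once 50 of this bot's requests are in flight
            return serve()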