
597 points classichasclass | 1 comment | source
bob1029 ◴[] No.45011628[source]
I think a lot of really smart people are letting themselves get taken for a ride by the web scraping thing. Unless the bot activity is legitimately hammering your site and causing issues (not saying this isn't happening in some cases), then this mostly amounts to an ideological game of capture the flag. The difference being that you'll never find their flag. The only thing you win by playing is lost time.

The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.

replies(7): >>45011652 #>>45011830 #>>45011850 #>>45012424 #>>45012462 #>>45015038 #>>45015451 #
sidewndr46 ◴[] No.45015451[source]
I don't think you have any idea how serious the issue is. I was, loosely speaking, in charge of application-level performance for a web app at one job. I was asked to make the backend as fast as possible at getting the last byte of HTML back to the user.

The problem I ran into was that performance was bimodal. We had one group of users that was lightning fast and the rest were far slower. I chased down a few obvious outliers (that one forum thread with 11,000 replies that some guy leaves open in a browser tab all the time, etc.), but it was still bimodal. Eventually I just changed the application-level code to report known bots as one performance trace and everything else as another.
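That trace split can be sketched roughly like this (a hypothetical illustration, not the actual application code; the user-agent substring list is illustrative and far from exhaustive):

```python
# Illustrative sketch: bucket each request into one of two performance
# traces based on whether its user agent matches a known-bot marker.
# The marker list here is an assumption for the example, not a real
# production list.
KNOWN_BOT_MARKERS = (
    "googlebot", "bingbot", "ahrefsbot", "semrushbot", "crawler", "spider",
)

def trace_bucket(user_agent: str) -> str:
    """Return the performance-trace name for a request's user agent."""
    ua = (user_agent or "").lower()
    if any(marker in ua for marker in KNOWN_BOT_MARKERS):
        return "known-bot"
    return "everything-else"
```

Plotting latency percentiles per bucket is then enough to see whether the fast mode is humans or bots.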

60% of all requests were known bots. And that doesn't even count the random-ass bot some guy started up at an ISP. Yes, this really happened: we were a paying customer of a company that decided to just conduct a DoS attack on us at 2 PM one afternoon. It took down the website.

Not only that, the bots effectively always got a cached response, since they all seemed to love hammering the same pages. Users never got a cached response, since LRU cache eviction meant the actual discussions with real users were always evicted. There were bots that would rescrape every page they had ever seen every few minutes. There were bots that would just keep increasing their throughput until the backend started to slow down.
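A toy simulation makes the eviction problem concrete (the cache size and traffic pattern here are invented for illustration, not measurements from that system): bots re-hitting a few hot pages keep those entries fresh, while each real user's discussion page is a one-off that gets evicted before anyone sees it again.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache that counts hits per requester class."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = {"bot": 0, "user": 0}
        self.requests = {"bot": 0, "user": 0}

    def get(self, page: str, who: str) -> bool:
        self.requests[who] += 1
        hit = page in self.store
        if hit:
            self.store.move_to_end(page)  # mark as most recently used
            self.hits[who] += 1
        else:
            self.store[page] = True
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)  # evict least recently used
        return hit

cache = LRUCache(capacity=4)
# Bots hammer the same three hot pages every round; each user visits
# a fresh discussion page that is never requested again.
for round_ in range(100):
    for page in ("hot-1", "hot-2", "hot-3"):
        cache.get(page, "bot")
    cache.get(f"discussion-{round_}", "user")

# After the first round, every bot request is a hit; every user
# request is a miss, because the user pages are always the LRU entries.
```

Frequency-aware policies (e.g. segmented LRU or LFU variants) mitigate this, but plain LRU hands the whole cache to whoever requests most often.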

There were bots that would run the javascript for whatever insane reason and start emulating users submitting forms, etc.

You're probably thinking "but you got to appear in a search index, so it was worth it". Not really. Google's bot was one of the few well-behaved ones and would even slow its scraping if it saw a spike in response times. We also had an employee responsible for categorizing our organic search performance: while we had a huge amount of traffic from organic search, something like 40% of it went to just one URL.

Retrospectively I'm now aware that a bunch of this was early stage AI companies scraping the internet for data.

replies(2): >>45018235 #>>45019598 #
hinkley ◴[] No.45019598[source]
One of our customers was paying a third party to hit our website with garbage traffic a couple of times a week to make sure we were rejecting malformed requests. I was forever tripping over those requests in Splunk while trying to look for legitimate problems.

We also had a period where we generated bad URLs for a week or two, and the worst part was that I think the links were marked nofollow. Three years later there was a bot still trying to load those pages.

And if you return 429 to Google's bots, they will reduce your PageRank. That's straight-up extortion from a company that also sells cloud services.

I don't agree with you about Google being well behaved. They were crawling links marked nofollow, and they're also terrible if you're serving content on vanity URLs: any throttling they do on one domain name just hits two more.

replies(3): >>45019722 #>>45020189 #>>45022080 #
sidewndr46 ◴[] No.45020189[source]
I guess my position is that it was comparatively well behaved? There were bots that would hit the website at full speed for absolutely no reason. You just scraped this page 27 seconds ago; do you really need to check it for an update again? And it hasn't had a new post in three years; is it really going to become lively again?