
597 points | classichasclass | 9 comments
bob1029 ◴[] No.45011628[source]
I think a lot of really smart people are letting themselves get taken for a ride by the web scraping thing. Unless the bot activity is legitimately hammering your site and causing issues (not saying this isn't happening in some cases), then this mostly amounts to an ideological game of capture the flag. The difference being that you'll never find their flag. The only thing you win by playing is lost time.

The best way to mitigate the load from diffuse, unidentifiable, grey area participants is to have a fast and well engineered web product. This is good news, because your actual human customers would really enjoy this too.

replies(7): >>45011652 #>>45011830 #>>45011850 #>>45012424 #>>45012462 #>>45015038 #>>45015451 #
1. sidewndr46 ◴[] No.45015451[source]
I don't think you have any idea how serious the issue is. I was, loosely speaking, in charge of application-level performance for a web app at one job. I was asked to make the backend as fast as possible at dumping the last byte of HTML back to the user.

The problem I ran into was that performance was bimodal. We had one group of users that was lightning fast and the rest were far slower. I chased down a few obvious outliers (that one forum thread with 11,000 replies that some guy leaves up in a browser tab all the time, etc.), but it was still bimodal. Eventually I just changed the application-level code to report known bots as one performance trace and everything else as another.
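Roughly what I mean (a sketch, not our actual code; the bot list and the metrics call here are made up):

    # Tag each request's timing as "known_bot" or "human" based on the
    # User-Agent, so the two populations show up as separate traces.
    import re
    import time

    KNOWN_BOT_PATTERN = re.compile(
        r"googlebot|bingbot|yandex|baiduspider|ahrefsbot|semrushbot|crawler|spider",
        re.IGNORECASE,
    )

    def classify_request(user_agent: str) -> str:
        """Return the performance-trace bucket for a request."""
        if user_agent and KNOWN_BOT_PATTERN.search(user_agent):
            return "known_bot"
        return "human"

    def record_metric(name: str, value: float) -> None:
        # Stand-in for whatever metrics client you actually use (StatsD, Prometheus, etc.)
        print(f"{name}: {value:.1f} ms")

    def timed_handler(user_agent: str, handler, *args, **kwargs):
        """Wrap a request handler and record elapsed time under the right bucket."""
        bucket = classify_request(user_agent)
        start = time.monotonic()
        try:
            return handler(*args, **kwargs)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            record_metric(f"response_time.{bucket}", elapsed_ms)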

60% of all requests were known bots. This doesn't even count the random-ass bot that some guy started up at an ISP. Yes, this really happened. We were a paying customer of a company who decided to just conduct a DoS attack on us at 2 PM one afternoon. It took down the website.

Not only that, the bots effectively always got a cached response, since they all seemed to love to hammer the same pages. Users never got a cached response, since LRU eviction meant the pages with actual discussions among real users were always evicted first. There were bots that would rescrape every page they had ever seen every few minutes. There were bots that would just increase their throughput until the backend app started to slow down.
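You can reproduce that eviction dynamic with a toy simulation (nothing like our production cache, numbers invented): plain LRU, bots hammering a small set of old pages, users reading a long tail of live threads. The bots end up with a near-perfect hit rate and the users get almost none.

    import random
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity: int):
            self.capacity = capacity
            self.store = OrderedDict()

        def get(self, key: str) -> bool:
            """Return True on a hit and mark the entry most-recently-used."""
            if key in self.store:
                self.store.move_to_end(key)
                return True
            return False

        def put(self, key: str) -> None:
            self.store[key] = "rendered page"
            self.store.move_to_end(key)
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)  # evict least-recently-used

    random.seed(0)
    cache = LRUCache(capacity=100)
    hits = {"bot": 0, "user": 0}
    requests = {"bot": 0, "user": 0}

    for _ in range(100_000):
        if random.random() < 0.6:  # ~60% of traffic is bots...
            who, page = "bot", f"/old/{random.randrange(50)}"        # ...hammering ~50 pages
        else:
            who, page = "user", f"/live/{random.randrange(5_000)}"   # diverse live threads
        requests[who] += 1
        if cache.get(page):
            hits[who] += 1
        else:
            cache.put(page)

    for who in ("bot", "user"):
        print(who, f"hit rate: {hits[who] / requests[who]:.0%}")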

There were bots that would run the JavaScript, for whatever insane reason, and start emulating users submitting forms, etc.

You're probably thinking "but you got to appear in a search index so it is worth it". Not really. Google's bot was one of the few well behaved ones and would even slow scraping if it saw a spike in the response times. Also, we had an employee who was responsible for categorizing our organic search performance: while we had a huge amount of traffic from organic search, something like 40% of it went to just one URL.

In retrospect, I'm now aware that a bunch of this was early-stage AI companies scraping the internet for data.

replies(2): >>45018235 #>>45019598 #
2. korkybuchek ◴[] No.45018235[source]
> Google's bot was one of the few well behaved ones and would even slow scraping if it saw a spike in the response times.

Google has invested decades of core research with an army of PhDs into its crawler, particularly around figuring out when to recrawl a page. For example (a bit dated, but you can follow the refs if you're interested):

https://www.niss.org/sites/default/files/Tassone_interface6....

3. hinkley ◴[] No.45019598[source]
One of our customers was paying a third party to hit our website with garbage traffic a couple times a week to make sure we were rejecting malformed requests. I was forever tripping over these in Splunk while trying to look for legitimate problems.

We also had a period where we generated bad URLs for a week or two, and the worst part was that I think they were on links marked nofollow. Three years later there was a bot still trying to load those pages.

And if you 429 Google’s bots they will reduce your pagerank. That’s straight up extortion from a company that also sells cloud services.

I don’t agree with you about Google being well behaved. They were crawling nofollow links, and they’re also terrible if you’re serving content on vanity URLs: any throttling they do on one domain name just hits two more.

replies(3): >>45019722 #>>45020189 #>>45022080 #
4. xp84 ◴[] No.45019722[source]
> they were on links marked nofollow

If I'm understanding you correctly, you had an indexable page that contained links with the nofollow attribute on the <a> tags.

It's possible some other mechanism got those URLs into the crawler, like a person visiting them. Nofollow on the link won't prevent the URL from being crawled or indexed. If you're returning a 404 for them, you ought to be able to use webmaster tools, or whatever it's called now, to request removal.
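I'd also serve those URLs with an explicit noindex signal rather than relying on nofollow alone. A rough sketch of what I mean (the URL pattern is obviously made up; X-Robots-Tag is the real header):

    import re

    DEAD_URL_PATTERN = re.compile(r"^/search/interactive/")  # hypothetical bad-URL prefix

    def handle_request(path: str):
        """Return (status, headers, body) for a request path."""
        if DEAD_URL_PATTERN.match(path):
            headers = {
                "Content-Type": "text/plain",
                # Even if something still links here, tell crawlers not to
                # index whatever they fetch; nofollow on the link alone
                # doesn't stop crawling or indexing.
                "X-Robots-Tag": "noindex",
            }
            return 410, headers, b"Gone"
        return 200, {"Content-Type": "text/html"}, b"<html>...</html>"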

replies(1): >>45019830 #
5. hinkley ◴[] No.45019830{3}[source]
The dumbest part is that we’d known about this for a long time, and then one day someone discovered we’d implemented a feature toggle to remove those URLs but it had never been turned on, despite it being announced that it had.

They were meant to be interactive URLs on search pages. Someone implemented them, I think, trying to make accessibility work, but the bots were slamming us. We also weren’t setting canonical URLs right on the destination page, so they got scraped again every scan cycle. So at least three dumb things were going on, but the sorts of mistakes that normal people could make.
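The canonical part is at least an easy fix in principle. A sketch, assuming the variants only differed from the real page by query string and fragment (the example URL is invented):

    from urllib.parse import urlsplit, urlunsplit

    def canonical_url(url: str) -> str:
        """Drop query string and fragment so URL variants collapse to one page."""
        parts = urlsplit(url)
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

    def canonical_link_tag(url: str) -> str:
        return f'<link rel="canonical" href="{canonical_url(url)}">'

    # e.g. canonical_link_tag("https://example.com/thread/42?highlight=foo#post-7")
    # -> '<link rel="canonical" href="https://example.com/thread/42">'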

6. sidewndr46 ◴[] No.45020189[source]
I guess my position is that it was comparatively well behaved? There were bots that would blitz the website at full speed for absolutely no reason. You just scraped this page 27 seconds ago, do you really need to check it for an update again? Also, it hasn't had a new post in the past 3 years, is it really going to become lively again?
7. dilyevsky ◴[] No.45022080[source]
> And if you 429 Google’s bots they will reduce your pagerank. That’s straight up extortion from a company that also sells cloud services.

Googlebot uses different IP space from GCP.
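You can verify that yourself: the documented check is a reverse DNS lookup on the client IP, confirming the hostname is under googlebot.com or google.com, then a forward lookup to confirm it maps back to the same IP. A sketch:

    import socket

    def is_verified_googlebot(ip: str) -> bool:
        """Reverse-DNS the IP, check the Google domain, then forward-confirm."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except OSError:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
        except OSError:
            return False
        return ip in forward_ips

    # usage: is_verified_googlebot(request_ip) before trusting a "Googlebot" User-Agent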

replies(1): >>45030250 #
8. hinkley ◴[] No.45030250{3}[source]
They use the same bank accounts and stock ticker. This is basically a non sequitur.

The point is they’re getting paid to run cloud servers to keep their bots happy and to not drop your website to page six.

replies(1): >>45030793 #
9. dilyevsky ◴[] No.45030793{4}[source]
I thought the argument was that if you run on GCP you can masquerade as Googlebot and not get a 429, which is obviously false. Instead it looks like the argument is more of the tinfoil-hat variety.

BTW, you don't get dropped if you issue temporary 429s, only when it's consistent and/or the site is broken. That is well documented. And wtf else are they supposed to do if you don't allow them to crawl it and it goes stale?
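i.e. something like this on the server side (a sketch; the load check and the threshold are invented): a temporary 429 with Retry-After for crawler traffic only when you're actually overloaded, not a blanket block.

    import os

    MAX_LOAD_PER_CPU = 2.0  # hypothetical threshold

    def backend_overloaded() -> bool:
        load1, _, _ = os.getloadavg()
        return load1 / (os.cpu_count() or 1) > MAX_LOAD_PER_CPU

    def maybe_throttle_crawler(is_crawler: bool):
        """Return (status, headers) to short-circuit, or None to serve normally."""
        if is_crawler and backend_overloaded():
            # A temporary 429 tells a well-behaved crawler to back off and retry;
            # it only hurts you if it becomes the permanent answer.
            return 429, {"Retry-After": "600"}
        return None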