454 points | positiveblue | 8 comments
matt-p ◴[] No.45066473[source]
I have zero issue with AI agents if there's a real user behind them somewhere. I DO have a major issue with my sites being crawled extremely aggressively by offenders including Meta, Perplexity and OpenAI - it's really annoying realising that we're tying up several CPU cores on AI crawling, more than we spend on real users and Google et al.
replies(6): >>45066494 #>>45066689 #>>45066754 #>>45067321 #>>45067530 #>>45068488 #
1. Operyl ◴[] No.45066494[source]
They're getting to the point of 200-300 RPS for some of my smaller marketing sites, hallucinating URLs like crazy. It's fucking insane.
replies(2): >>45066518 #>>45066583 #
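
A crude way to surface the "hallucinating URLs" pattern described above is to watch each client's 404 ratio over a sliding window. This is only a sketch, not anything the commenters say they run: keying on client IP, the thresholds, and the window size are all invented for illustration.

    # Sketch: flag clients whose requests are mostly 404s (hallucinated URLs).
    # Thresholds and the choice of client IP as the key are assumptions.
    from collections import defaultdict, deque
    import time

    WINDOW_SECONDS = 60
    MIN_REQUESTS = 50      # don't judge a client on a handful of requests
    MAX_404_RATIO = 0.5    # more than half the window being 404s looks bot-like

    class NotFoundTracker:
        def __init__(self):
            # client IP -> deque of (timestamp, was_404)
            self.events = defaultdict(deque)

        def record(self, client_ip: str, status: int) -> bool:
            """Record one request; return True if the client looks like a bad crawler."""
            now = time.monotonic()
            q = self.events[client_ip]
            q.append((now, status == 404))
            # Drop events that have aged out of the sliding window.
            while q and now - q[0][0] > WINDOW_SECONDS:
                q.popleft()
            if len(q) < MIN_REQUESTS:
                return False
            ratio = sum(is_404 for _, is_404 in q) / len(q)
            return ratio > MAX_404_RATIO

    tracker = NotFoundTracker()
    # In a real middleware you'd call this per request, e.g.:
    # if tracker.record(ip, response_status): return_429_or_block(ip)

Anything the tracker flags could feed a 429 response or a firewall drop; MIN_REQUESTS is there so ordinary visitors who hit a few dead links never trip it.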
2. matt-p ◴[] No.45066518[source]
I'm seeing around the same, as a fairly constant base load. It's even more annoying when it's hitting the auth middleware constantly, over and over again, somehow expecting a different answer.
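
For the repeated auth-middleware hits, one defensive sketch (again, not something the commenter describes running) is to back off clients that keep collecting 401s. A minimal WSGI-style middleware with invented thresholds:

    # Sketch: WSGI middleware that 429s clients who keep failing auth.
    # FAIL_LIMIT, WINDOW_SECONDS and keying on REMOTE_ADDR are assumptions.
    import time

    FAIL_LIMIT = 20        # failed auth attempts allowed per window
    WINDOW_SECONDS = 60

    class AuthBackoff:
        def __init__(self, app):
            self.app = app
            self.failures = {}  # ip -> list of recent failure timestamps

        def __call__(self, environ, start_response):
            ip = environ.get("REMOTE_ADDR", "?")
            now = time.monotonic()
            recent = [t for t in self.failures.get(ip, []) if now - t < WINDOW_SECONDS]
            if len(recent) >= FAIL_LIMIT:
                self.failures[ip] = recent
                start_response("429 Too Many Requests", [("Retry-After", "60")])
                return [b"rate limited\n"]

            status_seen = []
            def capture(status, headers, exc_info=None):
                status_seen.append(status)
                return start_response(status, headers, exc_info)

            body = self.app(environ, capture)
            if status_seen and status_seen[0].startswith("401"):
                recent.append(now)
            self.failures[ip] = recent
            return body

The point is simply that a client re-asking the same unauthorized question hundreds of times per minute gets progressively cheaper to serve.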
3. palmfacehn ◴[] No.45066583[source]
You'd think they'd have an interest in developing reasonable crawling infrastructure like Google, Bing or Yandex have. Instead they hammer hosts with no metering whatsoever. All of the search majors reduce their crawl rate as request times increase.

On one hand these companies announce themselves as sophisticated, futuristic and highly valued; on the other we see rampant incompetence, to the point that webmasters everywhere are debating the best course of action.

replies(3): >>45066630 #>>45071958 #>>45081606 #
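
The adaptive behavior attributed to the search majors - slowing down as the server slows down - looks roughly like this toy fetch loop. The URLs and tuning constants are invented, and a real polite crawler would also honor robots.txt, which this omits:

    # Sketch: a fetch loop that backs off as server response times grow,
    # roughly the crawl-rate behavior attributed to the big search engines.
    import time
    import urllib.request

    BASE_DELAY = 1.0       # seconds between requests when the server is fast
    SLOW_THRESHOLD = 0.5   # responses slower than this trigger backoff
    MAX_DELAY = 60.0

    def polite_crawl(urls):
        delay = BASE_DELAY
        for url in urls:
            start = time.monotonic()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    resp.read()
            except OSError:
                delay = min(delay * 2, MAX_DELAY)  # errors: back off hard
            else:
                elapsed = time.monotonic() - start
                if elapsed > SLOW_THRESHOLD:
                    delay = min(delay * 2, MAX_DELAY)   # server straining: slow down
                else:
                    delay = max(BASE_DELAY, delay / 2)  # server healthy: speed back up
            time.sleep(delay)

    # polite_crawl(["https://example.com/a", "https://example.com/b"])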
4. matt-p ◴[] No.45066630[source]
Honestly it's just tragedy of the commons. Why put the effort in when you don't have to identify yourself? Just crawl, and if you get blocked, move the job to another server.
replies(1): >>45066686 #
5. palmfacehn ◴[] No.45066686{3}[source]
At this point I'm blocking several ASNs. Most are cloud-provider related, but there are also some repurposed consumer ASNs coming out of the PRC. Long term this devalues those cloud providers' offerings, as prospective customers won't be able to use them for crawling.
replies(1): >>45091584 #
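
For the ASN-blocking approach, here is a sketch of turning an AS number into concrete firewall input. It queries RIPEstat's public announced-prefixes endpoint (believed accurate, but verify the response shape and rate limits yourself) and prints nginx deny rules; the AS numbers shown are documentation placeholders, not the networks the commenter means:

    # Sketch: expand an ASN into its announced prefixes, emit nginx deny rules.
    # The RIPEstat endpoint and JSON shape should be double-checked before use.
    import json
    import urllib.request

    def announced_prefixes(asn: str):
        url = ("https://stat.ripe.net/data/announced-prefixes/data.json"
               f"?resource={asn}")
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = json.load(resp)
        return [p["prefix"] for p in data["data"]["prefixes"]]

    if __name__ == "__main__":
        # Placeholder AS numbers from the documentation range.
        for asn in ["AS64496", "AS64497"]:
            for prefix in announced_prefixes(asn):
                print(f"deny {prefix};")  # paste into an nginx server/http block

The same prefix list could just as easily feed an nftables or iptables ipset instead of nginx.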
6. esperent ◴[] No.45071958[source]
I suspect it's because they're dealing with such unbelievable levels of bandwidth and compute for training and inference that the amount required to blast the entire web like this barely registers for them.
7. whatevaa ◴[] No.45081606[source]
They vibe code their crawlers.
8. account42 ◴[] No.45091584{4}[source]
This is the correct solution, and it's how network abuse was dealt with before the latest fad: network operators either police their own users or get blocked/throttled wholesale. Nothing more is needed except the willingness to apply those measures to networks that are "too big to fail".