454 points positiveblue | 24 comments
1. matt-p ◴[] No.45066473[source]
I have zero issue with AI agents if there's a real user behind them somewhere. I DO have a major issue with my sites being crawled extremely aggressively by offenders including Meta, Perplexity and OpenAI - it's really annoying realising that we're tying up several CPU cores on AI crawling (less than on real users and Google et al., but still).
replies(6): >>45066494 #>>45066689 #>>45066754 #>>45067321 #>>45067530 #>>45068488 #
2. Operyl ◴[] No.45066494[source]
They're getting to the point of 200-300 RPS for some of my smaller marketing sites, hallucinating URLs like crazy. It's fucking insane.
replies(2): >>45066518 #>>45066583 #
3. matt-p ◴[] No.45066518[source]
I'm seeing around the same, as a fairly constant base load. Even more annoying when it's hitting auth middleware over and over again, somehow expecting a different answer.
4. palmfacehn ◴[] No.45066583[source]
You'd think they would have an interest in developing reasonable crawling infrastructure, like Google, Bing or Yandex. Instead they go all in on hosts with no metering. All of the search majors reduce their crawl rate as request times increase.

On one hand these companies announce themselves as sophisticated, futuristic and highly valued; on the other hand we see rampant incompetence, to the point that webmasters everywhere are debating the best course of action.

replies(3): >>45066630 #>>45071958 #>>45081606 #
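A minimal sketch of the adaptive back-off described above - crawl at a base rate and slow down as the target's response times climb. The thresholds and the plain-urllib fetch are illustrative assumptions, not any real crawler's behaviour:

    import time
    import urllib.request

    BASE_DELAY = 1.0      # seconds between requests when the host looks healthy
    MAX_DELAY = 60.0      # never back off further than this
    SLOW_THRESHOLD = 2.0  # responses slower than this mean the host is struggling

    def polite_crawl(urls):
        delay = BASE_DELAY
        for url in urls:
            start = time.monotonic()
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    resp.read()
            except Exception:
                delay = min(delay * 2, MAX_DELAY)  # errors count as distress: back off
                time.sleep(delay)
                continue
            elapsed = time.monotonic() - start
            if elapsed > SLOW_THRESHOLD:
                delay = min(delay * 2, MAX_DELAY)      # host is slow: halve the crawl rate
            else:
                delay = max(delay * 0.75, BASE_DELAY)  # host is fine: creep back up
            time.sleep(delay)

The back-off logic fits in a couple of dozen lines, which is rather the thread's point: there is no technical reason a well-funded crawler could not do this.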
5. matt-p ◴[] No.45066630{3}[source]
Honestly it's just a tragedy of the commons. Why put the effort in when you don't have to identify yourself? Just crawl, and if you get blocked, move the job to another server.
replies(1): >>45066686 #
6. palmfacehn ◴[] No.45066686{4}[source]
At this point I'm blocking several ASNs. Most are cloud provider related, but there are also some repurposed consumer ASNs coming out of the PRC. Long term, this devalues the offerings of those cloud providers, as prospective customers will not be able to use them for crawling.
replies(1): >>45091584 #
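A rough sketch of the ASN-blocking approach mentioned above, assuming a local MaxMind GeoLite2-ASN database and the geoip2 Python package; the blocklist entries are placeholders, not a recommendation:

    import geoip2.database
    from geoip2.errors import AddressNotFoundError

    # Placeholder AS numbers - substitute the networks you actually see abusing you.
    BLOCKED_ASNS = {64501, 64512}

    # Assumes the GeoLite2-ASN database has been downloaded locally.
    reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")

    def is_blocked(client_ip: str) -> bool:
        """Return True if the client's IP belongs to a blocked autonomous system."""
        try:
            asn = reader.asn(client_ip).autonomous_system_number
        except AddressNotFoundError:
            return False  # unknown networks are allowed through
        return asn in BLOCKED_ASNS

In practice this kind of check usually lives at the edge (firewall, load balancer, or CDN rules) rather than in application code, but the lookup is the same either way.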
7. rikafurude21 ◴[] No.45066689[source]
Cloudflare is trying to gatekeep which user-initiated agents are allowed to read website content, which is of course very different from scraping websites for training data. Meta, Perplexity and OpenAI all have some kind of web-search functionality where they send requests based on user prompts. These are not requests that get saved to train the next LLM. Cloudflare intentionally blurs the line between both types of bots, and in that sense it is a bait-and-switch where they claim to 'protect content creators' by being the man in the middle and collecting tolls from LLM providers to pay creators (and of course take a cut for themselves). It's not something they do because it would be fair; there's a financial motivation.
replies(1): >>45066719 #
8. jsheard ◴[] No.45066719[source]
> Cloudflare is trying to gatekeep which user-initated agents are allowed to read website content, which is of course very different from scraping website for training data.

That distinction requires you to take companies which benefit from amassing as much training data as possible at their word when they pinky swear that a particular request is totally not for training, promise.

replies(1): >>45066796 #
9. asats ◴[] No.45066754[source]
I have some personal apps online and I had to turn the Cloudflare AI bot protection on because one of them had 1.6 TB of data accessed by bots in the last month - 1.3 million requests per day, just non-stop hammering with no limits.
replies(1): >>45075507 #
10. rikafurude21 ◴[] No.45066796{3}[source]
If you look at the current LLM landscape, the frontier is not being pushed by labs throwing more data at their models - most improvements come from using more compute and improving training methods. In that sense I don't have to take their word for it; more data just hasn't been the problem for a long time.
replies(2): >>45066909 #>>45066942 #
11. ◴[] No.45066909{4}[source]
12. jsheard ◴[] No.45066942{4}[source]
Just today Anthropic announced that they will begin using their users' data for training by default - they still want fresh data so badly that they risked alienating their own paying customers to get some more. They're at the stage of pulling the copper out of the walls to feed their crippling data addiction.
13. chatmasta ◴[] No.45067321[source]
I wonder how many CPU cycles are spent because of AI companies scraping content. This factor isn't usually considered when estimating “environmental impact of AI.” What’s the overhead of this on top of inference and training?

To be fair, an accurate measurement would need to consider how many of those CPU cycles would be spent by the human user who is driving the bot. From that perspective, maybe the scrapers can “make up for it” by crawling efficiently, i.e. avoiding loading tracker scripts, images, etc. unless they're necessary to answer the query. That way they'll still burn CPU cycles, but at least fewer than a human user with a headful browser instance.
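One way to read the “crawl efficiently” point is request interception in a headless browser: fetch the page, but refuse to download images, media, fonts and stylesheets at all. A sketch using Playwright's Python API; the skip list is an illustrative choice, not a standard:

    from playwright.sync_api import sync_playwright

    # Heavy resource types a text-only agent rarely needs.
    SKIP_TYPES = {"image", "media", "font", "stylesheet"}

    def fetch_text_only(url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # Abort requests for non-essential resource types before they leave the machine.
            page.route("**/*", lambda route: route.abort()
                       if route.request.resource_type in SKIP_TYPES
                       else route.continue_())
            page.goto(url)
            text = page.inner_text("body")
            browser.close()
            return text

Blocking tracker scripts specifically would need a domain blocklist rather than a resource-type check, and whether any of this is actually lighter than a human visit depends on the site - but it at least spares the origin the asset traffic a full browser would generate.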

14. swed420 ◴[] No.45067530[source]
> I DO have a major issue with my sites being crawled extremely aggressively by offenders including Meta, Perplexity and OpenAI

Gee, if only we had, like, one central archive of the internet. We could even call it the internet archive.

Then, all these AI companies could interface directly with that single entity on terms that are agreeable.

replies(2): >>45067816 #>>45074266 #
15. teitoklien ◴[] No.45067816[source]
You think they care about that? They'd still crawl like this just in case, which is why they don't rate limit at the moment.
replies(1): >>45078306 #
16. zzo38computer ◴[] No.45068488[source]
Same with me. If there is a real user behind the use of the AI agents and they do not make excessive accesses in order to do what they are trying to do, then I do not have a complaint (the use of AI agents is not something I intend, but that is up to whoever is using them and not up to me). I do not like the excessive crawling.

However, what is more important to me than AI agents is that someone might want to download single files with curl, or use browsers such as Lynx, etc., and this should work.

17. esperent ◴[] No.45071958{3}[source]
I suspect it's because they're dealing with such unbelievable levels of bandwidth and compute for training and inference that the amount required to blast the entire web like this barely registers with them.
18. gck1 ◴[] No.45074266[source]
The Internet Archive is missing enormous chunks of the internet, though. And I don't mean weird parts of the internet, just regional stuff.

Not even news articles from the top 10 news websites in my country are usually indexed there.

replies(1): >>45078310 #
19. immibis ◴[] No.45075507[source]
So, under the free traffic tier of any decent provider.
replies(1): >>45076485 #
20. asats ◴[] No.45076485{3}[source]
The pages are not static and require computation to serve, and there's more than one app on that same bare metal server, so it was negatively affecting the performance of a lot of my other stuff.

If I couldn't easily cut off the majority of that bot volume I probably would've shut down the app entirely.

21. swed420 ◴[] No.45078306{3}[source]
It would of course need to be legally enforced somehow, with penalties high enough to hurt even the big players.
22. swed420 ◴[] No.45078310{3}[source]
So then make a better one. I was only referencing it as a general concept that can be improved upon as desired.
23. whatevaa ◴[] No.45081606{3}[source]
They vibe code their crawlers.
24. account42 ◴[] No.45091584{5}[source]
This is the correct solution, and it is how network abuse was dealt with before the latest fad: network operators can either police their own users or be blocked/throttled wholesale. Nothing more is needed except the willingness to apply those measures to networks that are "too big to fail".