
454 points positiveblue | 1 comments
matt-p ◴[] No.45066473[source]
I have zero issue with AI agents, if there's a real user behind there somewhere. I DO have a major issue with my sites being crawled extremely aggressively by offenders including Meta, Perplexity and OpenAI. It's really annoying to realise that we're tying up several CPU cores on AI crawling. Less than on real users and Google et al.
rikafurude21 ◴[] No.45066689[source]
Cloudflare is trying to gatekeep which user-initiated agents are allowed to read website content, which is of course very different from scraping websites for training data. Meta, Perplexity and OpenAI all have some kind of web-search functionality where they send requests based on user prompts. These are not requests that get saved to train the next LLM. Cloudflare intentionally blurs the line between both types of bots, and in that sense it is a bait-and-switch where they claim to 'protect content creators' by being the man in the middle and collecting tolls from LLM providers to pay creators (and of course take a cut for themselves). It's not something they do because it would be fair; there's a financial motivation.
jsheard ◴[] No.45066719[source]
> Cloudflare is trying to gatekeep which user-initiated agents are allowed to read website content, which is of course very different from scraping websites for training data.

That distinction requires you to take companies that benefit from amassing as much training data as possible at their word when they pinky swear that a particular request is totally not for training, promise.

rikafurude21 ◴[] No.45066796[source]
If you look at the current LLM landscape, the frontier is not being pushed by labs throwing more data at their models. Most improvements come from using more compute and improving training methods. In that sense I don't have to take their word; more data just hasn't been the problem for a long time.
1. jsheard ◴[] No.45066942[source]
Just today Anthropic announced that they will begin using their users' data for training by default. They still want fresh data so badly that they risked alienating their own paying customers to get some more. They're at the stage of pulling the copper out of the walls to feed their crippling data addiction.