
454 points | positiveblue | 4 comments
TIPSIO ◴[] No.45066555[source]
Everyone loves the dream of a free for all and open web.

But the reality is: how can someone small protect their blog or content from AI training bots? Are they supposed to blindly trust that a crawler declaring itself an Agent rather than a Training bot is telling the truth, and is super duper respecting robots.txt? Get real...

Or fine, suppose they do respect robots.txt, but then buy data that may or may not have been laundered through liability layers as "licensed data"?

Unless you're Reddit, X, Google, or Meta, with a scary unlimited-budget legal team, you have no power.

Great video: https://www.youtube.com/shorts/M0QyOp7zqcY

replies(37): >>45066600 #>>45066626 #>>45066827 #>>45066906 #>>45066945 #>>45066976 #>>45066979 #>>45067024 #>>45067058 #>>45067180 #>>45067399 #>>45067434 #>>45067570 #>>45067621 #>>45067750 #>>45067890 #>>45067955 #>>45068022 #>>45068044 #>>45068075 #>>45068077 #>>45068166 #>>45068329 #>>45068436 #>>45068551 #>>45068588 #>>45069623 #>>45070279 #>>45070690 #>>45071600 #>>45071816 #>>45075075 #>>45075398 #>>45077464 #>>45077583 #>>45080415 #>>45101938 #
arjie ◴[] No.45069623[source]
The dream is real, man. If you want open content on the Internet, there's never been a better time. My blog is open to all, machine or man, and it's hosted on my home server next to me. I don't see why anyone would bother trying to distinguish humans from AI. A human hitting your website too much is no different from an AI hitting your website too much.

I have a robots.txt that tries to help bots not get stuck in loops, but if they want to, they're welcome to. Let the web be open. Slurp up my stuff if you want to.
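For what it's worth, a loop-avoiding robots.txt in that spirit might look something like this (the paths are hypothetical examples; Crawl-delay is a de facto extension that not every bot honors):

```txt
User-agent: *
# Hypothetical paths where crawlers tend to loop forever
# (e.g. per-day calendar pages or per-revision diffs):
Disallow: /calendar/
Disallow: /diff/
# Non-standard but widely recognized politeness hint:
Crawl-delay: 10
```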

Amazonbot seems to love visiting my site, and it is always welcome.

replies(3): >>45069684 #>>45069787 #>>45070894 #
1. Twirrim ◴[] No.45070894[source]
The only bot that bugs the crap out of me is Anthropic's. They're the reason I set up a labyrinth using iocaine (https://iocaine.madhouse-project.org/). Their bot was absurdly aggressive, particularly with retries.

It's probably trivial in the whole scheme of things, but I love that Anthropic spent months making about 10 rps against my stupid blog, getting Markov-chain responses generated from the text of Moby Dick. (Looks like they haven't crawled my site for about a fortnight now.)
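iocaine's own implementation aside, the basic trick, serving statistically plausible gibberish built from a seed text, can be sketched in a few lines of Python (a toy version, not what iocaine actually does):

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    # Map each `order`-word prefix to the list of words seen after it.
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return dict(chain)

def generate(chain, length=50, order=2, seed=None):
    # Random-walk the chain: locally plausible nonsense, cheap to serve.
    rng = random.Random(seed)
    out = list(rng.choice(list(chain)))
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:  # dead end: this prefix only occurs at the text's end
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

Feed it the full text of Moby Dick once at startup and every crawler request gets fresh, unique-looking "content" for almost no CPU, which is exactly what makes a tarpit like this so cheap to run.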

replies(1): >>45071621 #
2. giardini ◴[] No.45071621[source]
No wonder Anthropic isn't working well! The "Moby Dicked" explanation of the state of AI!

But seriously, why must someone scrape even a significant part of the public Internet to develop an AI? Is it believed that missing some text will cripple the model?

Isn't there some sort of "law of diminishing returns" where, once some percentage of coverage is reached, further scraping is not cost-effective?

replies(1): >>45072504 #
3. tsimionescu ◴[] No.45072504[source]
On the contrary, AI training techniques require gigantic amounts of data to do anything, and there is no upper limit whatsoever: the more relevant data you have to train on, the better your model will be, period.

In fact, the biggest reason LLM scaling looks unlikely to continue is that current LLMs have already been trained on virtually every piece of human text we have access to today. So, without large amounts of new training data, the only way they'll scale further is through new discoveries in training efficiency, and there is no way to put a predictable timeline on that.
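This "more data is always better, but with diminishing returns" shape has been measured empirically. As a rough illustration, the Chinchilla scaling-law fit (Hoffmann et al., 2022) models loss as E + A/N^alpha + B/D^beta for N parameters and D training tokens; the constants below are their published fits, used here purely for illustration:

```python
def chinchilla_loss(n_params, n_tokens):
    # Fitted constants from Hoffmann et al. (2022), "Training
    # Compute-Optimal Large Language Models"; illustrative only.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Each doubling of data still lowers loss, but by less each time:
for d in (1e12, 2e12, 4e12):
    print(f"{d:.0e} tokens -> loss {chinchilla_loss(70e9, d):.4f}")
```

Under this fit, more tokens always help, so labs are never "done" collecting data, but the marginal value of each additional token keeps shrinking, which is the diminishing-returns effect the parent comment asks about.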

replies(1): >>45075504 #
4. ijk ◴[] No.45075504{3}[source]
Ironically, scaling limits and the evidence that quality vastly outweighs quantity suggest that all that web data is much less useful than buying and scanning a book. Most work with the Common Crawl data, for example, has ended up focusing on filtering out vast amounts of it as mostly useless for training purposes.

There was a hot minute in 2023 when it looked like we could just scale data and compute to the moon. Shockingly, it turns out there are limits to that approach.