Most active commenters

buro9(8)
frereubu(3)

Popular/hot comments

>>42551781 #
>>42551536 #
>>42552257 #

←back to thread

AI companies cause most of traffic on forums

(pod.geraspora.de)

1. buro9 ◴[30 Dec 24 17:36 UTC] No.42551470[source]▶

>>42549624 (OP) #

Their appetite cannot be quenched, and there is little to no value in giving them access to the content.

I have data... 7d from a single platform with about 30 forums on this instance.

4.8M hits from Claude 390k from Amazon 261k from Data For SEO 148k from Chat GPT

That Claude one! Wowser.

Bots that match this (which is also the list I block on some other forums that are fully private by default):

I am moving to just blocking them all, it's ridiculous.

Everything on this list got itself there by being abusive (either ignoring robots.txt, or not backing off when latency increased).

replies(9): >>42551536 #>>42551667 #>>42551719 #>>42551781 #>>42551798 #>>42551877 #>>42552584 #>>42552786 #>>42565241 #

2. coldpie ◴[30 Dec 24 17:42 UTC] No.42551536[source]▶

>>42551470 (TP) #

You know, at this point, I wonder if an allowlist would work better.

replies(3): >>42551619 #>>42551671 #>>42552608 #

3. frereubu ◴[30 Dec 24 17:48 UTC] No.42551619[source]▶

>>42551536 #

I love (hate) the idea of a site where you need to send a personal email to the webmaster to be whitelisted.

replies(2): >>42552764 #>>42565862 #

4. pogue ◴[30 Dec 24 17:52 UTC] No.42551667[source]▶

>>42551470 (TP) #

What do you use to block them?

replies(1): >>42551696 #

5. buro9 ◴[30 Dec 24 17:52 UTC] No.42551671[source]▶

>>42551536 #

I have thought about writing such a thing...

1. A proxy that looks at HTTP Headers and TLS cipher choices

2. An allowlist that records which browsers send which headers and selects which ciphers

3. A dynamic loading of the allowlist into the proxy at some given interval

New browser versions or updates to OSs would need the allowlist updating, but I'm not sure it's that inconvenient and could be done via GitHub so people could submit new combinations.

I'd rather just say "I trust real browsers" and dump the rest.

Also I noticed a far simpler block, just block almost every request whose UA claims to be "compatible".

replies(2): >>42552152 #>>42575828 #

6. buro9 ◴[30 Dec 24 17:54 UTC] No.42551696[source]▶

>>42551667 #

Nginx, it's nothing special it's just my load balancer.

if ($http_user_agent ~* (list|of|case|insensitive|things|to|block)) {return 403;}

replies(2): >>42552020 #>>42555075 #

7. Mistletoe ◴[30 Dec 24 17:56 UTC] No.42551719[source]▶

>>42551470 (TP) #

This is a new twist on the Dead Internet Theory I hadn’t thought of.

replies(1): >>42562204 #

8. ai-christianson ◴[30 Dec 24 18:01 UTC] No.42551781[source]▶

>>42551470 (TP) #

Would you consider giving these crawlers access if they paid you?

replies(5): >>42551804 #>>42552121 #>>42552257 #>>42552452 #>>42566610 #

9. vunderba ◴[30 Dec 24 18:03 UTC] No.42551798[source]▶

>>42551470 (TP) #

There's also popular repository that maintains a comprehensive list of LLM and AI related bots to aid in blocking these abusive strip miners.

https://github.com/ai-robots-txt/ai.robots.txt

replies(1): >>42575268 #

10. buro9 ◴[30 Dec 24 18:03 UTC] No.42551804[source]▶

>>42551781 #

At this point, no.

11. Aeolun ◴[30 Dec 24 18:10 UTC] No.42551877[source]▶

>>42551470 (TP) #

You just plain blocking anyone using node from programatically accessing your content with Axios?

replies(2): >>42552002 #>>42594675 #

12. buro9 ◴[30 Dec 24 18:21 UTC] No.42552002[source]▶

>>42551877 #

Apparently yes.

If a more specific UA hasn't been set, and the library doesn't force people to do so, then the library that has been the source of abusive behaviour is blocked.

No loss to me.

13. gs17 ◴[30 Dec 24 18:23 UTC] No.42552020{3}[source]▶

>>42551696 #

From the article:

> If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).

It would be interesting if you had any data about this, since you seem like you would notice who behaves "better" and who tries every trick to get around blocks.

replies(1): >>42564061 #

14. petee ◴[30 Dec 24 18:32 UTC] No.42552121[source]▶

>>42551781 #

Interesting idea, though I doubt they'd ever offer a reasonable amount for it. But doesn't it also change a sites legal stance if you're now selling your users content/data? I think it would also repel a number of users away from your service

15. qazxcvbnmlp ◴[30 Dec 24 18:36 UTC] No.42552152{3}[source]▶

>>42551671 #

Everything on this can be programmatically simulated by a bot with bad intentions. It will be a cat and mouse game of finding behaviors that differentiate between bot and not and patching them.

To truly say “I trust real browsers” requires a signal of integrity of the user and browser such as cryptographic device attestation of the browser. .. which has to be centrally verified. Which is also not great.

replies(1): >>42552259 #

16. nedrocks ◴[30 Dec 24 18:46 UTC] No.42552257[source]▶

>>42551781 #

This is one of the few interesting uses of crypto transactions at reasonable scale in the real world.

replies(3): >>42552580 #>>42553733 #>>42571945 #

17. coldpie ◴[30 Dec 24 18:46 UTC] No.42552259{4}[source]▶

>>42552152 #

> Everything on this can be programmatically simulated by a bot with bad intentions. It will be a cat and mouse game of finding behaviors that differentiate between bot and not and patching them.

Forcing Facebook & Co to play the adversary role still seems like an improvement over the current situation. They're clearly operating illegitimately if they start spoofing real user agents to get around bot blocking capabilities.

replies(1): >>42552775 #

18. ◴[30 Dec 24 19:12 UTC] No.42552452[source]▶

>>42551781 #

19. heavyarms ◴[30 Dec 24 19:28 UTC] No.42552580{3}[source]▶

>>42552257 #

What mechanism would make it possible to enforce non-paywalled, non-authenticated access to public web pages? This is a classic "problem of the commons" type of issue.

The AI companies are signing deals with large media and publishing companies to get access to data without the threat of legal action. But nobody is going to voluntarily make deals with millions of personal blogs, vintage car forums, local book clubs, etc. and setup a micro payment system.

Any attempt to force some kind of micro payment or "prove you are not a robot" system will add a lot of friction for actual users and will be easily circumvented. If you are LinkedIn and you can devote a large portion of your R&D budget on this, you can maybe get it to work. But if you're running a blog on stamp collecting, you probably will not.

20. jprete ◴[30 Dec 24 19:29 UTC] No.42552584[source]▶

>>42551470 (TP) #

I hope this is working out for you; the original article indicates that at least some of these crawlers move to innocuous user agent strings and change IPs if they get blocked or rate-limited.

21. jprete ◴[30 Dec 24 19:32 UTC] No.42552608[source]▶

>>42551536 #

If you mean user-agent-wise, I think real users vary too much to do that.

That could also be a user login, maybe, with per-user rate limits. I expect that bot runners could find a way to break that, but at least it's extra engineering effort on their part, and they may not bother until enough sites force the issue.

22. smolder ◴[30 Dec 24 19:53 UTC] No.42552764{3}[source]▶

>>42551619 #

We just need a browser plugin to auto-email webmasters to request access, and wait for the follow-up "access granted" email. It could be powered by AI.

replies(1): >>42561614 #

23. Terr_ ◴[30 Dec 24 19:54 UTC] No.42552775{5}[source]▶

>>42552259 #

I'm imagining a quixotic terms of service, where "by continuing" any bot access grants the site-owner a perpetual and irrevocable license to use and relicense all data, works, or other products resulting from any use of the crawled content, including but not limited to cases where that content was used in a statistical text generative model.

24. iLoveOncall ◴[30 Dec 24 19:55 UTC] No.42552786[source]▶

>>42551470 (TP) #

4.8M requests sounds huge, but if it's over 7 days and especially split amongst 30 websites, it's only a TPS of 0.26, not exactly very high or even abusive.

The fact that you choose to host 30 websites on the same instance is irrelevant, those AI bots scan websites, not servers.

This has been a recurring pattern I've seen in people complaining about AI bots crawling their website: huge number of requests but actually a low TPS once you dive a bit deeper.

replies(1): >>42552904 #

25. buro9 ◴[30 Dec 24 20:06 UTC] No.42552904[source]▶

>>42552786 #

It's never that smooth.

In fact 2M requests arrived on December 23rd from Claude alone for a single site.

Average 25qps is definitely an issue, these are all long tail dynamic pages.

replies(1): >>42555164 #

26. oblio ◴[30 Dec 24 21:25 UTC] No.42553733{3}[source]▶

>>42552257 #

Use the ex-hype to kill the new hype?

And the ex-hype would probably fail at that, too :-)

27. l1n ◴[31 Dec 24 00:13 UTC] No.42555075{3}[source]▶

>>42551696 #

403 is generally a bad way to get crawlers to go away - https://developers.google.com/search/blog/2023/02/dont-404-m... suggests a 500, 503, or 429 HTTP status code.

replies(2): >>42562172 #>>42573039 #

28. l1n ◴[31 Dec 24 00:30 UTC] No.42555164{3}[source]▶

>>42552904 #

Curious what your robots.txt looked like, if you have a link?

29. ndileas ◴[31 Dec 24 20:32 UTC] No.42561614{4}[source]▶

>>42552764 #

Then someone will require a notarized statement of intent before you can read the recipe blog.

replies(1): >>42565694 #

30. vultour ◴[31 Dec 24 21:57 UTC] No.42562172{4}[source]▶

>>42555075 #

That article describes the exact behaviour you want from the AI crawlers. If you let them know they’re rate limited they’ll just change IP or user agent.

31. Dilettante_ ◴[31 Dec 24 22:03 UTC] No.42562204[source]▶

>>42551719 #

We'll have two entirely separate (dead) internets! One for real hosts who will only get machine users, and one for real users who only get machine content!

Wait, that seems disturbingly conceivable with the way things are going right now. *shudder*

32. Libcat99 ◴[01 Jan 25 04:55 UTC] No.42564061{4}[source]▶

>>42552020 #

Switching to sending wrong, inexpensive data might be preferable to blocking them.

I've used this with voip scanners.

replies(1): >>42564776 #

33. buro9 ◴[01 Jan 25 08:43 UTC] No.42564776{5}[source]▶

>>42564061 #

Oh I did this with the Facebook one and redirected them to a 100MB file of garbage that is part of the Cloudflare speed test... they hit this so many times that it would've been 2PB sent in a matter of hours.

I contacted the network team at Cloudflare to apologise and also to confirm whether Facebook did actually follow the redirect... it's hard for Cloudflare to see 2PB, that kind of number is too small on a global scale when it's occurred over a few hours, but given that it was only a single PoP that would've handled it, then it would've been visible.

It was not visible, which means we can conclude that Facebook were not following redirects, or if they were, they were just queuing it for later and would only hit it once and not multiple times.

replies(1): >>42572442 #

34. EVa5I7bHFq9mnYK ◴[01 Jan 25 10:46 UTC] No.42565241[source]▶

>>42551470 (TP) #

>> there is little to no value in giving them access to the content

If you are an online shop, for example, isn't it beneficial that ChatGPT can recommend your products? Especially given that people now often consult ChatGPT instead of searching at Google?

replies(1): >>42566597 #

35. frereubu ◴[01 Jan 25 12:46 UTC] No.42565694{5}[source]▶

>>42561614 #

Now we're talking. Some kind of requirement for government-issued ID too.

replies(1): >>42567965 #

36. Kuraj ◴[01 Jan 25 13:23 UTC] No.42565862{3}[source]▶

>>42551619 #

I have not heard the word "webmaster" in such a long time

replies(1): >>42567278 #

37. rchaud ◴[01 Jan 25 15:31 UTC] No.42566597[source]▶

>>42565241 #

> If you are an online shop, for example, isn't it beneficial that ChatGPT can recommend your products?

ChatGPT won't 'recommend' anything that wasn't already recommended in a Reddit post, or on an Amazon page with 5000 reviews.

You have however correctly spotted the market opportunity. Future versions of CGPT with offer the ability to "promote" your eshop in responses, in exchange for money.

38. rchaud ◴[01 Jan 25 15:33 UTC] No.42566610[source]▶

>>42551781 #

No, because the price they'd offer would be insultingly low. The only way to get a good price is to take them to court for prior IP theft (as NYT and others have done), and get lawyers involved to work out a licensing deal.

39. frereubu ◴[01 Jan 25 16:59 UTC] No.42567278{4}[source]▶

>>42565862 #

Deliberately chosen for the nostalgia value :)

40. ◴[01 Jan 25 18:27 UTC] No.42567965{6}[source]▶

>>42565694 #

41. ranger207 ◴[02 Jan 25 05:30 UTC] No.42571945{3}[source]▶

>>42552257 #

What does crypto add here that can't be accomplished with regular payments?

42. tliltocatl ◴[02 Jan 25 07:16 UTC] No.42572442{6}[source]▶

>>42564776 #

Hmm, what about 1kb of carefully crafted gz-bomb? Or a TCP tarpit (this one would be a bit difficult to deploy).

43. buro9 ◴[02 Jan 25 09:38 UTC] No.42573039{4}[source]▶