770 points ta988 | 46 comments
1. buro9 ◴[] No.42551470[source]
Their appetite cannot be quenched, and there is little to no value in giving them access to the content.

I have data... 7d from a single platform with about 30 forums on this instance.

4.8M hits from Claude
390k from Amazon
261k from Data For SEO
148k from ChatGPT

That Claude one! Wowser.

Bots that match this (which is also the list I block on some other forums that are fully private by default):

(?i).*(AhrefsBot|AI2Bot|AliyunSecBot|Amazonbot|Applebot|Awario|axios|Baiduspider|barkrowler|bingbot|BitSightBot|BLEXBot|Buck|Bytespider|CCBot|CensysInspect|ChatGPT-User|ClaudeBot|coccocbot|cohere-ai|DataForSeoBot|Diffbot|DotBot|ev-crawler|Expanse|FacebookBot|facebookexternalhit|FriendlyCrawler|Googlebot|GoogleOther|GPTBot|HeadlessChrome|ICC-Crawler|imagesift|img2dataset|InternetMeasurement|ISSCyberRiskCrawler|istellabot|magpie-crawler|Mediatoolkitbot|Meltwater|Meta-External|MJ12bot|moatbot|ModatScanner|MojeekBot|OAI-SearchBot|Odin|omgili|panscient|PanguBot|peer39_crawler|Perplexity|PetalBot|Pinterestbot|PiplBot|Protopage|scoop|Scrapy|Screaming|SeekportBot|Seekr|SemrushBot|SeznamBot|Sidetrade|Sogou|SurdotlyBot|Timpibot|trendictionbot|VelenPublicWebCrawler|WhatsApp|wpbot|xfa1|Yandex|Yeti|YouBot|zgrab|ZoominfoBot).*

I am moving to just blocking them all, it's ridiculous.

Everything on this list got itself there by being abusive (either ignoring robots.txt, or not backing off when latency increased).

replies(9): >>42551536 #>>42551667 #>>42551719 #>>42551781 #>>42551798 #>>42551877 #>>42552584 #>>42552786 #>>42565241 #
2. coldpie ◴[] No.42551536[source]
You know, at this point, I wonder if an allowlist would work better.
replies(3): >>42551619 #>>42551671 #>>42552608 #
3. frereubu ◴[] No.42551619[source]
I love (hate) the idea of a site where you need to send a personal email to the webmaster to be whitelisted.
replies(2): >>42552764 #>>42565862 #
4. pogue ◴[] No.42551667[source]
What do you use to block them?
replies(1): >>42551696 #
5. buro9 ◴[] No.42551671[source]
I have thought about writing such a thing...

1. A proxy that looks at HTTP Headers and TLS cipher choices

2. An allowlist that records which browsers send which headers and selects which ciphers

3. A dynamic loading of the allowlist into the proxy at some given interval

New browser versions or OS updates would require updating the allowlist, but I'm not sure that's too inconvenient, and it could be done via GitHub so people could submit new combinations.

I'd rather just say "I trust real browsers" and dump the rest.
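
Roughly, steps 1 and 2 could look something like this (a sketch only; the allowlist file, the field names, and the idea of keying on a JA3-style TLS fingerprint are my assumptions, not a working implementation):

    # Sketch: allowlist check for a filtering proxy. Assumes the proxy can
    # see each request's header names and a TLS fingerprint (e.g. a JA3
    # hash), and that the allowlist is a JSON file of known browser combos.
    import json
    import threading
    import time

    ALLOWLIST_PATH = "browser-allowlist.json"  # hypothetical file
    _allowlist = {"browsers": []}

    def _reload_forever(interval=300):
        # Step 3: re-read the allowlist on an interval so new
        # browser/OS combinations are picked up without a restart.
        global _allowlist
        while True:
            try:
                with open(ALLOWLIST_PATH) as f:
                    _allowlist = json.load(f)
            except OSError:
                pass  # keep the previous copy if the file is unreadable
            time.sleep(interval)

    threading.Thread(target=_reload_forever, daemon=True).start()

    def is_real_browser(user_agent, header_names, tls_fingerprint):
        # Steps 1 and 2: only pass requests whose headers and TLS
        # fingerprint match a known browser entry; dump everything else.
        for entry in _allowlist.get("browsers", []):
            if (entry["ua_substring"] in user_agent
                    and set(entry["required_headers"]) <= set(header_names)
                    and tls_fingerprint in entry["tls_fingerprints"]):
                return True
        return False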

Also, I noticed a far simpler block: just reject almost every request whose UA claims to be "compatible".

replies(2): >>42552152 #>>42575828 #
6. buro9 ◴[] No.42551696[source]
Nginx; it's nothing special, it's just my load balancer.

if ($http_user_agent ~* (list|of|case|insensitive|things|to|block)) {return 403;}
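
If it helps, here's a quick way to sanity-check the pattern before deploying it (a Python sketch; the alternation is shortened and the sample UAs are just illustrative):

    # Sanity-check the block pattern against sample UAs before deploying.
    import re

    # Shortened version of the alternation used in the nginx rule.
    BLOCK_RE = re.compile(r"(ClaudeBot|GPTBot|Amazonbot|Bytespider|CCBot)", re.IGNORECASE)

    samples = [
        "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    ]

    for ua in samples:
        print("403 " if BLOCK_RE.search(ua) else "pass", ua)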

replies(2): >>42552020 #>>42555075 #
7. Mistletoe ◴[] No.42551719[source]
This is a new twist on the Dead Internet Theory I hadn’t thought of.
replies(1): >>42562204 #
8. ai-christianson ◴[] No.42551781[source]
Would you consider giving these crawlers access if they paid you?
replies(5): >>42551804 #>>42552121 #>>42552257 #>>42552452 #>>42566610 #
9. vunderba ◴[] No.42551798[source]
There's also a popular repository that maintains a comprehensive list of LLM- and AI-related bots to aid in blocking these abusive strip miners.

https://github.com/ai-robots-txt/ai.robots.txt

replies(1): >>42575268 #
10. buro9 ◴[] No.42551804[source]
At this point, no.
11. Aeolun ◴[] No.42551877[source]
You're just plain blocking anyone using Node from programmatically accessing your content with Axios?
replies(2): >>42552002 #>>42594675 #
12. buro9 ◴[] No.42552002[source]
Apparently yes.

If a more specific UA hasn't been set (and the library doesn't force people to set one), then the library that has been the source of abusive behaviour gets blocked.

No loss to me.

13. gs17 ◴[] No.42552020{3}[source]
From the article:

> If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).

It would be interesting if you had any data about this, since you seem like you would notice who behaves "better" and who tries every trick to get around blocks.

replies(1): >>42564061 #
14. petee ◴[] No.42552121[source]
Interesting idea, though I doubt they'd ever offer a reasonable amount for it. But doesn't it also change a site's legal stance if you're now selling your users' content/data? I think it would also repel a number of users from your service.
15. qazxcvbnmlp ◴[] No.42552152{3}[source]
All of this can be programmatically simulated by a bot with bad intentions. It becomes a cat-and-mouse game of finding behaviors that differentiate bot from not-bot and patching them.

To truly say “I trust real browsers” requires a signal of integrity for the user and browser, such as cryptographic device attestation of the browser... which has to be centrally verified. Which is also not great.

replies(1): >>42552259 #
16. nedrocks ◴[] No.42552257[source]
This is one of the few interesting uses of crypto transactions at reasonable scale in the real world.
replies(3): >>42552580 #>>42553733 #>>42571945 #
17. coldpie ◴[] No.42552259{4}[source]
> Everything on this can be programmatically simulated by a bot with bad intentions. It will be a cat and mouse game of finding behaviors that differentiate between bot and not and patching them.

Forcing Facebook & Co to play the adversary role still seems like an improvement over the current situation. They're clearly operating illegitimately if they start spoofing real user agents to get around bot blocking capabilities.

replies(1): >>42552775 #
18. ◴[] No.42552452[source]
19. heavyarms ◴[] No.42552580{3}[source]
What mechanism would make it possible to enforce payment for access to public web pages that aren't paywalled or behind authentication? This is a classic "tragedy of the commons" type of issue.

The AI companies are signing deals with large media and publishing companies to get access to data without the threat of legal action. But nobody is going to voluntarily make deals with millions of personal blogs, vintage car forums, local book clubs, etc. and set up a micropayment system.

Any attempt to force some kind of micropayment or "prove you are not a robot" system will add a lot of friction for actual users and will be easily circumvented. If you are LinkedIn and can devote a large portion of your R&D budget to this, you can maybe get it to work. But if you're running a blog on stamp collecting, you probably will not.

20. jprete ◴[] No.42552584[source]
I hope this is working out for you; the original article indicates that at least some of these crawlers move to innocuous user agent strings and change IPs if they get blocked or rate-limited.
21. jprete ◴[] No.42552608[source]
If you mean user-agent-wise, I think real users vary too much to do that.

That could also be a user login, maybe, with per-user rate limits. I expect that bot runners could find a way to break that, but at least it's extra engineering effort on their part, and they may not bother until enough sites force the issue.
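
A per-user limit doesn't need much machinery either; a minimal token-bucket sketch (the function name and numbers are arbitrary examples):

    # Minimal per-user token bucket: each account gets RATE requests/second
    # with bursts up to BURST; a bot hammering one account runs dry quickly.
    import time
    from collections import defaultdict

    RATE, BURST = 2.0, 20.0  # arbitrary example numbers
    _buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow(user_id):
        b = _buckets[user_id]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
        b["last"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False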

22. smolder ◴[] No.42552764{3}[source]
We just need a browser plugin to auto-email webmasters to request access, and wait for the follow-up "access granted" email. It could be powered by AI.
replies(1): >>42561614 #
23. Terr_ ◴[] No.42552775{5}[source]
I'm imagining a quixotic terms of service, where "by continuing" any bot access grants the site-owner a perpetual and irrevocable license to use and relicense all data, works, or other products resulting from any use of the crawled content, including but not limited to cases where that content was used in a statistical text generative model.
24. iLoveOncall ◴[] No.42552786[source]
4.8M requests sounds huge, but if it's over 7 days and especially split amongst 30 websites, it's only a TPS of 0.26, not exactly very high or even abusive.

The fact that you choose to host 30 websites on the same instance is irrelevant, those AI bots scan websites, not servers.

This has been a recurring pattern I've seen in people complaining about AI bots crawling their website: huge number of requests but actually a low TPS once you dive a bit deeper.
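
(For reference, the arithmetic: 4.8M requests ÷ (7 × 86,400 s) ≈ 7.9 req/s for the whole instance, or ≈ 0.26 req/s per site if spread evenly across the 30 forums.)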

replies(1): >>42552904 #
25. buro9 ◴[] No.42552904[source]
It's never that smooth.

In fact 2M requests arrived on December 23rd from Claude alone for a single site.

An average of 25 qps is definitely an issue; these are all long-tail dynamic pages.

replies(1): >>42555164 #
26. oblio ◴[] No.42553733{3}[source]
Use the ex-hype to kill the new hype?

And the ex-hype would probably fail at that, too :-)

27. l1n ◴[] No.42555075{3}[source]
403 is generally a bad way to get crawlers to go away - https://developers.google.com/search/blog/2023/02/dont-404-m... suggests a 500, 503, or 429 HTTP status code.
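
For illustration, the 429 route might look something like this in a toy backend (a sketch only; the over-limit check is a placeholder, not a real rate limiter):

    # Sketch: answer over-limit crawler requests with 429 + Retry-After
    # instead of 403, per the linked guidance.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.over_limit():
                self.send_response(429)                  # Too Many Requests
                self.send_header("Retry-After", "3600")  # ask them to back off
                self.end_headers()
                return
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

        def over_limit(self):
            return True  # placeholder: plug in a real rate limiter here

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
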
replies(2): >>42562172 #>>42573039 #
28. l1n ◴[] No.42555164{3}[source]
Curious what your robots.txt looked like, if you have a link?
29. ndileas ◴[] No.42561614{4}[source]
Then someone will require a notarized statement of intent before you can read the recipe blog.
replies(1): >>42565694 #
30. vultour ◴[] No.42562172{4}[source]
That article describes the exact behaviour you want from the AI crawlers. If you let them know they’re rate limited they’ll just change IP or user agent.
31. Dilettante_ ◴[] No.42562204[source]
We'll have two entirely separate (dead) internets! One for real hosts who will only get machine users, and one for real users who only get machine content!

Wait, that seems disturbingly conceivable with the way things are going right now. *shudder*

32. Libcat99 ◴[] No.42564061{4}[source]
Switching to sending wrong, inexpensive data might be preferable to blocking them.

I've used this with VoIP scanners.

replies(1): >>42564776 #
33. buro9 ◴[] No.42564776{5}[source]
Oh I did this with the Facebook one and redirected them to a 100MB file of garbage that is part of the Cloudflare speed test... they hit this so many times that it would've been 2PB sent in a matter of hours.

I contacted the network team at Cloudflare to apologise and also to confirm whether Facebook did actually follow the redirect... it's hard for Cloudflare to spot 2PB (that kind of number is too small on a global scale when it occurred over a few hours), but given that a single PoP would have handled all of it, it would have been visible there.

It was not visible, which means we can conclude that Facebook were not following redirects, or if they were, they were just queuing it for later and would only hit it once and not multiple times.

replies(1): >>42572442 #
34. EVa5I7bHFq9mnYK ◴[] No.42565241[source]
>> there is little to no value in giving them access to the content

If you are an online shop, for example, isn't it beneficial that ChatGPT can recommend your products? Especially given that people now often consult ChatGPT instead of searching on Google?

replies(1): >>42566597 #
35. frereubu ◴[] No.42565694{5}[source]
Now we're talking. Some kind of requirement for government-issued ID too.
replies(1): >>42567965 #
36. Kuraj ◴[] No.42565862{3}[source]
I have not heard the word "webmaster" in such a long time
replies(1): >>42567278 #
37. rchaud ◴[] No.42566597[source]
> If you are an online shop, for example, isn't it beneficial that ChatGPT can recommend your products?

ChatGPT won't 'recommend' anything that wasn't already recommended in a Reddit post, or on an Amazon page with 5000 reviews.

You have however correctly spotted the market opportunity. Future versions of ChatGPT will offer the ability to "promote" your eshop in responses, in exchange for money.

38. rchaud ◴[] No.42566610[source]
No, because the price they'd offer would be insultingly low. The only way to get a good price is to take them to court for prior IP theft (as NYT and others have done), and get lawyers involved to work out a licensing deal.
39. frereubu ◴[] No.42567278{4}[source]
Deliberately chosen for the nostalgia value :)
40. ◴[] No.42567965{6}[source]
41. ranger207 ◴[] No.42571945{3}[source]
What does crypto add here that can't be accomplished with regular payments?
42. tliltocatl ◴[] No.42572442{6}[source]
Hmm, what about 1kb of carefully crafted gz-bomb? Or a TCP tarpit (this one would be a bit difficult to deploy).
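
For what it's worth, the plain-gzip version of that is only a few lines (a sketch; straight gzip of zeros tops out around 1000:1, so the ~1kb variants need nested or hand-crafted streams):

    # Sketch: ~1 MB of gzip-compressed zeros that inflates to ~1 GB on the
    # client when served with "Content-Encoding: gzip".
    import gzip
    import io

    def make_gzip_bomb(inflated_size=1024**3):
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
            chunk = b"\0" * (1024 * 1024)
            for _ in range(inflated_size // len(chunk)):
                gz.write(chunk)
        return buf.getvalue()  # roughly 1 MB for 1 GB of zeros
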
43. buro9 ◴[] No.42573039{4}[source]
> 403 is generally a bad way to get crawlers to go away

Hardly... the linked article says that a 403 will cause Google to stop crawling and remove content... that's the desired outcome.

I'm not trying to rate limit, I'm telling them to go away.

44. Pooge ◴[] No.42575268[source]
I didn't know about this. Thank you!

After some digging, I also found a great way to surprise bots that don't respect robots.txt[1] :)

[1]: https://melkat.blog/p/unsafe-pricing

45. gkbrk ◴[] No.42575828{3}[source]
This is Cloudflare with extra steps
46. phito ◴[] No.42594675[source]
Why not?