
770 points ta988 | 1 comment | source
buro9 ◴[] No.42551470[source]
Their appetite cannot be sated, and there is little to no value in giving them access to the content.

I have data: 7 days of traffic from a single platform with about 30 forums on this instance.

4.8M hits from Claude
390k from Amazon
261k from Data For SEO
148k from ChatGPT

That Claude one! Wowser.

Bots that match this (which is also the list I block on some other forums that are fully private by default):

(?i).*(AhrefsBot|AI2Bot|AliyunSecBot|Amazonbot|Applebot|Awario|axios|Baiduspider|barkrowler|bingbot|BitSightBot|BLEXBot|Buck|Bytespider|CCBot|CensysInspect|ChatGPT-User|ClaudeBot|coccocbot|cohere-ai|DataForSeoBot|Diffbot|DotBot|ev-crawler|Expanse|FacebookBot|facebookexternalhit|FriendlyCrawler|Googlebot|GoogleOther|GPTBot|HeadlessChrome|ICC-Crawler|imagesift|img2dataset|InternetMeasurement|ISSCyberRiskCrawler|istellabot|magpie-crawler|Mediatoolkitbot|Meltwater|Meta-External|MJ12bot|moatbot|ModatScanner|MojeekBot|OAI-SearchBot|Odin|omgili|panscient|PanguBot|peer39_crawler|Perplexity|PetalBot|Pinterestbot|PiplBot|Protopage|scoop|Scrapy|Screaming|SeekportBot|Seekr|SemrushBot|SeznamBot|Sidetrade|Sogou|SurdotlyBot|Timpibot|trendictionbot|VelenPublicWebCrawler|WhatsApp|wpbot|xfa1|Yandex|Yeti|YouBot|zgrab|ZoominfoBot).*

I am moving to just blocking them all; it's ridiculous.

Everything on this list got itself there by being abusive (either ignoring robots.txt, or not backing off when latency increased).
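For the "not backing off when latency increased" part, the usual first resort before outright blocking is per-IP rate limiting. A minimal nginx limit_req sketch (zone name and rates are illustrative; and as noted further down the thread, crawlers that rotate IPs sidestep per-IP limits):

    # http {} context: track clients by IP, allow roughly 1 request/second each
    limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

    server {
        listen 80;
        location / {
            # absorb short bursts, reject sustained overruns with 429
            limit_req zone=perip burst=20 nodelay;
            limit_req_status 429;
        }
    }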

replies(9): >>42551536 #>>42551667 #>>42551719 #>>42551781 #>>42551798 #>>42551877 #>>42552584 #>>42552786 #>>42565241 #
pogue ◴[] No.42551667[source]
What do you use to block them?
replies(1): >>42551696 #
buro9 ◴[] No.42551696[source]
Nginx; it's nothing special, it's just my load balancer.

    # ~* is a case-insensitive regex match on the User-Agent header
    if ($http_user_agent ~* "(list|of|case|insensitive|things|to|block)") { return 403; }
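For a list this long, the more idiomatic nginx shape is a map in the http block, which keeps the if trivial. A minimal sketch, with the variable name illustrative and the list abridged to a few entries from above:

    # http {} context: flag any UA matching the block list (~* = case-insensitive)
    map $http_user_agent $blocked_ua {
        default 0;
        "~*(AhrefsBot|Bytespider|ClaudeBot|GPTBot|Perplexity)" 1;
    }

    server {
        listen 80;
        # reject flagged crawlers before any other processing
        if ($blocked_ua) {
            return 403;
        }
    }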

replies(2): >>42552020 #>>42555075 #
gs17 ◴[] No.42552020[source]
From the article:

> If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).

It would be interesting if you had any data about this, since you seem like you would notice who behaves "better" and who tries every trick to get around blocks.

replies(1): >>42564061 #
Libcat99 ◴[] No.42564061[source]
Switching to sending wrong, inexpensive data might be preferable to blocking them.

I've used this with VoIP scanners.
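In nginx that can be as simple as answering matched user agents with a tiny canned page instead of a 403. A sketch (UA list abridged, payload illustrative):

    location / {
        default_type text/html;
        # cheap to serve, worthless to scrape or train on
        if ($http_user_agent ~* "(ClaudeBot|GPTBot|Bytespider)") {
            return 200 "<html><body><p>Nothing to see here.</p></body></html>";
        }
        # normal handling for everyone else continues below
    }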

replies(1): >>42564776 #
buro9 ◴[] No.42564776[source]
Oh, I did this with the Facebook one and redirected it to a 100MB file of garbage that is part of the Cloudflare speed test... they hit it so many times that it would have been 2PB sent in a matter of hours.
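The redirect itself is a one-liner in the same nginx setup. A sketch of the shape of it (the UA and target URL are illustrative, not necessarily the exact ones used here):

    # bounce the crawler to a large file hosted on someone else's (consenting) CDN
    if ($http_user_agent ~* "facebookexternalhit") {
        return 302 https://speed.cloudflare.com/__down?bytes=104857600;  # ~100MB
    }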

I contacted the network team at Cloudflare to apologise, and also to confirm whether Facebook did actually follow the redirect. It's hard for Cloudflare to see 2PB; that kind of number is too small on a global scale when it occurs over a few hours. But given that only a single PoP would have handled it, it would have been visible.

It was not visible, which means we can conclude that Facebook were not following redirects; or, if they were, they were queuing them for later and would only hit the file once rather than repeatedly.

replies(1): >>42572442 #
tliltocatl ◴[] No.42572442[source]
Hmm, what about 1kB of carefully crafted gzip bomb? Or a TCP tarpit (that one would be a bit more difficult to deploy)?
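On the gzip bomb idea: plain DEFLATE tops out at roughly 1032:1, so 1kB of gzip only inflates to about 1MB; a real bomb needs a bigger compressed file (10GB of zeros gzips down to around 10MB). Serving one from nginx is simple, though. A sketch with illustrative paths, assuming bomb.gz was prepared offline:

    # serve a pre-compressed payload as ordinary gzipped HTML;
    # a well-behaved client will try to inflate the lot in memory
    location = /trap.html {
        gzip off;                          # don't re-compress it
        default_type text/html;
        add_header Content-Encoding gzip;
        alias /var/www/traps/bomb.gz;      # e.g. 10GB of zeros -> ~10MB .gz
    }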