←back to thread

770 points ta988 | 4 comments | | HN request time: 0.001s | source
Show context
buro9 ◴[] No.42551470[source]
Their appetite cannot be quenched, and there is little to no value in giving them access to the content.

I have data... 7d from a single platform with about 30 forums on this instance.

4.8M hits from Claude 390k from Amazon 261k from Data For SEO 148k from Chat GPT

That Claude one! Wowser.

Bots that match this (which is also the list I block on some other forums that are fully private by default):

(?i).(AhrefsBot|AI2Bot|AliyunSecBot|Amazonbot|Applebot|Awario|axios|Baiduspider|barkrowler|bingbot|BitSightBot|BLEXBot|Buck|Bytespider|CCBot|CensysInspect|ChatGPT-User|ClaudeBot|coccocbot|cohere-ai|DataForSeoBot|Diffbot|DotBot|ev-crawler|Expanse|FacebookBot|facebookexternalhit|FriendlyCrawler|Googlebot|GoogleOther|GPTBot|HeadlessChrome|ICC-Crawler|imagesift|img2dataset|InternetMeasurement|ISSCyberRiskCrawler|istellabot|magpie-crawler|Mediatoolkitbot|Meltwater|Meta-External|MJ12bot|moatbot|ModatScanner|MojeekBot|OAI-SearchBot|Odin|omgili|panscient|PanguBot|peer39_crawler|Perplexity|PetalBot|Pinterestbot|PiplBot|Protopage|scoop|Scrapy|Screaming|SeekportBot|Seekr|SemrushBot|SeznamBot|Sidetrade|Sogou|SurdotlyBot|Timpibot|trendictionbot|VelenPublicWebCrawler|WhatsApp|wpbot|xfa1|Yandex|Yeti|YouBot|zgrab|ZoominfoBot).

I am moving to just blocking them all, it's ridiculous.

Everything on this list got itself there by being abusive (either ignoring robots.txt, or not backing off when latency increased).

replies(9): >>42551536 #>>42551667 #>>42551719 #>>42551781 #>>42551798 #>>42551877 #>>42552584 #>>42552786 #>>42565241 #
ai-christianson ◴[] No.42551781[source]
Would you consider giving these crawlers access if they paid you?
replies(5): >>42551804 #>>42552121 #>>42552257 #>>42552452 #>>42566610 #
1. nedrocks ◴[] No.42552257[source]
This is one of the few interesting uses of crypto transactions at reasonable scale in the real world.
replies(3): >>42552580 #>>42553733 #>>42571945 #
2. heavyarms ◴[] No.42552580[source]
What mechanism would make it possible to enforce non-paywalled, non-authenticated access to public web pages? This is a classic "problem of the commons" type of issue.

The AI companies are signing deals with large media and publishing companies to get access to data without the threat of legal action. But nobody is going to voluntarily make deals with millions of personal blogs, vintage car forums, local book clubs, etc. and setup a micro payment system.

Any attempt to force some kind of micro payment or "prove you are not a robot" system will add a lot of friction for actual users and will be easily circumvented. If you are LinkedIn and you can devote a large portion of your R&D budget on this, you can maybe get it to work. But if you're running a blog on stamp collecting, you probably will not.

3. oblio ◴[] No.42553733[source]
Use the ex-hype to kill the new hype?

And the ex-hype would probably fail at that, too :-)

4. ranger207 ◴[] No.42571945[source]
What does crypto add here that can't be accomplished with regular payments?