    646 points blendergeek | 12 comments
    1. jsheard ◴[] No.42726726[source]
    This is a really bad take; it's not as if this server is hacking the clients that connect to it. It serves perfectly valid HTTP responses that just happen to be slow and full of Markov gibberish. Any harm that comes of that is self-inflicted by assuming that websites must provide valuable data as a matter of course.
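
    For illustration, here is a minimal Python sketch of the mechanism being described (this is not the Nepenthes code; the port, word list, and per-word delay are invented, and a random word choice stands in for a real Markov generator):

        # Sketch of a tarpit: a perfectly valid HTTP response that is
        # merely slow and full of generated filler text.
        import random
        import time
        from http.server import BaseHTTPRequestHandler, HTTPServer

        WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]

        class TarpitHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                self.send_response(200)                  # valid status line
                self.send_header("Content-Type", "text/html")
                self.end_headers()
                # Drip out filler one word at a time; nothing here violates
                # HTTP, the response is simply slow and worthless to scrape.
                for _ in range(100):
                    self.wfile.write((random.choice(WORDS) + " ").encode())
                    self.wfile.flush()
                    time.sleep(1)                        # the "tarpit" delay

        if __name__ == "__main__":
            HTTPServer(("", 8080), TarpitHandler).serve_forever()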

    If AI companies want to sue webmasters for that, then by all means, let them waste their money and get laughed out of court.

    replies(3): >>42726813 #>>42726898 #>>42729375 #
    2. ◴[] No.42726841{3}[source]
    3. bwfan123 ◴[] No.42726898[source]
    Yeah, it comes across as an extremely entitled, mobster-style take.

    Heads I win, tails you lose: we own all your content, and you'd better behave.

    I'd bet this is incentive-speak.

    replies(1): >>42727262 #
    4. grajaganDev ◴[] No.42726912[source]
    Are you a lawyer?
    replies(1): >>42727516 #
    5. blibble ◴[] No.42726981{3}[source]
    > If you want to protect your content, use the technical mechanisms that are available,

    > You can choose to gatekeep your content, and by doing so, make it unscrapeable, and legally protected.

    so... robots.txt, which the AI parasites ignore?
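
    For context, here is roughly the check a well-behaved crawler performs before fetching a page, sketched with Python's standard robotparser module (the domain and bot name are placeholders):

        # A polite crawler asks robots.txt first and walks away from any
        # Disallow rule; the complaint here is that many AI crawlers skip
        # this check entirely.
        from urllib import robotparser

        rp = robotparser.RobotFileParser()
        rp.set_url("https://example.com/robots.txt")
        rp.read()                                  # fetch and parse the rules

        if rp.can_fetch("ExampleBot", "https://example.com/private/page"):
            print("allowed to crawl")
        else:
            print("disallowed; a polite crawler stops here")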

    > Also, consider that relatively small, cheap llms are able to parse the difference between meaningful content and Markovian jabber such as this software produces.

    Okay, so it's not damaging, and there you've refuted your entire argument.

    replies(1): >>42727385 #
    6. jsheard ◴[] No.42727632{5}[source]
    > No, put up a loginwall or paywall, authenticate users, and go private.

    We know for a fact that AI companies don't respect that; if they want data that's behind a paywall, they'll jump through hoops to take it anyway.

    https://www.theguardian.com/technology/2025/jan/10/mark-zuck...

    If they don't have to abide by "norms", then we don't have to abide by them for their sake. Fuck 'em.

    replies(1): >>42727731 #
    7. blibble ◴[] No.42727804{7}[source]
    This is a very US-centric view of the world.

    My site is not in the US and I am not a US citizen; US law does not apply to me.

    Under UK law, robots.txt is an access control mechanism (weak or otherwise).

    Knowingly bypassing it is likely a criminal offence under the Computer Misuse Act.

    Good luck suing me because you got stuck after smashing my window and climbing through it.

    8. tofof ◴[] No.42728003{3}[source]
    He's not interfering with any normal operation of any system. He is offering links. You can follow them or not, entirely at your own discretion. Those links load slowly. You can wait for them to complete or not, entirely at your own discretion.

    The crawler's normal operation is not interfered with in any way: the crawler does exactly what it's programmed to do. If its programmers decided it should exhaustively follow links, he's not preventing it from doing that operation.

    Legally, at best you'd be looking to warp the concept of attractive nuisance to apply to a crawler. As that legal concept is generally intended to prevent bodily harm to children, however, good luck.

    9. tofof ◴[] No.42728123{4}[source]
    Please provide a citation for a law that prohibits me from publicly offering a service that consumes time while it is voluntarily engaged with.
    10. tir ◴[] No.42728192{7}[source]
    > the law explicitly allows scraping and crawling.

    Nepenthes also allows scraping and crawling, for as long as you like.

    11. jazzyjackson ◴[] No.42729369{3}[source]
    I broadly agree with what you're trying to get across here, but I don't see why I can't set my own standards for what use of my server is authorized or not.

    If I publish content at my domain, I can set up blocklists to refuse access to IP ranges I consider more likely to be malicious than not. Isn't that already a break from the social contract you're pointing to about serving content to the public, since I'm picking and choosing which parts of the public get a response from my server? (I would also be interested to know whether there is actual law here, as opposed to social contracts around behavior.) So why shouldn't I be able to enforce expectations on how my server is used? The vigilantism aspect, actively harming the person who breaks the rules, is another matter; I'm on the fence there.
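
    To make the blocklist idea concrete, a minimal sketch using Python's ipaddress module (the ranges below are reserved documentation networks, stand-ins for whatever an operator actually considers malicious):

        # Refuse service to clients whose address falls in a blocked range.
        import ipaddress

        BLOCKED_RANGES = [
            ipaddress.ip_network("203.0.113.0/24"),   # placeholder range
            ipaddress.ip_network("198.51.100.0/24"),  # placeholder range
        ]

        def is_blocked(client_ip: str) -> bool:
            """True if the operator has chosen to refuse this client."""
            addr = ipaddress.ip_address(client_ip)
            return any(addr in net for net in BLOCKED_RANGES)

        print(is_blocked("203.0.113.7"))  # True: request refused
        print(is_blocked("192.0.2.1"))    # False: request served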

    Consider the standard warning posted to most government sites, which is more or less a "no trespassing" sign [0] informing anyone accessing the system what their expectations should be and what counts as authorized use. I suppose it's not a legally binding contract to say "you agree to these terms by requesting this URL", but I'm pretty sure convictions have happened for hackers who did not have a contract with the service provider.

    [0] https://ir.nist.gov/

    12. jazzyjackson ◴[] No.42729375[source]
    I guess it's an unpopular take, but I don't see why it was flagged; it's a good point of discussion.