
714 points | blendergeek | 1 comment
kerkeslager No.42727510
Question: do these bots not respect robots.txt?

I haven't added these scrapers to my robots.txt on the sites I work on yet because I haven't seen any problems. I would run something like this on my own websites, but I can't see selling my clients on running this on their websites.

The websites I run generally have a honeypot page that is linked in the headers and disallowed for everyone in robots.txt; if an IP visits that page, it gets added to a blocklist that simply drops its connections without a response for 24 hours.

replies(4): >>42727689 #>>42727693 #>>42727959 #>>42751668 #
Dwedit No.42751668
Even something like a special URL that auto-bans you can be abused by pranksters. Simply embedding an <img> tag that fetches the offending URL could trigger it, as could tricking someone into clicking a link.
replies(2): >>42752781 #>>42755025 #
jesprenj No.42755025
This could be mitigated by embedding a secret token in the honeypot URL that limits both its time validity and the IP address it is valid for, e.g.: http://example/honeypot/hex(sha256(ipaddress | today(yyyy-mm-dd) | secret))

This special URL with the token would sit in an anchor tag somewhere in the footer of every page, hidden by a CSS rule, and a "Disallow: /honeypot" rule would be included in robots.txt.