
714 points | blendergeek | 1 comment
kerkeslager No.42727510
Question: do these bots not respect robots.txt?

I haven't added these scrapers to my robots.txt on the sites I work on yet because I haven't seen any problems. I would run something like this on my own websites, but I can't see selling my clients on running this on their websites.

The websites I run generally have a honeypot page that is linked in the headers and disallowed for everyone in robots.txt; if an IP visits that page, it gets added to a blocklist that simply drops its connections without a response for 24 hours.

replies(4): >>42727689 #>>42727693 #>>42727959 #>>42751668 #
Dwedit No.42751668
Even something like a special URL that auto-bans you can be abused by pranksters. Simply embedding an <img> tag that fetches the offending URL could trigger it, as could tricking someone into clicking a link.
replies(2): >>42752781 #>>42755025 #
jesprenj No.42755025
This could be mitigated by embedding a secret token in the honeypot URL that limits both its time validity and the IP address it is valid for, e.g.: http://example/honeypot/hex(sha256(ipaddress | today(yyyy-mm-dd) | secret))

This special URL with the token would sit in an anchor tag somewhere in the footer of every page, hidden by a CSS rule, and a "Disallow: /honeypot" rule would be included in robots.txt.