
714 points blendergeek | 9 comments
1. kerkeslager ◴[] No.42727510[source]
Question: do these bots not respect robots.txt?

I haven't added these scrapers to my robots.txt on the sites I work on yet because I haven't seen any problems. I would run something like this on my own websites, but I can't see selling my clients on running this on their websites.

The websites I run generally have a honeypot page which is linked in the headers and disallowed to everyone in the robots.txt, and if an IP visits that page, they get added to a blocklist which simply drops their connections without response for 24 hours.
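For illustration, here is a minimal sketch of that honeypot-plus-blocklist logic in Python. It is not the actual setup described above: the `/honeypot` path, the `Blocklist` class, and `handle_request` are hypothetical names, and the "drop" result stands in for whatever the server or firewall does to close a connection without a response.

```python
import time

BAN_SECONDS = 24 * 60 * 60   # 24-hour ban, as described in the comment
HONEYPOT_PATH = "/honeypot"  # hypothetical path, also disallowed in robots.txt


class Blocklist:
    """In-memory map of IP address -> time at which the ban expires."""

    def __init__(self, ban_seconds=BAN_SECONDS, clock=time.time):
        self._ban_seconds = ban_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._expiry = {}

    def ban(self, ip):
        self._expiry[ip] = self._clock() + self._ban_seconds

    def is_banned(self, ip):
        expires = self._expiry.get(ip)
        if expires is None:
            return False
        if self._clock() >= expires:
            # Ban has run out; forget the entry.
            del self._expiry[ip]
            return False
        return True


def handle_request(blocklist, ip, path):
    """Return "drop" for banned or honeypot-visiting clients, else "serve"."""
    if blocklist.is_banned(ip):
        return "drop"
    if path == HONEYPOT_PATH or path.startswith(HONEYPOT_PATH + "/"):
        blocklist.ban(ip)
        return "drop"
    return "serve"
```

In a real deployment the blocklist would more likely live in the firewall or a shared store than in process memory, so that bans survive restarts and apply across workers.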

replies(4): >>42727689 #>>42727693 #>>42727959 #>>42751668 #
2. throw_m239339 ◴[] No.42727689[source]
> Question: do these bots not respect robots.txt?

No, they don't, because there is no potential legal liability for not respecting that file in most countries.

3. jonatron ◴[] No.42727693[source]
You haven't seen any problems because you created a solution to the problem!
replies(1): >>42750514 #
4. 0xf00ff00f ◴[] No.42727959[source]
> The websites I run generally have a honeypot page which is linked in the headers and disallowed to everyone in the robots.txt, and if an IP visits that page, they get added to a blocklist which simply drops their connections without response for 24 hours.

I love this idea!

replies(1): >>42732436 #
5. griomnib ◴[] No.42732436[source]
Yeah, this is elegant as fuck.
6. kerkeslager ◴[] No.42750514[source]
Well, I wasn't the original developer for every site I work on, so some of them don't have this implemented.
7. Dwedit ◴[] No.42751668[source]
Even something like a special URL that auto-bans you can be abused by pranksters. Simply embedding an <img> tag that fetches the offending URL could trigger it, as well as tricking people into clicking a link.
replies(2): >>42752781 #>>42755025 #
8. kerkeslager ◴[] No.42752781[source]
Ehhh, is there any reason I should be worried about that? The <img> tag would have to be in a spot where users are likely to go, otherwise users will never view the <img> tag. A link of any kind to the honeypot isn't likely to, for example, go viral on social media, because it's going to appear as a broken link/image and nobody will upvote it. I'm not seeing an attack vector that gets this link in front of my users with enough frequency to be worth considering.

A bigger concern is arguably users who are all behind the same IP address, e.g. some of the sites I work on have employee-only parts which can only be accessed via VPN, so in theory one employee could get the whole company banned, and that would be tricky to figure out. So far that hasn't been a problem, but now that I'm thinking about it, maybe I should have a whitelist override for that. :)
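A whitelist override along those lines could look roughly like the sketch below. The network range is a documentation-only placeholder standing in for a company VPN's egress addresses, and `is_allowlisted` and `should_ban` are hypothetical helpers, not part of any existing setup.

```python
import ipaddress

# Hypothetical allowlist: 203.0.113.0/24 is a reserved documentation range,
# used here in place of a real VPN egress range.
ALLOWLIST = [ipaddress.ip_network("203.0.113.0/24")]


def is_allowlisted(ip, networks=ALLOWLIST):
    """True if the address falls inside any allowlisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def should_ban(ip, path, honeypot_path="/honeypot"):
    """Ban honeypot visitors unless they come from an allowlisted network."""
    hit = path == honeypot_path or path.startswith(honeypot_path + "/")
    return hit and not is_allowlisted(ip)
```

Checking the allowlist before writing to the blocklist means a shared address never gets banned in the first place, which avoids having to debug a company-wide outage after the fact.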

9. jesprenj ◴[] No.42755025[source]
This could be mitigated by having a special secret token in this honeypot URL that limits the time validity of the honeypot URL and limits the IP address that this URL is for, let's say: http://example/honeypot/hex(sha256(ipaddress | today(yyyy-mm-dd) | secret))

This special URL with the token would be in an anchor tag somewhere in the footer of every website, but hidden by a CSS rule, and a "Disallow: /honeypot" rule would be included in robots.txt.
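A rough Python sketch of such a per-IP, per-day token follows. One deliberate swap: it uses HMAC-SHA256 keyed with the secret rather than a plain SHA-256 over the concatenated fields, which is the more conventional way to mix a secret into a hash. The function names and the `/honeypot/` prefix are assumptions for illustration.

```python
import hashlib
import hmac
from datetime import date

HONEYPOT_PREFIX = "/honeypot/"  # matches the "Disallow: /honeypot" rule


def honeypot_token(ip, day, secret):
    """Hex token bound to one IP address and one calendar day."""
    message = f"{ip}|{day.isoformat()}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def honeypot_url(ip, day, secret):
    """Path to embed in the hidden footer link for this visitor."""
    return HONEYPOT_PREFIX + honeypot_token(ip, day, secret)


def is_valid_honeypot_hit(path, ip, day, secret):
    """Only ban when the token matches the requesting IP and today's date."""
    if not path.startswith(HONEYPOT_PREFIX):
        return False
    presented = path[len(HONEYPOT_PREFIX):]
    expected = honeypot_token(ip, day, secret)
    # Compare as bytes in constant time; encoding avoids errors on odd input.
    return hmac.compare_digest(presented.encode(), expected.encode())
```

With this in place, a honeypot URL copied into an `<img>` tag on another site would carry a token for the copier's IP address, so a third party loading it would fail the check and would not be banned.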