Cloudflare.com's Robots.txt

(www.cloudflare.com)
145 points by sans_souse | 5 comments
1. ck2 ◴[] No.42165342[source]
easy guess that length breaks some legacy stuff

but every robots.txt should have an auto-ban trap line

i.e. crawl it and die

basically a script that puts the requesting IP into the firewall

of course it's possible to abuse that, so it has to be monitored
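a minimal sketch of that idea (the trap path, port, and iptables call are made up for illustration, and the process needs root to touch the firewall):

    # Disallow a path in robots.txt, e.g.:
    #   User-agent: *
    #   Disallow: /secret-trap/
    # then drop any IP that fetches it anyway.
    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    TRAP_PREFIX = "/secret-trap/"   # hypothetical trap path

    def ban_ip(ip):
        """Insert a DROP rule for the offending IP (requires root)."""
        subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
                       check=False)

    class TrapHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ip = self.client_address[0]
            if self.path.startswith(TRAP_PREFIX):
                ban_ip(ip)              # crawl it and die
                self.send_error(403)
                return
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), TrapHandler).serve_forever()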

replies(2): >>42165349 #>>42166539 #
2. okdood64 ◴[] No.42165349[source]
How do you discern a crawler agent from a human? Is it simply that they might cover something like 80%+ of the site in one visit, fairly quickly?
replies(1): >>42165697 #
3. SoftTalker ◴[] No.42165697[source]
Crawlers/archivers will be hitting your site much faster than a human user.
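a crude way to act on that is a per-IP sliding window; the thresholds below are arbitrary examples, not recommendations:

    # Flag any IP that makes more than MAX_HITS requests within WINDOW seconds.
    import time
    from collections import defaultdict, deque

    WINDOW = 10.0    # seconds
    MAX_HITS = 30    # requests allowed per window per IP

    hits = defaultdict(deque)

    def looks_like_crawler(ip, now=None):
        now = time.monotonic() if now is None else now
        q = hits[ip]
        q.append(now)
        # drop timestamps that have fallen out of the window
        while q and now - q[0] > WINDOW:
            q.popleft()
        return len(q) > MAX_HITS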
4. johneth ◴[] No.42166539[source]
I thought about doing something like that, but then I realised: what if someone linked to the trap URL from another site and a crawler followed that link to the trap?

You might end up penalising Googlebot or Bingbot.

If anyone knew what that trap URL did, and felt malicious, this could happen.

replies(1): >>42171194 #
5. CodesInChaos ◴[] No.42171194[source]
A crawler could easily avoid that by fetching the target domain's robots.txt before fetching the link target. However, a website could also embed the honeypot link in an <img> tag and get the user banned when their browser attempts to load the image.
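the polite-crawler half of that is roughly what Python's standard-library robots.txt parser does; the URL and bot name below are placeholders:

    # Consult the target domain's robots.txt before following an off-site link.
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleBot"   # hypothetical crawler name

    def allowed_to_fetch(url):
        parts = urlparse(url)
        rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()                           # fetch and parse robots.txt
        return rp.can_fetch(USER_AGENT, url)

    link = "https://example.com/secret-trap/page"
    if allowed_to_fetch(link):
        print("fetch", link)
    else:
        print("skip", link, "- disallowed by robots.txt")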