Cloudflare.com's Robots.txt

(www.cloudflare.com)
145 points by sans_souse | 5 comments
1. ck2 ◴[] No.42165342[source]
easy guess that length breaks some legacy stuff

but every robots.txt should have an auto-ban trap line

i.e. crawl it and die

basically a script that puts the requesting IP into the firewall

of course it's possible to abuse that, so it has to be monitored
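a minimal sketch of that idea (the trap path, port, and iptables call are made up for illustration, and the process needs root to touch the firewall):

    # Disallow a path in robots.txt, e.g.:
    #   User-agent: *
    #   Disallow: /secret-trap/
    # then drop any IP that fetches it anyway.
    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    TRAP_PREFIX = "/secret-trap/"   # hypothetical trap path

    def ban_ip(ip):
        """Insert a DROP rule for the offending IP (requires root)."""
        subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
                       check=False)

    class TrapHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ip = self.client_address[0]
            if self.path.startswith(TRAP_PREFIX):
                ban_ip(ip)              # crawl it and die
                self.send_error(403)
                return
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), TrapHandler).serve_forever()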

replies(2): >>42165349 #>>42166539 #
2. okdood64 ◴[] No.42165349[source]
How do you discern a crawler agent from a human? Is it simply that they might cover something like 80%+ of the site in one visit, fairly quickly?
replies(1): >>42165697 #
3. SoftTalker ◴[] No.42165697[source]
Crawlers/archivers will be hitting your site much faster than a human user.
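a crude way to act on that is a per-IP sliding window; the thresholds below are arbitrary examples, not recommendations:

    # Flag any IP that makes more than MAX_HITS requests within WINDOW seconds.
    import time
    from collections import defaultdict, deque

    WINDOW = 10.0    # seconds
    MAX_HITS = 30    # requests allowed per window per IP

    hits = defaultdict(deque)

    def looks_like_crawler(ip, now=None):
        now = time.monotonic() if now is None else now
        q = hits[ip]
        q.append(now)
        # drop timestamps that have fallen out of the window
        while q and now - q[0] > WINDOW:
            q.popleft()
        return len(q) > MAX_HITS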
4. johneth ◴[] No.42166539[source]
I thought about doing something like that, but then I realised: what if someone linked to the trap URL from another site and a crawler followed that link to the trap?

You might end up penalising Googlebot or Bingbot.

If anyone knew what that trap URL did, and felt malicious, this could happen.

replies(1): >>42171194 #
5. CodesInChaos ◴[] No.42171194[source]
A crawler could easily avoid that by fetching the target domain's robots.txt before fetching the link target. However, a website could also embed the honeypot link in an <img> tag and get the user banned when their browser attempts to load the image.
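the polite-crawler half of that is roughly what Python's standard-library robots.txt parser does; the URL and bot name below are placeholders:

    # Consult the target domain's robots.txt before following an off-site link.
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleBot"   # hypothetical crawler name

    def allowed_to_fetch(url):
        parts = urlparse(url)
        rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()                           # fetch and parse robots.txt
        return rp.can_fetch(USER_AGENT, url)

    link = "https://example.com/secret-trap/page"
    if allowed_to_fetch(link):
        print("fetch", link)
    else:
        print("skip", link, "- disallowed by robots.txt")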