Cloudflare.com's Robots.txt

(www.cloudflare.com)
145 points sans_souse | 7 comments
1. yapyap ◴[] No.42164094[source]
That’s cool, if any scrapers still respect robots.txt, that is.
replies(4): >>42164168 #>>42165000 #>>42165017 #>>42165663 #
2. bityard ◴[] No.42164168[source]
Think of robots.txt as less of a no-trespassing sign and more of a "You can visit, but here are the rules to follow if you don't want to get shot" sign.
replies(2): >>42165338 #>>42165715 #
3. dartos ◴[] No.42165000[source]
I was surprised any ever did, honestly.
4. marginalia_nu ◴[] No.42165017[source]
They may or may not, though respecting robots.txt is a nice way of not having your IP range end up on blacklists. With Cloudflare in particular, getting off one can be a bit of a pain.

They're pretty nice to deal with if you're upfront about what you are doing and clearly identify your bot, as well as register it with their bot detection. There's a form floating around somewhere for that.
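
To illustrate the "clearly identify your bot" and "respect robots.txt" parts, here is a minimal sketch using Python's standard `urllib.robotparser`. The user-agent string, the URLs, and the sample rules are all made up for the example (they are not Cloudflare's actual robots.txt), and a real crawler would fetch the live file rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity: a name, a version, and a URL describing the bot.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"

# Made-up rules for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules allow our user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(may_fetch(parser, "https://example.com/public/page"))
print(may_fetch(parser, "https://example.com/private/data"))
print(parser.crawl_delay(USER_AGENT))
```

The same `USER_AGENT` string would also go in the `User-Agent` request header, so the site operator can tell who is crawling and how to reach them.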

5. iterance ◴[] No.42165338[source]
If you do not respect the sign I shall be very cross with you. Very cross indeed. Perhaps I shall have to glare at you, yes, very hard. I think I shall glare at you. Perhaps if you are truly irritating I shall be forced to remove you from the premises for a bit.
6. andrethegiant ◴[] No.42165663[source]
FWIW, that’s why I’m working on a platform[1] to help devs deploy polite crawlers and scrapers out of the box that respect robots.txt (and 429s, Retry-After response headers, etc). It also happens to be entirely built on Cloudflare.

[1] https://crawlspace.dev
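
For anyone wondering what honoring 429s and Retry-After looks like in practice, here is a rough sketch in Python. It is not how Crawlspace does it; the function names and the 30-second fallback are assumptions for the example. Retry-After can be either a number of seconds or an HTTP date, so the helper handles both.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional

# Assumed fallback when the header is missing or unparseable; not from any spec.
DEFAULT_BACKOFF_SECONDS = 30.0


def retry_after_seconds(value: Optional[str], now: Optional[datetime] = None) -> float:
    """Convert a Retry-After header value into a wait time in seconds."""
    if value is None:
        return DEFAULT_BACKOFF_SECONDS
    value = value.strip()
    if value.isdigit():
        # Delay given directly in seconds, e.g. "120".
        return float(value)
    try:
        # Otherwise try the HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return DEFAULT_BACKOFF_SECONDS
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def delay_before_retry(status: int, headers: Mapping[str, str]) -> float:
    """Seconds to wait before retrying; 0.0 if the server did not ask us to back off."""
    if status in (429, 503):
        # Assumes the caller passes header names in this exact casing.
        return retry_after_seconds(headers.get("Retry-After"))
    return 0.0
```

A crawler loop would call `delay_before_retry(response_status, response_headers)` after each request and sleep for that long before trying the same host again.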

7. blacksmith_tb ◴[] No.42165715[source]
There's a lot of talk of deregulation in the air; maybe we'll see Gibson-esque Black Ice, where rude crawlers provoke an automated DoS, a new Wild West.