
    Cloudflare.com's Robots.txt

    (www.cloudflare.com)
    145 points by sans_souse | 44 comments
    1. jsheard ◴[] No.42164090[source]
    This is what happens if your robot isn't nice

      > curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
      HTTP/2 403
    replies(1): >>42164220 #
    2. yapyap ◴[] No.42164094[source]
    That’s cool, if any scrapers still respect robots.txt, that is
    replies(4): >>42164168 #>>42165000 #>>42165017 #>>42165663 #
    3. bityard ◴[] No.42164168[source]
    Think of robots.txt as less of a no-trespassing sign and more of a "You can visit, but here are the rules to follow if you don't want to get shot" sign.
    replies(2): >>42165338 #>>42165715 #
    4. jamesog ◴[] No.42164220[source]
    That's not from robots.txt, but their Bot Management feature which blocks things calling themselves Googlebot that don't come from known Google IPs.
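    Google documents how to verify a real Googlebot: reverse-DNS the requesting IP, check the hostname falls under googlebot.com or google.com, then resolve that hostname forward to confirm it matches the IP. A rough sketch in Python (the IP is the example from Google's own docs):

      import socket

      ip = "66.249.66.1"  # example crawler IP from Google's documentation
      name = socket.gethostbyaddr(ip)[0]  # reverse DNS, e.g. crawl-66-249-66-1.googlebot.com
      is_googlebot = (name.endswith((".googlebot.com", ".google.com"))
                      and socket.gethostbyname(name) == ip)  # forward-confirm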
    replies(1): >>42164616 #
    5. palsecam ◴[] No.42164269[source]
    That’s a funny one!

    Anyone know of others like that?

    Here is mine: https://FreeSolitaire.win/robots.txt

    replies(2): >>42164327 #>>42165802 #
    6. CodesInChaos ◴[] No.42164291[source]
    What's the purpose of "User-Agent: DemandbaseWebsitePreview/0.1"? I couldn't find anything about that agent, but I assume it's somehow related to demandbase.com?

    But why are those two the only whitelisted entries? It's a bit surprising that Google and Bing are missing, but I assume they're whitelisted through a different mechanism (like a Google webmaster account)?

    replies(2): >>42164338 #>>42164695 #
    7. jsheard ◴[] No.42164327[source]
    Google used to have a /killer-robots.txt which forbade the T-1000 and T-800 from accessing Larry Page and Sergey Brin, but they took it down at some point.
    replies(1): >>42165463 #
    8. saddist0 ◴[] No.42164338[source]
    It's one of the services they use. Per the cookie policy page [1]:

    > DemandBase - Enables us to identify companies who intend to purchase our products and solutions and deliver more relevant messages and offers to our Website visitors.

    [1]: https://www.cloudflare.com/en-in/cookie-policy/

    9. op00to ◴[] No.42164446[source]
    If those robots could read, they'd be very upset.
    10. speedgoose ◴[] No.42164616{3}[source]
    Are GCP IPs considered Google IPs?
    replies(3): >>42164648 #>>42164657 #>>42165651 #
    11. crop_rotation ◴[] No.42164648{4}[source]
    No, I am very sure they are not.
    12. jgrahamc ◴[] No.42164657{4}[source]
    No.
    13. Maken ◴[] No.42164695[source]
    My guess is that the Twitter one is for generating link previews when you share a page on Twitter.
    14. orliesaurus ◴[] No.42164810[source]
    Has anyone worked on anything like this for AI scrapers?
    replies(3): >>42165005 #>>42165055 #>>42165872 #
    15. seanwilson ◴[] No.42164898[source]
    I have an ASCII art Easter egg like this in an SEO product I made. :)

    https://www.checkbot.io/robots.txt

    I should probably add this SEO tip too because the purpose of robots.txt is confusing: If you want to remove/deindex a page from Google search, you counterintuitively need to allow the page to be crawled in the robots.txt file, and then add a noindex response header or noindex meta tag to the page. This way the crawler gets to see the noindex instruction. Robots.txt controls which pages can be crawled, not which pages can be indexed.
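    For example, to deindex a page, leave it crawlable and serve either of these (standard mechanisms, shown on a hypothetical page):

      <!-- in the page's <head> -->
      <meta name="robots" content="noindex">

      # or as an HTTP response header
      X-Robots-Tag: noindex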

    replies(1): >>42165078 #
    16. m-app ◴[] No.42164972[source]
    What does “OUR TREE IS A REDWOOD” refer to? A quick search doesn’t yield any definite results.
    replies(1): >>42165185 #
    17. dartos ◴[] No.42165000[source]
    I was surprised any ever did, honestly
    18. dartos ◴[] No.42165005[source]
    A robots.txt that asks AI scrapers not to scrape?

    There are a couple of services that keep updated lists of known scraper user agents. A quick search reveals a handful.
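    As a rough sketch, a robots.txt group covering a few of the commonly listed AI crawler agents (the names churn, so pull them from a maintained list rather than hardcoding):

      User-agent: GPTBot
      User-agent: CCBot
      User-agent: Google-Extended
      Disallow: /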

    19. marginalia_nu ◴[] No.42165017[source]
    They may or may not, though respecting robots.txt is a nice way of not having your IP range end up on blacklists. With Cloudflare in particular, that can be a bit of a pain.

    They're pretty nice to deal with if you're upfront about what you are doing and clearly identify your bot, as well as register it with their bot detection. There's a form floating around somewhere for that.

    20. zorked ◴[] No.42165055[source]
    https://github.com/ai-robots-txt/ai.robots.txt/blob/main/rob...
    21. dazc ◴[] No.42165078[source]
    The consequences of robots.txt misuse can also be disastrous for a regular site. For example, I've seen instances where multiple warnings of 'page indexed but blocked by robots.txt' have led to sites being severely down-ranked as a consequence.

    My assumption being that search engines don't want to be listing too many pages that everyone can read but they cannot.

    22. dlevine ◴[] No.42165185[source]
    California’s state tree is the redwood, and that’s where their HQ is.
    replies(2): >>42165220 #>>42165448 #
    23. ccorcos ◴[] No.42165220{3}[source]
    The tree shape is fairly inaccurate though.
    24. chrisweekly ◴[] No.42165313[source]
    One nice thing about CF's robots.txt is its inclusion of a sitemap:

    https://www.cloudflare.com/sitemap.xml

    which contains links to educational materials like

    https://www.cloudflare.com/learning/ddos/layer-3-ddos-attack...

    Potentially interesting to see their flattened IA (information architecture)...

    replies(1): >>42165519 #
    25. iterance ◴[] No.42165338{3}[source]
    If you do not respect the sign I shall be very cross with you. Very cross indeed. Perhaps I shall have to glare at you, yes, very hard. I think I shall glare at you. Perhaps if you are truly irritating I shall be forced to remove you from the premises for a bit.
    26. ck2 ◴[] No.42165342[source]
    Easy guess that the length breaks some legacy stuff.

    But every robots.txt should have an auto-ban trap line, i.e. crawl it and die: a disallowed URL where any request triggers a script that puts the requesting IP into the firewall. Of course it's possible to abuse that, so it has to be monitored. Rough sketch below.
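    A minimal sketch of the idea, assuming a Flask app and iptables (the trap path is hypothetical, and in practice you'd allowlist known-good crawlers and rate-limit the bans):

      # robots.txt advertises a path no polite crawler should touch:
      #   User-agent: *
      #   Disallow: /trap/
      import subprocess
      from flask import Flask, request

      app = Flask(__name__)

      @app.route("/trap/")
      def trap():
          ip = request.remote_addr
          # Drop the offender at the firewall (needs root; monitor this).
          subprocess.run(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"])
          return "", 403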

    replies(2): >>42165349 #>>42166539 #
    27. okdood64 ◴[] No.42165349[source]
    How do you discern a crawler agent from a human? Is it as simple as the fact that they might cover something like 80%+ of the site in one visit fairly quickly?
    replies(1): >>42165697 #
    28. withinboredom ◴[] No.42165391[source]
    https://en.wikipedia.org/wiki/Roko%27s_basilisk ???
    29. m-app ◴[] No.42165448{3}[source]
    Right, that makes sense. But why would you mention your state’s tree anywhere, and why specifically in your robots.txt? Seems pretty random.
    replies(3): >>42165641 #>>42167363 #>>42175274 #
    30. shaunpud ◴[] No.42165463{3}[source]
    https://web.archive.org/web/20160530160330/https://www.googl...
    31. palsecam ◴[] No.42165519[source]
    Little-known fact: a syndication feed (RSS or Atom) can be used as a sitemap.

    Quoting https://www.sitemaps.org/protocol.html#otherformats:

    > The Sitemap protocol enables you to provide details about your pages to search engines, […] in addition to the XML protocol, we support RSS feeds and text files, which provide more limited information.

    > You can provide an RSS (Real Simple Syndication) 2.0 or Atom 0.3 or 1.0 feed. Generally, you would use this format only if your site already has a syndication feed.
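    So if your site already publishes a feed, the robots.txt Sitemap directive can simply point at it (hypothetical URL):

      Sitemap: https://example.com/feed.xml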

    32. judge2020 ◴[] No.42165641{4}[source]
    State pride I suppose.
    33. judge2020 ◴[] No.42165651{4}[source]
    For reference https://developers.google.com/search/docs/crawling-indexing/...
    34. andrethegiant ◴[] No.42165663[source]
    FWIW, that’s why I’m working on a platform[1] to help devs deploy polite crawlers and scrapers that respect robots.txt (and 429s, Retry-After response headers, etc.) out of the box. It also happens to be entirely built on Cloudflare.

    [1] https://crawlspace.dev
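    The politeness baseline is simple enough to sketch. A rough example with the requests library, honoring 429 and the seconds form of Retry-After (illustrative only, not Crawlspace's actual code):

      import time
      import requests

      def polite_get(url, tries=3):
          """Fetch url, backing off when the server says 429 Too Many Requests."""
          for _ in range(tries):
              resp = requests.get(url, headers={"User-Agent": "MyBot/1.0"})
              if resp.status_code != 429:
                  return resp
              # Honor Retry-After if given (seconds form), else wait a default.
              time.sleep(int(resp.headers.get("Retry-After", 10)))
          return resp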

    35. SoftTalker ◴[] No.42165697{3}[source]
    Crawlers/archivers will be hitting your site much faster than a human user.
    36. blacksmith_tb ◴[] No.42165715{3}[source]
    There's a lot of talk of deregulation in the air; maybe we'll see Gibson-esque Black Ice, where rude crawlers provoke an automated DoS. A new Wild West.
    37. segphault ◴[] No.42165802[source]
    Stripe has a humans.txt: https://stripe.com/humans.txt
    replies(1): >>42171443 #
    38. gnaman ◴[] No.42165872[source]
    https://llmstxt.org/ https://www.answer.ai/posts/2024-09-03-llmstxt.html
    39. johneth ◴[] No.42166539[source]
    I thought about doing something like that, but then I realised: what if someone linked to the trap URL from another site and a crawler followed that link to the trap?

    You might end up penalising Googlebot or Bingbot.

    If anyone knew what that trap URL did, and felt malicious, this could happen.

    replies(1): >>42171194 #
    40. NewJazz ◴[] No.42167363{4}[source]
    Have you seen a redwood? They can create quite the impression amongst people.
    41. CodesInChaos ◴[] No.42171194{3}[source]
    A crawler could easily avoid that by fetching the target domain's robots.txt before fetching the link target. However, a website could also embed the honeypot link in an <img> tag and get the user banned when their browser attempts to load the image.
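    On the robots.txt check: Python's standard library already handles it, for example (hypothetical bot name and URLs):

      from urllib.robotparser import RobotFileParser

      rp = RobotFileParser("https://example.com/robots.txt")
      rp.read()  # fetch and parse the target site's robots.txt
      if rp.can_fetch("MyBot", "https://example.com/trap/"):
          ...  # only fetch links that robots.txt actually allows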
    42. ttepasse ◴[] No.42171443{3}[source]
    humans.txt is a small scale initiative: https://humanstxt.org/
    43. SllX ◴[] No.42175274{4}[source]
    Redwoods are awesome.