> curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
HTTP/2 403
Does anyone know of others like that?
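(Cloudflare can tell because genuine Googlebot requests come from IPs that resolve back to googlebot.com via forward-confirmed reverse DNS, so a spoofed User-Agent from any other address is easy to reject. A rough sketch of the check, using a known Googlebot address; the exact hostname is illustrative:)

    host 66.249.66.1
    # 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
    host crawl-66-249-66-1.googlebot.com
    # crawl-66-249-66-1.googlebot.com has address 66.249.66.1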
Here is mine: https://FreeSolitaire.win/robots.txt
But why are it and Twitter the only whitelisted entries? Google and Bing being missing is a bit surprising, but I assume they're whitelisted through a different mechanism (like a Google webmaster account)?
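(For context, a whitelist-style robots.txt usually takes this shape; a hypothetical sketch, not the actual file:)

    # named bots may crawl everything; everyone else is blocked
    User-agent: Twitterbot
    Allow: /

    User-agent: *
    Disallow: /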
> DemandBase - Enables us to identify companies who intend to purchase our products and solutions and deliver more relevant messages and offers to our Website visitors.
https://www.checkbot.io/robots.txt
I should probably add this SEO tip too because the purpose of robots.txt is confusing: If you want to remove/deindex a page from Google search, you counterintuitively need to allow the page to be crawled in the robots.txt file, and then add a noindex response header or noindex meta tag to the page. This way the crawler gets to see the noindex instruction. Robots.txt controls which pages can be crawled, not which pages can be indexed.
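To illustrate, assuming a hypothetical /old-page you want deindexed (the meta tag and the header are alternatives; either one works):

    # robots.txt: leave the page crawlable, i.e. no Disallow rule covering it
    User-agent: *
    Disallow:

    <!-- option 1: noindex meta tag in the page's <head> -->
    <meta name="robots" content="noindex">

    # option 2: noindex HTTP response header
    X-Robots-Tag: noindex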
They're pretty nice to deal with if you're upfront about what you are doing and clearly identify your bot, as well as register it with their bot detection. There's a form floating around somewhere for that.
My assumption being that search engines don't want to be listing too many pages that everyone can read but they cannot.
https://www.cloudflare.com/sitemap.xml
which contains links to educational materials like
https://www.cloudflare.com/learning/ddos/layer-3-ddos-attack...
Potentially interesting to see their flattened IA (information architecture)...
but every robots.txt should have an auto-ban trap line
i.e. crawl it and die
basically a script that puts the requesting IP into the firewall (sketch below)
of course it's possible to abuse that so it has to be monitored
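A minimal sketch of such a trap, assuming a hypothetical /trap/ path, nginx-style access logs, and an ipset named "banned" that an iptables DROP rule references:

    # robots.txt: well-behaved crawlers will never request this
    User-agent: *
    Disallow: /trap/

    #!/bin/sh
    # one-time setup:
    #   ipset create banned hash:ip
    #   iptables -I INPUT -m set --match-set banned src -j DROP
    # ban every IP that has requested the trap URL
    grep '"GET /trap/' /var/log/nginx/access.log \
      | awk '{print $1}' | sort -u \
      | while read ip; do ipset add banned "$ip" -exist; done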
Quoting https://www.sitemaps.org/protocol.html#otherformats:
> The Sitemap protocol enables you to provide details about your pages to search engines, […] in addition to the XML protocol, we support RSS feeds and text files, which provide more limited information.
> You can provide an RSS (Real Simple Syndication) 2.0 or Atom 0.3 or 1.0 feed. Generally, you would use this format only if your site already has a syndication feed.
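(The text-file format is the simplest of those: just one absolute URL per line in a UTF-8 file, e.g. with placeholder URLs:)

    https://example.com/
    https://example.com/page-1
    https://example.com/page-2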
You might end up penalising Googlebot or Bingbot.
If anyone knew what that trap URL did and felt malicious, this could happen.