I understand that there are some more interactive RSS readers, but from personal experience it's more like "hey, I'm a good bot, let me in".
You simply shouldn't have any challenges whatsoever on an RSS feed. They're literally meant to be read by a machine.
It's not that hard. If the content being requested is RSS (or Atom, or some other syndication format intended for consumption by software), just don't do bot checks; use other mechanisms like rate limiting if you must stop abuse.
As an example: would you put a captcha on robots.txt as well?
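If "rate limiting if you must" sounds hand-wavy, a per-client token bucket is about all it takes. A minimal sketch, assuming made-up rate/burst numbers and keying on the raw client IP for simplicity:

```python
# Minimal per-client token-bucket sketch of "rate limit, don't challenge".
# The rate/burst numbers are made up and keying on the raw client IP is a
# simplification; the point is that a 429 is a saner failure mode for a feed
# fetcher than a browser challenge.
import time
from collections import defaultdict

RATE = 1.0    # tokens refilled per second
BURST = 30.0  # bucket capacity

_buckets = defaultdict(lambda: (BURST, time.monotonic()))  # ip -> (tokens, last_seen)

def allow_request(client_ip: str) -> bool:
    """Return True if this client may fetch the feed right now, else answer 429."""
    tokens, last = _buckets[client_ip]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)
    allowed = tokens >= 1.0
    _buckets[client_ip] = (tokens - 1.0 if allowed else tokens, now)
    return allowed
```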
As other stories here can attest, Cloudflare is slowly killing off independent publishing on the web through poor product management decisions and technology implementations, and the fix seems pretty simple.
The issue here is that Cloudflare's content-type check is naive. Either the fact that CF is checking the Content-Type header directly needs to be made more explicit, or they need to do an actual file-type check on the response body.
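For what a "file type check" could look like, a sketch that peeks at the XML root element instead of trusting the header. This is illustrative only, not Cloudflare's actual logic:

```python
# Sketch of a "file type check": identify a feed by its XML root element rather
# than trusting the Content-Type header. Illustrative only; not Cloudflare's
# actual logic.
import io
import xml.etree.ElementTree as ET

def looks_like_feed(body: bytes) -> bool:
    """True if the root element is <rss>, <feed> (Atom), or <rdf:RDF> (RSS 1.0)."""
    try:
        for _, elem in ET.iterparse(io.BytesIO(body), events=("start",)):
            local_name = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
            return local_name in {"rss", "feed", "RDF"}
    except ET.ParseError:
        return False
    return False
```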
From the feed reader's perspective it is a 403 response. For example, my reader has been trying to read https://blog.cloudflare.com/rss/ and the last successful response it got was on 2021-11-17. It has been backing off due to "errors" but it still checks every 1-2 weeks and gets a 403 every time.
This obviously isn't limited to the Cloudflare blog; I see it on many sites "protected by" (or in this case broken by) Cloudflare. I could tell you what public cloud IPs my reader comes from or which user-agent it uses, but that is beside the point. This is a URL which is clearly intended for bots, so it shouldn't be bot-blocked by default.
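If you want to check whether a particular feed is hitting this, a quick sketch with Python's requests. The cf-mitigated header and the "Just a moment..." page title are typical challenge markers rather than guarantees, and the User-Agent here is just a placeholder:

```python
# Quick check of whether a feed URL is being answered with a Cloudflare challenge
# instead of XML. cf-mitigated and "Just a moment..." are typical markers, not
# guarantees; the User-Agent string is a placeholder.
import requests  # pip install requests

def check_feed(url: str, user_agent: str = "ExampleFeedReader/1.0") -> None:
    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    challenged = (
        resp.status_code == 403
        and resp.headers.get("server", "").lower() == "cloudflare"
        and ("cf-mitigated" in resp.headers or "Just a moment" in resp.text)
    )
    print(resp.status_code, resp.headers.get("content-type"), "challenged:", challenged)

check_feed("https://blog.cloudflare.com/rss/")
```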
When people reach out to customer support, we tell them that this is a problem on the site's end and there isn't much we can do. They can try contacting the site owner, but this is most likely the default Cloudflare configuration causing problems the owner isn't aware of. I often recommend using a service like FeedBurner to proxy the request, as these services seem to be on the whitelist of Cloudflare and other scraping-prevention firewalls.
I think the main solution would be to detect intended-for-robots content and exclude it from scraping prevention by default (at least to a huge degree).
Another useful mechanism would be to allow these to be accessed when the target page is cacheable, as the cache will protect the origin from overload-type DoS attacks anyway. Some care needs to be taken to ensure that adding a ?bust={random} query parameter can't break through to the origin, but this would be a powerful tool for endpoints that need protection from overload but not against scraping (like RSS feeds). Unfortunately cache headers for feeds are far from universal, so this wouldn't fix all feeds on its own. (For example, the Cloudflare blog's feed doesn't set any caching headers and is labeled `cf-cache-status: DYNAMIC`.)
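On the origin side that mostly comes down to sending cache headers with the feed. A minimal Flask sketch, with arbitrary TTL values; the edge still has to be configured to ignore unknown query strings, or the ?bust={random} trick reaches the origin anyway:

```python
# Minimal origin-side sketch: serve the feed with cache headers so an edge cache
# can absorb polling traffic. TTLs are arbitrary; the edge also needs to ignore
# unknown query strings, or a ?bust={random} parameter will still hit the origin.
from flask import Flask, Response  # pip install flask

app = Flask(__name__)

FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel>
  <title>Example</title>
  <link>https://example.com/</link>
  <description>Example feed</description>
</channel></rss>"""

@app.route("/feed.xml")
def feed():
    return Response(
        FEED,
        mimetype="application/rss+xml",
        headers={"Cache-Control": "public, max-age=900, stale-while-revalidate=3600"},
    )

if __name__ == "__main__":
    app.run()
```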
There were compatibility issues with other content-type headers, at least in the past.
Ideally you could make it a simple switch in the config, something like: "Allow automated access on RSS endpoints".
I tried reaching out to Cloudflare with issues like this in the past. The response was dozens of employees hitting my LinkedIn page, yet no replies to basic, reproducible technical issues.
You need to fix this internally as it's a reputational problem now. Less screwing around using Salesforce as your private Twitter, more leadership in triage. Your devs obviously aren't motivated to fix this stuff independently and for whatever reason they keep breaking the web.
I'm not saying this to say it's a good thing; it isn't.
Here's something to consider though: why are we going after Cloudflare for this? Isn't the website operator far, far more at fault? They chose Cloudflare. They configure Cloudflare. They, in theory, publish an RSS feed, which is broken because of infrastructure decisions they made. You're going after Ryobi because you've got a leaky pipe. But beyond that: isn't this tool Cloudflare publishes doing exactly what the website operators intended it to do? It blocks non-human traffic. RSS clients are non-human traffic. Maybe the reason you don't want to go after the website operators is because you know you're in the wrong? Why can't these RSS clients detect when they encounter this situation and prompt the user with a captive portal to get past it?
As soon as a majority of sites use the correct types, clients can start requiring them for newly added feeds, which in turn will push webmasters to fix their feeds if they want them to work.
There will always be niche technologies and nascent standards, and we're taking Cloudflare to task today because if they continue to stomp on them, we get nowhere.
"Don't use Cloudflare" is an option, but we can demand both.
I mean that somewhat sarcastically; but there does come a point where the demands are unreasonable, the technology is dead. There are probably more people browsing with JavaScript disabled than using RSS feeds. There are probably more people browsing on Windows XP than using RSS feeds. Do I yell at you because your personal blog doesn't support IE6 anymore?
This is a matter between You and the Website Operators, period. Cloudflare has nothing to do with this. This article puts "Cloudflare" in the title because it's fun to hate on Cloudflare and it gets upvotes. Cloudflare is a tool. These website operators are using Cloudflare The Tool to block non-human access to their websites. RSS CLIENTS ARE NOT HUMAN. Let me repeat that: Cloudflare's bot detection is working entirely as intended here, because RSS Clients are Bots. Everything here is working as expected. Where change should be asked for is this: website operators should allow non-human actors past the Cloudflare bot detection firewall specifically for RSS feeds. They can FULLY DO THIS. Cloudflare has many, many knobs and buttons that Website Operators can tweak; one of those is e.g. a page rule to turn off bot detection for specific routes, such as `/feed.xml`.
If your favorite website is not doing this, it's NOT CLOUDFLARE'S FAULT.
Take it up with the Website Operators, Not Cloudflare. Or, build an RSS Client which supports a captive portal to do human authorization. God, this is so boring; y'all just love shaking your fist and yelling at big tech for LITERALLY no reason. I suspect it's actually because half of y'all are concerningly uneducated on what we're talking about.
As another commenter noted, not even CF's own RSS feed seems to get the content type right. This issue could clearly use some work.
'application/rss+xml' seems to be the best option, though, in my opinion. The '+xml' in the media type tells (good) parsers to fall back to using an XML parser if they don't understand the 'rss' part, while the 'rss' part provides more accurate information on the content's type for parsers that do understand RSS.
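To illustrate the fallback the '+xml' suffix enables, a rough sketch; parse_feed here is a hypothetical feed-aware parser, stubbed for the example:

```python
# Sketch of the fallback the "+xml" suffix enables: a client that doesn't know
# application/rss+xml specifically can still hand the body to a generic XML parser.
import xml.etree.ElementTree as ET

KNOWN_FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

def parse_feed(body: bytes):
    """Hypothetical feed-aware parser; stubbed here as plain XML parsing."""
    return ET.fromstring(body)

def parse_body(content_type: str, body: bytes):
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in KNOWN_FEED_TYPES:
        return parse_feed(body)            # recognised feed type: full feed handling
    if mime.endswith("+xml") or mime in {"text/xml", "application/xml"}:
        return ET.fromstring(body)         # generic XML fallback per the +xml suffix
    raise ValueError(f"not an XML media type: {mime}")
```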
All that said, it's a mess.
Since the user agent has no way to distinguish scripts injected by Cloudflare from scripts originating from the actual website, in order to pass the challenge it is forced to execute arbitrary code from an untrusted party. And malicious JavaScript is practically ubiquitous on the general internet.
Perhaps a solution would be for Cloudflare to ship default page rules that disable bot-blocking features for common RSS feed URLs? Or pop up a notice, with instructions on how to create these page rules, for users that appear to have RSS feeds on their website?
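As a rough idea of what "common RSS feed URLs" could mean by default; the path list is a guess at common blog/CMS conventions, not anything Cloudflare actually ships:

```python
# One shape a default "this URL is a feed" exclusion could take. The path list is
# a guess at common blog/CMS conventions, not anything Cloudflare actually ships.
import re

COMMON_FEED_PATHS = re.compile(
    r"^/(feed|rss|atom)(\.xml)?/?$"                      # /feed, /rss/, /atom.xml, ...
    r"|^/(index|feed|rss|atom)\.(xml|rss|atom|json)$",   # /index.xml, /feed.json, ...
    re.IGNORECASE,
)

def probably_a_feed_url(path: str) -> bool:
    return bool(COMMON_FEED_PATHS.match(path))
```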
[1] Here is Overcast’s owner raising the issue in 2022: https://x.com/OvercastFM/status/1578755654587940865