556 points campuscodi | 41 comments
1. jgrahamc ◴[] No.41867399[source]
My email is jgc@cloudflare.com. I'd like to hear from the owners of RSS readers directly on what they are experiencing. Going to ask team to take a closer look.
replies(7): >>41867476 #>>41867836 #>>41868190 #>>41868888 #>>41869258 #>>41869657 #>>41876633 #
2. viraptor ◴[] No.41867476[source]
It's cool and all that you're making an exception here, but how about including a "no, really, I'm actually a human" link on the block page instead of leaving the visitor with a puzzle: how do you report the issue to the page owner (hard enough for normies on its own) when you can't even load the page? This just externalises issues that belong to the Cloudflare service.
replies(3): >>41867521 #>>41867531 #>>41873429 #
3. methou ◴[] No.41867521[source]
Some clients are more like a bot/service: imagine Google Reader, which fetches and caches content for you. The client I'm currently using, Miniflux, works the same way.

I understand that there are some more interactive RSS readers, but from personal experience it's more like "hey, I'm a good bot, let me in"

replies(2): >>41867783 #>>41867984 #
4. jgrahamc ◴[] No.41867531[source]
I am not trying to "make an exception"; I'm asking for information external to Cloudflare so I can compare what people are experiencing with what our systems are doing and figure out what needs to improve.
replies(2): >>41867940 #>>41867994 #
5. _Algernon_ ◴[] No.41867783{3}[source]
An RSS reader is a user agent (i.e. software acting on behalf of its users). If you define RSS readers as bots (even good bots), you may as well call Firefox a bot: it also sends off web requests without the user explicitly approving each one.
replies(1): >>41867953 #
6. kalib_tweli ◴[] No.41867836[source]
There are email obfuscation and managed challenge script tags being injected into the RSS feed.

You simply shouldn't have any challenges whatsoever on an RSS feed. It's literally meant to be read by a machine.

replies(2): >>41868120 #>>41874073 #
7. robertlagrant ◴[] No.41867940{3}[source]
This is useful info: https://news.ycombinator.com/item?id=33675847
8. sofixa ◴[] No.41867953{4}[source]
Their point was that the RSS reader does the scraping on its own, in the background, without user input. If it can't read the page, it can't; the request isn't initiated by a user who can click an "I'm not a bot, I promise" button.
9. viraptor ◴[] No.41867984{3}[source]
It was a mental skip, but the same idea. It would be awesome if CF just allowed reporting issues at the point where something gets blocked, regardless of whether it's a human or a bot. They're missing an "I'm misclassified" button for the people actually affected, without the third-party runaround.
replies(1): >>41869745 #
10. PaulRobinson ◴[] No.41867994{3}[source]
Some "bots" are legitimate. RSS is intended for machine consumption. You should not be blocking content intended for machine consumption because a machine is attempting to consume it. You should not expect a machine, consuming content intended for a machine, to do some sort of step to show they aren't a machine, because they are in fact a machine. There is a lot of content on the internet that is not used by humans, and so checking that humans are using it is an aggressive anti-pattern that ruins experiences for millions of people.

It's not that hard. If the content being requested is RSS (or Atom, or some other syndication format intended for consumption by software), just don't do bot checks; use other mechanisms like rate limiting if you must stop abuse.
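
A minimal token-bucket sketch of the "rate limit instead of bot check" idea, in Python; the rate and burst numbers are made-up placeholders, not anything Cloudflare actually ships:

    import time
    from collections import defaultdict

    RATE = 1.0   # tokens refilled per second (hypothetical limit)
    BURST = 5.0  # maximum bucket size (hypothetical)

    buckets = defaultdict(lambda: {"tokens": BURST, "stamp": time.monotonic()})

    def allow(client_ip: str) -> bool:
        # Refill the client's bucket by elapsed time, then spend one token
        # per request. A feed reader that polls politely always gets through;
        # no proof-of-humanity step is ever needed.
        b = buckets[client_ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["stamp"]) * RATE)
        b["stamp"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False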

As an example: would you put a captcha on robots.txt as well?

As other stories here attest, Cloudflare is slowly killing off independent publishing on the web through poor product management decisions and technology implementations, and the fix seems pretty simple.

replies(1): >>41868866 #
11. kalib_tweli ◴[] No.41868120[source]
I confirmed that if you explicitly set the Content-Type response header to application/rss+xml it seems to work with Cloudflare Proxy enabled.

The issue here is that Cloudflare's content-type check is naive. The fact that CF checks the Content-Type header directly needs to be made more explicit, or they need to do a file-type check instead.
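
For anyone who wants to reproduce the workaround, a minimal stdlib-Python sketch of serving a feed with the type set explicitly (the path and feed body are placeholders):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    FEED = (b'<?xml version="1.0" encoding="UTF-8"?>'
            b'<rss version="2.0"><channel><title>Example</title></channel></rss>')

    class FeedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/rss":
                self.send_response(200)
                # The explicit subtype is the point: application/rss+xml,
                # not a generic text/xml or application/xml fallback.
                self.send_header("Content-Type", "application/rss+xml; charset=utf-8")
                self.send_header("Content-Length", str(len(FEED)))
                self.end_headers()
                self.wfile.write(FEED)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("", 8000), FeedHandler).serve_forever()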

replies(1): >>41868798 #
12. prmoustache ◴[] No.41868190[source]
It is not only RSS reader users who are affected. Any user with an extension that blocks trackers regularly gets denied access to websites or has to deal with tons of captchas.
replies(1): >>41872137 #
13. londons_explore ◴[] No.41868798{3}[source]
I wonder if popular software for generating RSS feeds might not be setting the correct Content-Type header? Maybe this whole issue could be mostly fixed by a few GitHub PRs...
replies(4): >>41869066 #>>41869112 #>>41869113 #>>41877322 #
14. jamespo ◴[] No.41868866{4}[source]
According to another post, if the content type is correct the request gets through. If that's the case, I don't see the problem.
replies(1): >>41873102 #
15. kevincox ◴[] No.41868888[source]
I'll mail you as well, but I think public discussion is helpful, especially since I have seen similar responses to this over the years and it feels very disingenuous. The problem is very clear (Cloudflare serves 403 blocks to feed readers for no reason) and you have all of the logs. The solution is maybe not trivial, but I fail to see how the perspective of someone seeing a 403 block is going to help much. This just starts to sound like a way to seem responsive without actually doing anything.

From the feed reader perspective it is a 403 response. For example my reader has been trying to read https://blog.cloudflare.com/rss/ and the last successful response it got was on 2021-11-17. It has been backing off due to "errors" but it still is checking every 1-2 weeks and gets a 403 every time.

This obviously isn't limited to the Cloudflare blog; I see it on many sites "protected by" (or in this case broken by) Cloudflare. I could tell you what public cloud IPs my reader comes from or which user agent it uses, but that is beside the point. This is a URL which is clearly intended for bots, so it shouldn't be bot-blocked by default.

When people reach out to our customer support, we tell them that this is a bug on the site's end and there isn't much we can do. They can try contacting the site owner, but this is most likely the default configuration of Cloudflare causing problems that the owner isn't aware of. I often recommend using a service like FeedBurner to proxy the request, as these services seem to be on the whitelist of Cloudflare and other scraping-prevention firewalls.

I think the main solution would be to detect intended-for-robots content and exclude it from scraping prevention by default (at least to a huge degree).

Another useful mechanism would be to allow these URLs to be accessed whenever the target page is cacheable, since the cache will protect the origin from overload-type DoS attacks anyway. Some care needs to be taken to ensure that adding a ?bust={random} query parameter can't break through to the origin, but this would be a powerful tool for endpoints that need protection from overload but not from scraping (like RSS feeds). Unfortunately cache headers for feeds are far from universal, so this wouldn't fix all feeds on its own. (For example, the Cloudflare blog's feed doesn't set any caching headers and is labeled as `cf-cache-status: DYNAMIC`.)
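
To illustrate the ?bust= concern, a toy sketch of normalizing a cache key by whitelisting query parameters (the whitelist is hypothetical, and this is not how Cloudflare's cache actually works):

    from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

    ALLOWED_FEED_PARAMS = {"page"}  # hypothetical whitelist

    def feed_cache_key(url: str) -> str:
        # Drop unrecognized query parameters (e.g. ?bust=12345) so random
        # strings can't punch through the cache to the origin.
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query)
                if k in ALLOWED_FEED_PARAMS]
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(kept), ""))

    assert feed_cache_key("https://example.com/rss/?bust=42") == "https://example.com/rss/"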

16. kalib_tweli ◴[] No.41869066{4}[source]
It wouldn't. It's the role of the HTTP server to set the correct content type header.
17. onli ◴[] No.41869113{4}[source]
Correct might be debatable here as well. My blog for example sets Content-Type to text/xml, which is not exactly wrong for an RSS feed (after all, it is text and XML) and IIRC was the default back then.

There were compatibility issues with other type headers, at least in the past.

replies(1): >>41869959 #
18. djbusby ◴[] No.41869112{4}[source]
The number of feeds with crap headers and other non-spec stuff going on, and the number of clients missing useful headers... ugh. It seems like it should be simple; maybe that's why there are so many naive implementations.
19. is_true ◴[] No.41869258[source]
Maybe when you detect URLs that return the RSS MIME type, notify the owner of the site/CF account that it might be a good idea to allow bots on those URLs.

Ideally you could make it a simple switch in the config, something like "Allow automated access on RSS endpoints".

20. badlibrarian ◴[] No.41869657[source]
Thank you for showing up here and being open to feedback. But I have to ask: shouldn't Cloudflare be running and reviewing reports to catch this before it became such a problem? It's three clicks in Tableau for anyone who cares, and clearly nobody does. And this isn't the first time something like this has slipped through the cracks.

I tried reaching out to Cloudflare with issues like this in the past. The response: dozens of employees hitting my LinkedIn page, yet no replies to basic, reproducible technical issues.

You need to fix this internally as it's a reputational problem now. Less screwing around using Salesforce as your private Twitter, more leadership in triage. Your devs obviously aren't motivated to fix this stuff independently and for whatever reason they keep breaking the web.

replies(1): >>41869841 #
21. fluidcruft ◴[] No.41869745{4}[source]
Unfortunately, I would expect that queue of reports to get flooded by bad faith actors.
replies(1): >>41872200 #
22. 015a ◴[] No.41869841[source]
The reality that HackerNews denizens need to accept, in this case and in general, is: RSS feeds are not popular. They aren't just unpopular in the way that, say, Peacock is unpopular relative to Netflix; they're truly unpopular, used regularly by a number of people who could fit in an American football stadium. There are younger software engineers at Cloudflare who have never heard the term "RSS" and have no notion of what it is. It will probably be dead technology in ten years.

I'm not saying this to say it's a good thing; it isn't.

Here's something to consider, though: why are we going after Cloudflare for this? Isn't the website operator far, far more at fault? They chose Cloudflare. They configure Cloudflare. They, in theory, publish an RSS feed, which is broken because of infrastructure decisions they made. You're going after Ryobi because you've got a leaky pipe. But beyond that: isn't this tool Cloudflare publishes doing exactly what the website operators intended it to do? It blocks non-human traffic, and RSS clients are non-human traffic. Maybe the reason you don't want to go after the website operators is because you know you're in the wrong? And why can't these RSS clients detect when they hit this situation and prompt the user with a captive portal to get past it?

replies(1): >>41870255 #
23. johneth ◴[] No.41869959{5}[source]
I think the current correct content types are:

'application/rss+xml' (for RSS)

'application/atom+xml' (for Atom)

replies(2): >>41870071 #>>41873080 #
24. londons_explore ◴[] No.41870071{6}[source]
Sounds like a kind Samaritan could write a scanner to find as many feeds as possible that look like RSS/Atom but don't have these content types, then go and patch the hosting software those feeds use to set the correct types, or ask the webmasters to fix it on home-made sites.

As soon as a majority of sites use the correct types, clients can start requiring it for newly added feeds, which in turn will make webmasters make it right if they want their feed to work.
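
A rough first pass at such a scanner, as a stdlib-Python sketch (the user-agent string and the sniffing heuristic are made up for illustration):

    import urllib.request

    EXPECTED = {"application/rss+xml", "application/atom+xml"}

    def check_feed(url: str) -> None:
        req = urllib.request.Request(url, headers={"User-Agent": "feed-type-check/0.1"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            declared = resp.headers.get_content_type()
            head = resp.read(2048)
        # Crude sniff: a feed's root element is <rss ...> or <feed ...>.
        looks_like_feed = b"<rss" in head or b"<feed" in head
        if looks_like_feed and declared not in EXPECTED:
            print(f"{url}: looks like a feed but is served as {declared}")

    check_feed("https://example.com/feed")  # any candidate URL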

25. badlibrarian ◴[] No.41870255{3}[source]
I'm old enough to remember Dave Winer taking Feedburner to task for inserting crap into RSS feeds that broke his code.

There will always be niche technologies and nascent standards and we're taking Cloudflare to task today because if they continue to stomp on them, we get nowhere.

"Don't use Cloudflare" is an option, but we can demand both.

replies(2): >>41871226 #>>41871910 #
26. gjsman-1000 ◴[] No.41871226{4}[source]
"Old man yells at cloud about how the young'ns don't appreciate RSS."

I mean that somewhat sarcastically, but there does come a point where the demands are unreasonable: the technology is dead. There are probably more people browsing with JavaScript disabled than using RSS feeds. There are probably more people browsing on Windows XP than using RSS feeds. Do I yell at you because your personal blog doesn't support IE6 anymore?

replies(1): >>41872081 #
27. 015a ◴[] No.41871910{4}[source]
I'm not backing down on this one: this is straight up an "old man yelling at the kids to get off his lawn" situation, and the fact that JGC from Cloudflare is in here saying "we'll take a look at this" is so far above and beyond what anyone could reasonably expect of them that they deserve praise and nothing else.

This is a matter between You and the Website Operators, period. Cloudflare has nothing to do with this. This article puts "Cloudflare" in the title because it's fun to hate on Cloudflare and it gets upvotes. Cloudflare is a tool. These website operators are using Cloudflare The Tool to block inhuman access to their websites. RSS CLIENTS ARE NOT HUMAN. Let me repeat that: Cloudflare's bot detection is working fully appropriately here, because RSS Clients are Bots. Everything here is working as expected. The place where change should be asked for is: website operators should allow inhuman actors past the Cloudflare bot-detection firewall specifically for RSS feeds. They can FULLY DO THIS. Cloudflare has many, many knobs and buttons that Website Operators can tweak; one of those is e.g. a page rule to turn off bot detection for specific routes, such as `/feed.xml`.

If your favorite website is not doing this, it's NOT CLOUDFLARE'S FAULT.

Take it up with the Website Operators, Not Cloudflare. Or build an RSS Client which supports a captive portal to do human authorization. God, this is so boring; y'all just love shaking your fist and yelling at big tech for LITERALLY no reason. I suspect it's actually because half of y'all are concerningly uneducated on what we're talking about.

replies(3): >>41872176 #>>41873230 #>>41875740 #
28. badlibrarian ◴[] No.41872081{5}[source]
Spotify and Apple Podcasts use RSS feeds to update what they show in their apps. And even if millions of people weren't dependent on it, suggesting that an infrastructure provider not fix a bug only makes the web worse.
29. ◴[] No.41872137[source]
30. badlibrarian ◴[] No.41872176{5}[source]
As part of proxying what may be as much as 20% of the web, Cloudflare injects code and modifies content that passes between clients and servers. It is in their core business interests to receive and act upon feedback regarding this functionality.
replies(1): >>41874505 #
31. viraptor ◴[] No.41872200{5}[source]
Sure, but now they say that queue should go to the website owner instead, who has less global visibility on the traffic. So that's just ignoring something they don't want to deal with.
32. onli ◴[] No.41873080{6}[source]
Not even Cloudflare's own blog uses those: https://blog.cloudflare.com/rss/. Or am I getting a wrong content type shown in my dev tools? For me it is `application/xml`. So even if `application/rss+xml` were the correct type by an official spec, it's not something to rely on if it's not used commonly.
replies(1): >>41873190 #
33. Scramblejams ◴[] No.41873102{5}[source]
It's a very common misconfiguration, though, because it happens by default when setting up CF. If your customers are, by default, configuring things incorrectly, then it's reasonable to ask if the service should surface the issue more proactively in an attempt to help customers get it right.

As another commenter noted, not even CF's own RSS feed seems to get the content type right. This issue could clearly use some work.

34. johneth ◴[] No.41873190{7}[source]
I just checked Wikipedia and it says Atom's is 'application/atom+xml' (also confirmed in the IANA registry), and RSS's is 'application/rss+xml' (but it's not registered yet, and 'text/xml' is also used widely).

'application/rss+xml' seems to be the best option though in my opinion. The '+xml' in the media type tells (good) parsers to fall back to using an XML parser if they don't understand the 'rss' part, but the 'rss' part provides more accurate information on the content's type for parsers that do understand RSS.

All that said, it's a mess.
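
The fallback rule described above (RFC 6839 structured-syntax suffixes) is simple enough to sketch:

    def can_parse_as_xml(media_type: str) -> bool:
        # RFC 6839: a "+xml" suffix (as in application/rss+xml) signals that
        # a generic XML processor is a valid fallback even when the consumer
        # doesn't understand the concrete subtype.
        subtype = media_type.lower().split(";")[0].split("/")[-1].strip()
        return subtype == "xml" or subtype.endswith("+xml")

    assert can_parse_as_xml("application/rss+xml")
    assert can_parse_as_xml("text/xml")
    assert not can_parse_as_xml("text/html")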

35. 627467 ◴[] No.41873230{5}[source]
What does Cloudflare do to search crawlers by default? Does it block them too?
36. doctor_radium ◴[] No.41873429[source]
I had a conversation with a website owner about this once. There apparently is such a feature: a way for sites to configure a "Please contact us here if you're having trouble reaching our site" page, usage of which I assume Cloudflare could track to gain better insight into these issues. The problem? It requires a Premium Plan.
37. o11c ◴[] No.41874073[source]
Even outside of RSS, the injected scripts often make internet security significantly worse.

Since the user agent has no way to distinguish scripts injected by Cloudflare from scripts originating from the actual website, users are forced to execute arbitrary code from an untrusted party in order to pass the challenge. And malicious JavaScript is practically ubiquitous on the general internet.

38. 015a ◴[] No.41874505{6}[source]
Sure: Let's begin by not starting the conversation with "Don't use Cloudflare", as you did. That's obviously not only unhelpful, but it clearly points the finger at the wrong party.
39. doctor_radium ◴[] No.41875740{5}[source]
I get what you're saying, and on a philosophical level you're probably right. If a website owner misconfigures their CDN to the point of impeding legitimate traffic, then they can fail like businesses do every day. Survival of the fittest. But with the majority of web users apparently running stock Chrome, on a practical level the web still has to work. I went looking for car parts a number of months ago and was blocked or accosted by firewalls over 50% of the time, and not all of those were Cloudflare-powered sites. There isn't enough time in the day to take every misconfigured site to task (unless you're Bowerick Wowbagger [1]), so I believe the solution will eventually have to be either an altruistic effort from Cloudflare or government regulation.

[1] https://www.wowbagger.com/chapter1.htm

40. quinncom ◴[] No.41876633[source]
Cloudflare-enabled websites have had this issue for years.[1] The problem is that website owners are not educated enough to understand that URLs meant for bots should not have Cloudflare's bot blocker enabled.

Perhaps a solution would be for Cloudflare to ship default page rules that disable bot-blocking features for common RSS feed URLs? Or pop up a notice, with instructions on how to create these page rules, for users who appear to have RSS feeds on their website?

[1] Here is Overcast’s owner raising the issue in 2022: https://x.com/OvercastFM/status/1578755654587940865

41. Klonoar ◴[] No.41877322{4}[source]
Quite a few feeds out there use the incorrect type text/xml, since it works slightly better in browsers by not prompting a download.

Would not surprise me if Cloudflare lumps this in with text/html protections.