jgrahamc No.41867399
My email is jgc@cloudflare.com. I'd like to hear from the owners of RSS readers directly on what they are experiencing. Going to ask the team to take a closer look.
kevincox No.41868888
I'll mail you as well, but I think public discussion is helpful, especially since I have seen similar responses to this over the years and it feels very disingenuous. The problem is very clear (Cloudflare serves 403 blocks to feed readers for no reason), and you have all of the logs. The solution is maybe not trivial, but I fail to see how the perspective of someone seeing a 403 block is going to help much. This just starts to sound like a way to seem responsive without actually doing anything.

From the feed reader perspective it is a 403 response. For example, my reader has been trying to read https://blog.cloudflare.com/rss/ and the last successful response it got was on 2021-11-17. It has been backing off due to "errors" but it is still checking every 1-2 weeks and gets a 403 every time.
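To make the failure mode concrete, here is a minimal sketch (not my reader's actual code) of the polling loop a typical feed reader runs. The intervals and User-Agent string are hypothetical:

    import time
    import urllib.request
    import urllib.error

    FEED_URL = "https://blog.cloudflare.com/rss/"

    # Hypothetical backoff bounds; the real reader's schedule differs.
    MIN_INTERVAL = 60 * 60          # 1 hour between polls for a healthy feed
    MAX_INTERVAL = 14 * 24 * 3600   # cap around two weeks after repeated errors

    def poll_forever(url: str) -> None:
        interval = MIN_INTERVAL
        while True:
            req = urllib.request.Request(
                url, headers={"User-Agent": "ExampleFeedReader/1.0"})
            try:
                with urllib.request.urlopen(req, timeout=30) as resp:
                    body = resp.read()
                    # ... parse and store the feed here ...
                    interval = MIN_INTERVAL  # success: reset to normal cadence
            except urllib.error.HTTPError:
                # A 403 from a bot-protection layer looks exactly like any
                # other server error, so the reader backs off exponentially.
                interval = min(interval * 2, MAX_INTERVAL)
            time.sleep(interval)

From the reader's side, a 403 from a bot-protection layer is indistinguishable from a genuinely broken feed, so the only reasonable response is to back off and keep retrying.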

This obviously isn't limited to the Cloudflare blog; I see it on many sites "protected by" (or in this case broken by) Cloudflare. I could tell you which public cloud IPs my reader comes from or which user-agent it uses, but that is beside the point. This is a URL which is clearly intended for bots, so it shouldn't be bot-blocked by default.

When people reach out to our customer support, we tell them that this is a bug on the site's end and there isn't much we can do. They can try contacting the site owner, but this is most likely the default configuration of Cloudflare causing problems that the owner isn't aware of. I often recommend using a service like FeedBurner to proxy the request, as these services seem to be on the whitelist of Cloudflare and other scraping-prevention firewalls.

I think the main solution would be to detect intended-for-robots content and exclude it from scraping prevention by default (at least to a large degree).
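As a rough illustration of what "intended for robots" detection could look like (the path hints and MIME types below are my guesses, not anything Cloudflare actually ships):

    # Hypothetical heuristic for machine-readable content that should skip
    # bot challenges: feeds, sitemaps, robots.txt, etc.
    FEED_PATH_HINTS = ("/rss", "/feed", "/atom", ".rss", ".atom",
                       "/robots.txt", "/sitemap.xml")
    FEED_MIME_TYPES = {
        "application/rss+xml",
        "application/atom+xml",
        "application/feed+json",
        "text/xml",
    }

    def is_robot_facing(path: str, content_type: str = "") -> bool:
        """Return True if the URL looks like content meant for machines."""
        p = path.lower()
        if any(hint in p for hint in FEED_PATH_HINTS):
            return True
        if content_type:
            mime = content_type.split(";")[0].strip().lower()
            return mime in FEED_MIME_TYPES
        return False

The point isn't the exact list; it's that a handful of path and Content-Type checks would catch the vast majority of feeds before a bot challenge ever fires.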

Another useful mechanism would be to allow these to be accessed when the target page is cacheable, since the cache will protect the origin from overload-type DoS attacks anyway. Some care needs to be taken to ensure that adding a ?bust={random} query parameter can't break through to the origin, but this would be a powerful tool for endpoints that need protection from overload but not from scraping (like RSS feeds). Unfortunately cache headers for feeds are far from universal, so this wouldn't fix all feeds on its own. (For example, the Cloudflare blog's feed doesn't set any caching headers and is labeled `cf-cache-status: DYNAMIC`.)
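A sketch of the cache-key idea, assuming an allow-list of query parameters the endpoint actually understands (the parameter names here are hypothetical):

    from urllib.parse import urlsplit, parse_qsl, urlencode

    # Only parameters the feed endpoint actually understands participate in
    # the cache key; everything else (e.g. ?bust=123) is dropped so it
    # cannot be used to punch through to the origin.
    ALLOWED_PARAMS = {"page", "category"}

    def cache_key(url: str) -> str:
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS]
        kept.sort()  # normalize parameter order
        query = urlencode(kept)
        return (f"{parts.scheme}://{parts.netloc}{parts.path}"
                + (f"?{query}" if query else ""))

With something like this, https://example.com/rss/?bust=12345 and https://example.com/rss/ map to the same cache entry, so random cache-busters keep hitting the cache instead of the origin.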