Cloudflare's Browser Integrity Check/Verification/Challenge feature, used by many websites, is denying access to users of non-mainstream browsers like Pale Moon.
User reports began on January 31:
https://forum.palemoon.org/viewtopic.php?f=3&t=32045
This situation occurs at least once a year, and there is no easy way to contact Cloudflare. Their "Submit feedback" tool yields no results. A Cloudflare Community topic was flagged as "spam" by members of that community and promptly locked, with no real solution and no official response from Cloudflare:
https://community.cloudflare.com/t/access-denied-to-pale-moo...
Partial list of other browsers that are being denied access:
Falkon, SeaMonkey, IceCat, Basilisk.
A 2022 Hacker News post about the same issue brought attention to it and got Cloudflare to patch it quickly:
https://news.ycombinator.com/item?id=31317886
A Cloudflare product manager declared back then: "...we do not want to be in the business of saying one browser is more legitimate than another."
As of now, there is no official response from Cloudflare. Internet access is still denied by their tool.
What are you protecting, Cloudflare?
Also, they show those captchas when going to robots.txt... unbelievable.
It's either that or keep sending data back to the Meta and Co. overlords, despite me not being a Facebook, Instagram, or WhatsApp user...
It's also a pretty safe assumption that Cloudflare is not run by morons, and they have access to more data than we do, by virtue of being the strip club bouncer for half the Internet.
This hostility to normal browsing behavior makes me extremely reluctant to ever use Cloudflare on any projects.
Absolutely true. But the programmers of these bots are lazy and often don't. So if Cloudflare has access to other data that can positively identify bots, and there is a high correlation with a particular user agent, well then it's a good first-pass indication despite collateral damage from false positives.
If you really do have a better way to keep all legitimate users of sites happy with bot protections, then by all means pursue it; there is a massive market for this. Unfortunately, you're probably more like me: stuck between a rock and a hard place, with no good solution and just annoyance at the way things are.
They do not - not definitively [1]. This cat-and-mouse game is stochastic at higher levels, with bots doing their best to blend in with regular traffic, and the defense trying to pick up signals barely above the noise floor. There are diminishing returns to battling bots that are indistinguishable from regular users.
1. A few weeks ago, the HN frontpage had a browser-based project that claimed to be undetectable
Turnstile is the in-page CAPTCHA option, which, you're right, does affect page load. But they defer the loading of that JS as best they can.
Also, Turnstile is a proof-of-work check, meant to slow down and verify would-be attack vectors. Turnstile should only be used on things like login, email change, "place order", etc.
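To make that concrete, here's a minimal sketch of gating a sensitive action on a Turnstile token server-side. It assumes Python and the public siteverify endpoint; the secret key, the handler pseudocode in the trailing comment, and the timeout are placeholders rather than anything prescribed beyond the endpoint itself.

  # Sketch: verify a Turnstile token before acting on a sensitive request
  # (login, email change, "place order", etc.). The secret key is a placeholder.
  import json
  import urllib.parse
  import urllib.request

  SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"
  TURNSTILE_SECRET = "your-secret-key"  # placeholder

  def turnstile_token_is_valid(token: str, remote_ip: str = "") -> bool:
      """Ask the siteverify endpoint whether the widget-issued token is genuine."""
      payload = {"secret": TURNSTILE_SECRET, "response": token}
      if remote_ip:
          payload["remoteip"] = remote_ip
      data = urllib.parse.urlencode(payload).encode()
      with urllib.request.urlopen(SITEVERIFY_URL, data=data, timeout=5) as resp:
          result = json.load(resp)
      return bool(result.get("success"))

  # In a login handler you would read the widget's response field
  # (by default a hidden input named "cf-turnstile-response") and reject
  # the request when turnstile_token_is_valid() returns False.

The point being that the check sits on the one or two endpoints that matter, not on every page view.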
For example, even Cloudflare hasn't configured their official blog's RSS feed properly. My feed reader (running in a DigitalOcean datacenter) hasn't been able to access it since 2021 (403 every time, even after backing off to checking weekly). This is a cacheable endpoint with public data intended for robots. If they can't configure their own product correctly for their official blog, how can they expect other sites to?
A cheeky response is "their profit margins", but I don't think that's quite right considering that their earnings per share is $-0.28.
I've not looked into Cloudflare much, since I've never needed their services, so I'm not totally sure what all their revenue streams are. I have heard that small websites are not paying much, if anything at all [1]. With that preface out of the way: I think that we see challenges on sites that perhaps don't need them as a form of advertising, to ensure that their name is ever-present. Maybe they don't need this form of advertising, or maybe they do.
I'd presumed it was just the VM they're heuristically detecting, but it sounds like some are experiencing issues on Linux in general.
If you are writing some kind of malicious crawler that doesn't care about rate limiting and wants to scan as many sites as possible to build a list of the most vulnerable ones to hack, you will read robots.txt, because that is the file that tells robots NOT to index certain pages. I never use robots.txt for any kind of security through obscurity. I've only ever bothered with robots.txt to make SEO easier: when you control a virtual subdirectory of a site, to block things like repeated content with alternative layouts (to avoid duplicate-content issues), or to get a discontinued section of a website to drop out of SERPs.
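To illustrate, here's a toy sketch (the domain and paths are invented) of how robots.txt only expresses crawling preferences: a polite crawler checks it, for instance with Python's urllib.robotparser, while a hostile scanner simply reads the Disallow lines as a map of what you'd rather it didn't look at.

  # Toy example: robots.txt is advice for polite crawlers, not access control.
  # The rules and URLs below are made up for illustration.
  import urllib.robotparser

  SAMPLE_ROBOTS_TXT = """\
  User-agent: *
  Disallow: /print/          # alternate layout of existing pages (duplicate content)
  Disallow: /discontinued/   # section we want dropped from SERPs
  """

  parser = urllib.robotparser.RobotFileParser()
  parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

  # A well-behaved crawler asks before fetching...
  print(parser.can_fetch("*", "https://example.com/print/article-42"))  # False
  print(parser.can_fetch("*", "https://example.com/article-42"))        # True
  # ...while a hostile scanner just treats the Disallow lines as a hint list.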
This is not relevant, because Cloudflare will cache it so it never hits your origin - unless they are adding random URL parameters (which you can teach Cloudflare to ignore, though I don't think that should be a default configuration).
Again, I think you are correct that saner defaults would help, but I don't know if you've ever dealt with a network admin or web administrator who hasn't dealt with server-side caching vs. browser caching; it would most definitely end up with Cloudflare losing sales because people misunderstood how things work. Maybe I'm jaded at 45, but I feel like most people don't even know to look at headers when they think they've hit a caching issue. I don't think it's based on age; I think it's based on being interested in the technology and wanting to learn all about it - mostly developers who got into it for the love of technology, versus those who got into it because it was high-paying and they understood Excel, or built a simple website early in life and were told to get into software.
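For what it's worth, "look at the headers" is nearly a one-liner; a quick sketch like this (the URL is a placeholder) prints the fields that matter when you're trying to separate CDN caching from browser caching, including the CF-Cache-Status header Cloudflare adds to responses it handles.

  # Quick header check for debugging a caching problem. The URL is a placeholder;
  # Cache-Control/Age/Expires/Vary are standard HTTP, CF-Cache-Status is the
  # header Cloudflare adds (HIT, MISS, DYNAMIC, BYPASS, ...).
  import urllib.request

  req = urllib.request.Request("https://example.com/feed.xml",
                               headers={"User-Agent": "cache-debug/1.0"})
  with urllib.request.urlopen(req, timeout=10) as resp:
      for name in ("Cache-Control", "Age", "Expires", "Vary", "CF-Cache-Status"):
          print(f"{name}: {resp.headers.get(name)}")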
Half these imbeciles don't even change the user-agent from the scraper they downloaded off GitHub.
I employ lots of filtering so it's possible the data is skewed towards those that sneak through the sieve - but they've already been caught, so it's meaningless.
I scrape hundreds of Cloudflare-protected sites every 15 minutes without ever having any issues, using a simple headless browser and a mobile connection, while real users get interstitial pages.
It's almost like Cloudflare is deliberately showing the challenge to real users just to show that they exist and are doing "something".
So it's OK for them to do shitty things without explaining themselves because they "have access to more data than we do"? Big companies can be mysterious and non-transparent because they're big?
What a take!
Also, Turnstile is definitely not a simple proof-of-work check; it performs browser fingerprinting and checks for web APIs. You can easily check this by changing your browser's user agent at the header level while leaving the JavaScript-visible value (navigator.userAgent) as-is; this puts Turnstile into an infinite loop.
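As a rough illustration of why a header-only spoof is easy to flag (my own sketch, not Cloudflare's code): the User-Agent sent as an HTTP header should agree with what in-page JavaScript reads from navigator.userAgent, and a challenge script can report the latter back for comparison.

  # Sketch of a consistency check between the HTTP User-Agent header and the
  # value the page's JavaScript reported from navigator.userAgent.
  # A header-only spoof breaks this invariant; both example strings are made up.

  def user_agent_consistent(header_ua: str, js_reported_ua: str) -> bool:
      return header_ua.strip() == js_reported_ua.strip()

  PALE_MOON_UA = "Mozilla/5.0 (X11; Linux x86_64) Goanna/20250131 PaleMoon/33.5"
  SPOOFED_CHROME_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/121.0"

  print(user_agent_consistent(PALE_MOON_UA, PALE_MOON_UA))       # True: nothing changed
  print(user_agent_consistent(SPOOFED_CHROME_UA, PALE_MOON_UA))  # False: header was rewritten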
Looks like there’s a plugin for that https://chromewebstore.google.com/detail/user-agent-switcher...
This approach clearly blocks bots, so it's not enough to say "just don't ever do things which have false positives", and it's a bit silly to say "just don't ever do the things which have false positives, but for my specific false positives only - leave the other methods alone, please!"
Somehow, Safari passes it the first time. WTF?
"Google is adding code to Chrome that will send tamper-proof information about your operating system and other software, and share it with websites. Google says this will reduce ad fraud. In practice, it reduces your control over your own computer, and is likely to mean that some websites will block access for everyone who's not using an "approved" operating system and browser."
https://www.eff.org/deeplinks/2023/08/your-computer-should-s...