    1343 points Hold-And-Modify | 24 comments

    Hello.

    Cloudflare's Browser Integrity Check/Verification/Challenge feature, used by many websites, is denying access to users of non-mainstream browsers like Pale Moon.

    User reports began on January 31:

    https://forum.palemoon.org/viewtopic.php?f=3&t=32045

    This situation occurs at least once a year, and there is no easy way to contact Cloudflare. Their "Submit feedback" tool yields no results. A Cloudflare Community topic was flagged as "spam" by members of that community and was promptly locked with no real solution, and no official response from Cloudflare:

    https://community.cloudflare.com/t/access-denied-to-pale-moo...

    Partial list of other browsers that are being denied access:

    Falkon, SeaMonkey, IceCat, Basilisk.

    Hacker News post from 2022 about the same issue, which brought attention and had Cloudflare quickly patching it:

    https://news.ycombinator.com/item?id=31317886

    A Cloudflare product manager declared back then: "...we do not want to be in the business of saying one browser is more legitimate than another."

    As of now, there is no official response from Cloudflare. Internet access is still denied by their tool.

    windsignaling ◴[] No.42955454[source]
    As a website owner and VPN user I see both sides of this.

    On one hand, I get the annoying "Verify" box every time I use ChatGPT (and now, due to its popularity, DeepSeek as well).

    On the other hand, without Cloudflare I'd be seeing thousands of junk requests and hacking attempts every day, people attempting credit card fraud, etc.

    I honestly don't know what the solution is.

    replies(15): >>42955722 #>>42955733 #>>42956022 #>>42956059 #>>42956088 #>>42956502 #>>42957016 #>>42957235 #>>42959074 #>>42959436 #>>42959515 #>>42959590 #>>42963545 #>>42963562 #>>42966987 #
    1. rozap ◴[] No.42956059[source]
    What is a "junk" request? Is it hammering an expensive endpoint 5000 times per second, or just somebody using your website in a way you don't like? I've also been on both sides of it (on-call at 3am getting dos'd is no fun), but I think the danger here is that we've gotten to a point where a new google can't realistically be created.

    The thing is that these tools are generally used to further entrench power that monopolies, duopolies, and cartels already have. Example: I've built an app that compares grocery prices as you make a shopping list, and you would not believe the lengths grocers go to to make price comparison difficult. This thing doesn't make thousands or even hundreds of requests - maybe a few dozen over the course of a day. What I thought would be a quick little project has turned out to be wildly adversarial. But now spite-driven development is a factor, so I will press on.

    It will always be a cat and mouse game, but we're at a point where the cat has a 46 billion dollar market cap and handles a huge portion of traffic on the internet.

    replies(6): >>42956167 #>>42956187 #>>42957017 #>>42957174 #>>42957266 #>>42964848 #
    2. makeitdouble ◴[] No.42956167[source]
    > somebody using your website in a way you don't like?

    This usually includes people making a near-realtime, perfectly updated copy of your site and serving that copy for scams, for middle-manning transactions, or for straight fraud.

    Having a clear category of "good bots" from verified or accepted companies would help in these cases. Cloudflare has such a system, I think, but then a new search engine would have to go to each and every platform provider to make deals, and that also sounds impossible.

    replies(1): >>42960779 #
    3. jeroenhd ◴[] No.42956187[source]
    I've had such bots on my server: some Chinese Huawei bot as well as an American one.

    They ignored robots.txt (claimed not to, but I blacklisted them there and they didn't stop) and started randomly generating image paths. At some point /img/123.png became /img/123.png?a=123 or whatever, and they just kept adding parameters and subpaths for no good reason. Nginx dutifully ignored the extra parameters and kept sending the same image files over and over again, wasting everyone's time and bandwidth.

    I was able to block these bots by just blocking the entire IP range at the firewall level (for Huawei I had to block all of China Telecom and later a huge range owned by Tencent for similar reasons).
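
    For reference, the same kind of range block can also be expressed above the firewall as a simple containment check at the application layer; a minimal sketch, with documentation-prefix placeholders standing in for the actual provider ranges:

        import ipaddress

        # Placeholder CIDRs (TEST-NET ranges), not the real networks mentioned above;
        # a real deny list would use the ranges actually published for those providers.
        BLOCKED_NETWORKS = [
            ipaddress.ip_network("203.0.113.0/24"),
            ipaddress.ip_network("198.51.100.0/24"),
        ]

        def is_blocked(ip: str) -> bool:
            """True if the client address falls inside any blocked range."""
            addr = ipaddress.ip_address(ip)
            return any(addr in net for net in BLOCKED_NETWORKS)

        print(is_blocked("203.0.113.42"))  # True
        print(is_blocked("192.0.2.1"))     # False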

    I have lost all faith in scrapers. I've written my own scrapers too, but almost all of the scrapers I've come across are nefarious. Some scour the internet searching for personal data to sell, some look for websites to throw hack attempts at in order to brute-force bug bounty programs, others are just scraping for more AI content. Until the scraping industry starts behaving, I can't feel bad for people blocking these things even if they hurt small search engines.

    replies(3): >>42956660 #>>42960711 #>>42961964 #
    4. x3haloed ◴[] No.42956660[source]
    Honestly, it should just come down to rate limiting and what you’re willing to serve and to whom. If you’re a free information idealist like me, I’m OK with bots accessing public web-serving servers, but not OK with allowing them to consume all my bandwidth and compute cycles. Furthermore, I’m also not OK with legitimate users consuming all my resources. So I should employ strategies that prevent individual clients or groups of clients from endlessly submitting requests, whether those requests make sense or are “junk.”
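
    As an illustration of that kind of per-client budget, here is a minimal in-process sliding-window sketch keyed by IP; the window size and limit are arbitrary assumptions, and a real deployment would more likely do this in the reverse proxy or a shared store such as Redis:

        import time
        from collections import defaultdict, deque

        WINDOW_SECONDS = 60   # assumed window
        MAX_REQUESTS = 60     # assumed per-client budget within the window

        _hits = defaultdict(deque)  # client key (e.g. IP) -> recent request timestamps

        def allow_request(client_ip: str) -> bool:
            """Return True if the client is under budget, False if it should get a 429."""
            now = time.monotonic()
            q = _hits[client_ip]
            while q and now - q[0] > WINDOW_SECONDS:
                q.popleft()             # drop timestamps outside the sliding window
            if len(q) >= MAX_REQUESTS:
                return False            # over budget: throttle instead of serving
            q.append(now)
            return True

        # Demo: the 61st request inside one window is refused.
        results = [allow_request("203.0.113.7") for _ in range(61)]
        print(results[-1])  # False
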
    replies(1): >>42956823 #
    5. makeitdouble ◴[] No.42956823{3}[source]
    Rate limiting doesn't help if the requests are split under hundreds of sessions. Especially if your account creation process was also bot friendly.

    Fundamentally it's adversarial, so expecting a single simple concept to properly cover even half of the problematic requests is unrealistic.

    replies(3): >>42959582 #>>42959598 #>>42959697 #
    6. to11mtm ◴[] No.42957017[source]
    I'll give a fun example from the past.

    I used to work at a company that did auto inspections (e.g. if you turned a lease in, did a trade-in on a used car, private party, etc.).

    Because of that, we had a server that contained 'condition reports', as well as the images that went through those condition reports.

    Mind you, sometimes condition reports had to be revised. Maybe a photo was bad, maybe the photos were in the wrong order, etc.

    It was a perfect storm:

    - The image caching was all in-memory

    - If an image didn't exist, the server would error with a 500

    - IIS was set up such that too many errors caused a recycle

    - Some scraper was working off a dataset (that ironically was 'corrected' in an hour or so) but contained an image that did not exist.

    - The scraper, instead of eventually 'moving on' would keep retrying the URL.

    It was the only time that org had an 'anyone who thinks they can help solve please attend' meeting at the IT level.

    > and you would not believe the extent that grocers go to to make price comparison difficult. This thing doesn't make thousands or even hundreds of requests - maybe a few dozen over the course of a day.

    Very true. I'm reminded of Oren Eini's tale of building an app to compare grocery prices in Israel, where the government apparently mandated supermarket chains to publish prices [0]. On top of even the government mandate for data sharing appearing to hit the wrong over/under on formatting, there's the constant issue of 'incomparabilities'.

    And it's weird, because it immediately triggered memories of how, 20-ish years ago, one of the most accessible Best Buys was across the street from a Circuit City, but good luck price matching, because the stores all happened to sell barely different laptops/desktops (e.g. up the storage but use a lower-grade CPU) so that nobody really had to price match.

    [0] - https://ayende.com/blog/170978/the-business-process-of-compa...

    replies(1): >>42964421 #
    7. ohcmon ◴[] No.42957174[source]
    Actually, I think creating a Google alternative has never been as doable as it is today.
    8. OptionOfT ◴[] No.42957266[source]
    > and you would not believe the extent that grocers go to to make price comparison difficult. This thing doesn't make thousands or even hundreds of requests - maybe a few dozen over the course of a day.

    It's gonna get even worse. Walmart & Kroger are implementing digital price tags, so whatever you see on the website will probably (purposefully?) be out of date by the time you get to the store.

    Stores don't want you to compare.

    replies(2): >>42957609 #>>42959323 #
    9. rozap ◴[] No.42957609[source]
    Originally I was excited to see that Kroger had an API, until just about the first thing the ToS said was "you can't use this for price comparison".

    And yea, I imagine dynamic pricing will make things even more complicated.

    That being said, that's why this feature isn't built into the billion shopping list apps that are out there. Because it's a pain.

    replies(1): >>42959726 #
    10. _blk ◴[] No.42959323[source]
    So you put something in your cart and by the time you reach the cashier the price has doubled? Sounds like someone is about to patent price locking when you add an item to your physical shopping cart.
    11. amatecha ◴[] No.42959582{4}[source]
    Rate limiting could help when an automated process is scanning arbitrary, generated URLs, inevitably generating a shitton of 404 errors -- something your rate limiting logic can easily check for (depending on server/proxy software, of course). Normal users or even normal bots won't generate excessive 404s in a short time frame, so that's potentially a pretty simple metric by which to apply a rate limit. Just an idea though, I've not done that myself...
    replies(1): >>42960013 #
    12. ◴[] No.42959598{4}[source]
    13. ghxst ◴[] No.42959697{4}[source]
    Rate limiting based on IP, blocking obvious datacenter ASNs, and blocking identifiable JA3 fingerprints are all quite simple, surprisingly effective in stopping most scrapers, and can be done entirely server side; I wouldn't be surprised if this catches more than half of problematic requests to the average website. But I agree that if you have a website "worth" scraping, there will probably be some individuals motivated enough to bypass those restrictions.
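
    The ASN part can be as simple as looking the client IP up in a local ASN database before deciding whether to challenge it. A rough sketch, assuming the maxminddb package and a downloaded GeoLite2-ASN database, with a made-up deny list of a few well-known hosting ASNs:

        import maxminddb

        # Assumed: GeoLite2-ASN.mmdb downloaded from MaxMind and stored locally.
        READER = maxminddb.open_database("GeoLite2-ASN.mmdb")

        # Illustrative deny list (AWS, Amazon, Google, Microsoft, DigitalOcean);
        # tune to your own traffic, and remember this also hits VPNs hosted there.
        DATACENTER_ASNS = {16509, 14618, 15169, 8075, 14061}

        def looks_like_datacenter(ip: str) -> bool:
            record = READER.get(ip) or {}
            return record.get("autonomous_system_number") in DATACENTER_ASNS

        if looks_like_datacenter("203.0.113.7"):
            print("serve a challenge or a 403 instead of the real page")
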
    replies(1): >>42960252 #
    14. unethical_ban ◴[] No.42959726{3}[source]
    Price comparison should be required by law. In fact, I think it would be interesting for a city to require its major grocers to feed pricing information to a public database.
    15. ku1ik ◴[] No.42960013{5}[source]
    I did that and it works great.

    Specifically, I use fail2ban to count the 404s and ban the IP temporarily when a certain threshold is exceeded in a given time frame. Every time I check fail2ban stats it has hundreds of IPs blocked.
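
    A minimal sketch of what such a jail can look like, assuming the default nginx combined log format; the thresholds are illustrative, not necessarily the parent's actual values:

        # /etc/fail2ban/filter.d/nginx-404.conf -- one 404 line counts as one failure
        [Definition]
        failregex = ^<HOST> -.*"(GET|POST|HEAD)[^"]*" 404

        # /etc/fail2ban/jail.local -- ban for an hour after 30 404s within 60 seconds
        [nginx-404]
        enabled  = true
        port     = http,https
        filter   = nginx-404
        logpath  = /var/log/nginx/access.log
        findtime = 60
        maxretry = 30
        bantime  = 3600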

    replies(1): >>42961837 #
    16. dmantis ◴[] No.42960252{5}[source]
    > blocking obvious datacenter ASNs

    Then you block all VPN users as well, and currently many countries have some kind of censorship, so please don't do that. I have used a personal VPN for over 5 years and that's annoying.

    I understand the other side, and captchas/PoW captchas/additional checks are okay. But give people a choice to be private/non-censorable.

    Enabling/disabling the VPN every minute to access a non-censored local site that blocks datacenter IPs, then bringing it back again for general surfing, is a bit hellish.

    replies(1): >>42960453 #
    17. ghxst ◴[] No.42960453{6}[source]
    That's a fair point; probably the best approach would be to fall back to a client-side challenge where the server-side checks fail, but at that point it's no longer as simple a setup. Toggling a VPN is definitely annoying, but a captcha or something like PoW comes with an impact on user experience as well, and in my experience is easier (and cheaper) for bots to deal with: a good quality residential proxy where you pay per GB quickly becomes a lot more expensive than a captcha solver service or the compute for a PoW challenge.
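
    For context on the PoW cost trade-off: the challenge is just "find a nonce whose hash clears a difficulty target", so the per-request client cost is tunable. A toy sketch of the idea, not any particular vendor's implementation (the difficulty and challenge format are made up):

        import hashlib
        import secrets

        DIFFICULTY_BITS = 20  # ~2^20 hashes on average; tune cost vs. user experience

        def issue_challenge() -> str:
            # Server side: a random challenge tied to the session or request.
            return secrets.token_hex(16)

        def meets_target(challenge: str, nonce: int) -> bool:
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
            return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))

        def solve(challenge: str) -> int:
            # Client side: brute force until the target is met (seconds of CPU at 20 bits).
            nonce = 0
            while not meets_target(challenge, nonce):
                nonce += 1
            return nonce

        challenge = issue_challenge()
        nonce = solve(challenge)               # costly for the client...
        print(meets_target(challenge, nonce))  # ...one hash for the server to verify: True
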
    replies(1): >>42960498 #
    18. dmantis ◴[] No.42960498{7}[source]
    Yes, but you can use captcha/PoW challenges based on IP reputation, which leaves regular users untouched. I don't mind captchas too much; it's my choice to use the VPN.

    What I mean is that it's better to give VPN users the choice to solve captchas instead of being banned completely.

    19. DocTomoe ◴[] No.42960711[source]
    Sounds like a problem easily solved with fail2ban, which keeps legitimate folks in and offenders out - and also unbans after a set amount of time, to avoid dynamic IPs screwing over legitimate users permanently.
    20. Terr_ ◴[] No.42960779[source]
    I'd settle for some kind of "proof of investment" in a bot-identity, so that I know blocking that identity is impactful, and it's not just one of a billion tiny throwaways.

    In other words, knowing who someone is isn't strictly necessary, provided they have "skin in the game" to encourage proper behavior.

    21. zepearl ◴[] No.42961837{6}[source]
    Same here - fail2ban then adds the IP to my nftables fw
    22. MatthiasPortzel ◴[] No.42961964[source]
    Why not just ignore the bots? I have a Linode VPS, cheapest tier, and I get 1TB of network transfer a month. The bots that you're concerned about use a tiny fraction of that (<1%). I'm not behind a CDN and I've never put effort into banning at the IP level or setting up fail2ban.

    I get that there might be some feeling of righteous justice that comes from removing these entries from your Nginx logs, but it also seems like there's a lot of self-induced stress that comes from monitoring Nginx and ssh logs for failed requests.

    23. _factor ◴[] No.42964421[source]
    Best Buy will also sell identical hardware with a slightly modified SKU and negligible changes to avoid comparison.

    It’s difficult to compare when BB is the “only” company that sells a particular item.

    24. tempodox ◴[] No.42964848[source]
    +1 for spite-driven development.