    Anubis Works (xeiaso.net)

    313 points by evacchi | 11 comments
    perching_aix ◴[] No.43668684[source]
    > Sadly, you must enable JavaScript to get past this challenge. This is required because AI companies have changed the social contract around how website hosting works. A no-JS solution is a work-in-progress.

    Will be interested to hear of that. In the meantime, at least I learned of JShelter.

    Edit:

    Why not use the passage of time as the limiter? I guess it would still require JS though, unless there's some hack possible with CSS animations, like request an image with certain URL params only after an animation finishes.

    This does remind me how all of these additional hoops are making web browsing slow.

    Edit #2:

    Thinking even more about it, time could be made a hurdle by just... slowly serving incoming requests. No fancy timestamp signing + CSS animations or whatever trickery required.
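
    To make that concrete, here's a minimal Go sketch (mine, nothing from Anubis itself) of the "just serve slowly" idea: a middleware that parks every request for a couple of seconds before responding. The two-second delay and handler names are made up for illustration; note that sleeping only costs the scraper wall-clock time, not compute, which is the weakness pointed out downthread.

        package main

        import (
            "fmt"
            "net/http"
            "time"
        )

        // delayMiddleware parks each request for `delay` before passing it on.
        // Sleeping costs the server almost nothing (the goroutine just waits),
        // and it costs the client only wall-clock time.
        func delayMiddleware(delay time.Duration, next http.Handler) http.Handler {
            return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                time.Sleep(delay) // no CPU burned while waiting
                next.ServeHTTP(w, r)
            })
        }

        func main() {
            content := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                fmt.Fprintln(w, "hello, slowly")
            })
            http.ListenAndServe(":8080", delayMiddleware(2*time.Second, content))
        }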

    I'm also not sure whether time would make at-scale scraping as much more expensive as PoW does. Time is money, sure, but that much? I'm also not sold on the UX of it, but that could be mitigated somewhat by doing news-website-style "I'm only serving the first 20% of my content initially" stuff.

    So yeah, will be curious to hear the non-JS solution. The easy way out would be a browser extension, but then it's not really non-JS, just JS compartmentalized, isn't it?

    Edit #3:

    Turning reasoning on for a moment, this whole thing is a bit iffy.

    First of all, the goal is that a website operator would be able to control the use of information they disseminate to the general public via their website, such that it won't be used specifically for AI training. In principle, this is nonsensical. The goal of sharing information with the general public (so, people) involves said information eventually traversing through a non-technological medium (air, as light), to reach a non-technological entity (a person). This means that any technological measure will be limited to before that medium, and won't be able to affect said target either. Put differently, I can rote copy your website out into a text editor, or hold up a camera with OCR and scan the screen, if scale is needed.

    So in principle we're definitely hosed, but in practice you can try to hold onto the modality of "scraping for AI training" by leveraging the various technological fingerprints of such activity, which is how we get to at-scale PoW. But then this also combats any other kind of at-scale scraping, such as search engines. You could whitelist specific search engines, but then you're engaging in anti-competitive measures, since smaller third party search engines now have to magically get themselves on your list. And even if they do, they might be lying about being just a search engine, because e.g. Google may scrape your website for search, but will 100% use it for AI training then too.

    So I don't really see any technological modality that would be able to properly discriminate AI-training-purposed scraping traffic for you to use PoW or other methods against. You may decide to engage in this regardless based on statistical data, and just live with the negative aspects of your efforts, but then it's a bit iffy.

    Finally, what about the energy consumption shaped elephant in the room? Using PoW for this is going basically exactly against the spirit of wanting less energy to be spent on AI and co. That said, this may not be a goal for the author.

    The more I think about this, the less sensible and agreeable it is. I don't know man.

    replies(3): >>43668835 #>>43668958 #>>43668998 #
    1. marginalia_nu ◴[] No.43668835[source]
    You basically need proof-of-work to make this work. Idling a connection is not computationally expensive, so it is not a deterrent.

    It's a shitty solution to an even shittier reality.
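
    To make the asymmetry concrete, here's a hedged hashcash-style sketch in Go (a generic scheme, not the actual Anubis implementation): the client has to brute-force a nonce until a hash has enough leading zero bits, which costs on the order of 2^difficulty hashes, while the server checks the answer with a single hash.

        package main

        import (
            "crypto/sha256"
            "encoding/binary"
            "fmt"
            "math/bits"
        )

        // leadingZeroBits counts the leading zero bits of a SHA-256 digest.
        func leadingZeroBits(sum [32]byte) int {
            n := 0
            for _, b := range sum {
                if b == 0 {
                    n += 8
                    continue
                }
                n += bits.LeadingZeros8(b)
                break
            }
            return n
        }

        // solve is the client's side: brute-force a nonce so that
        // sha256(challenge || nonce) has at least `difficulty` leading zero
        // bits. Expected cost grows like 2^difficulty.
        func solve(challenge []byte, difficulty int) uint64 {
            buf := make([]byte, len(challenge)+8)
            copy(buf, challenge)
            for nonce := uint64(0); ; nonce++ {
                binary.BigEndian.PutUint64(buf[len(challenge):], nonce)
                if leadingZeroBits(sha256.Sum256(buf)) >= difficulty {
                    return nonce
                }
            }
        }

        // verify is the server's side: a single hash, essentially free.
        func verify(challenge []byte, nonce uint64, difficulty int) bool {
            buf := make([]byte, len(challenge)+8)
            copy(buf, challenge)
            binary.BigEndian.PutUint64(buf[len(challenge):], nonce)
            return leadingZeroBits(sha256.Sum256(buf)) >= difficulty
        }

        func main() {
            challenge := []byte("per-visitor-random-challenge")
            nonce := solve(challenge, 20) // ~a million hashes on average
            fmt.Println("nonce:", nonce, "valid:", verify(challenge, nonce, 20))
        }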

    replies(1): >>43668939 #
    2. xena ◴[] No.43668939[source]
    Main author of Anubis here:

    Basically what they said. This is a hack, and it's specifically designed to exploit the infrastructure behind industrial-scale scraping. They usually have a different IP address do the scraping for each page load _but share the cookies between them_. This means that if they use headless chrome, they have to do the proof of work check every time, which scales poorly with the rates I know the headless chrome vendors charge for compute time per page.
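
    A hedged sketch of that cookie flow (the token format and handler names below are simplified stand-ins, not the real implementation): solve the challenge once, get back a signed, expiring cookie, and any request that doesn't present a valid cookie gets challenged again.

        package main

        import (
            "crypto/hmac"
            "crypto/sha256"
            "encoding/base64"
            "fmt"
            "net/http"
            "strconv"
            "strings"
            "time"
        )

        var secret = []byte("server-side-secret") // random in practice

        // mint produces an "expiry.signature" token after a solved challenge.
        func mint(ttl time.Duration) string {
            exp := strconv.FormatInt(time.Now().Add(ttl).Unix(), 10)
            mac := hmac.New(sha256.New, secret)
            mac.Write([]byte(exp))
            return exp + "." + base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
        }

        // valid checks the signature and the expiry.
        func valid(token string) bool {
            parts := strings.SplitN(token, ".", 2)
            if len(parts) != 2 {
                return false
            }
            exp, err := strconv.ParseInt(parts[0], 10, 64)
            if err != nil || time.Now().Unix() > exp {
                return false
            }
            mac := hmac.New(sha256.New, secret)
            mac.Write([]byte(parts[0]))
            want := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
            return hmac.Equal([]byte(want), []byte(parts[1]))
        }

        // gate serves the content only to requests with a valid cookie;
        // everything else would get the proof-of-work challenge page.
        func gate(next http.Handler) http.Handler {
            return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                if c, err := r.Cookie("challenge-pass"); err == nil && valid(c.Value) {
                    next.ServeHTTP(w, r)
                    return
                }
                http.Error(w, "solve the challenge first", http.StatusUnauthorized)
            })
        }

        func main() {
            fmt.Println("example cookie value:", mint(7*24*time.Hour))
            http.ListenAndServe(":8080", gate(http.FileServer(http.Dir("."))))
        }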

    replies(4): >>43669203 #>>43670066 #>>43670076 #>>43671025 #
    3. lifthrasiir ◴[] No.43669203[source]
    Do you think that, if this behavior of Anubis becomes well known and Anubis cookies get handled specifically to avoid the pathological PoW checks, Anubis will need a significant rework? Because if that's indeed true, this hack won't last much longer, and I have no further ideas for avoiding user-visible annoyances.
    replies(1): >>43669329 #
    4. solid_fuel ◴[] No.43669329{3}[source]
    Well, if they rework things so that requests all originate from the same IP address or a small set of addresses, then regular IP-based rate limits should work fine right?

    The point is just to stop what is effectively a DDoS because of shitty web crawlers, not to stop the crawling entirely.
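
    For reference, a basic per-IP token bucket is only a few lines of Go; the sketch below uses golang.org/x/time/rate with made-up parameters (1 request/second, burst of 10) and naively keys on RemoteAddr. Behind a proxy you'd need the real client IP, and as the replies below point out, shared IPs make this blunt.

        package main

        import (
            "net"
            "net/http"
            "sync"

            "golang.org/x/time/rate"
        )

        var (
            mu       sync.Mutex
            limiters = map[string]*rate.Limiter{}
        )

        // limiterFor returns (creating if needed) the bucket for an IP:
        // 1 request/second sustained with a burst of 10. Note this map grows
        // without bound; a real implementation would evict idle entries.
        func limiterFor(ip string) *rate.Limiter {
            mu.Lock()
            defer mu.Unlock()
            l, ok := limiters[ip]
            if !ok {
                l = rate.NewLimiter(1, 10)
                limiters[ip] = l
            }
            return l
        }

        // limit rejects requests whose source IP has exhausted its bucket.
        func limit(next http.Handler) http.Handler {
            return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                ip, _, err := net.SplitHostPort(r.RemoteAddr)
                if err != nil {
                    ip = r.RemoteAddr
                }
                if !limiterFor(ip).Allow() {
                    http.Error(w, "too many requests", http.StatusTooManyRequests)
                    return
                }
                next.ServeHTTP(w, r)
            })
        }

        func main() {
            http.ListenAndServe(":8080", limit(http.FileServer(http.Dir("."))))
        }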

    replies(2): >>43669664 #>>43669746 #
    5. lifthrasiir ◴[] No.43669664{4}[source]
    > Well, if [...], then regular IP-based rate limits should work fine right?

    I'm not sure. IP-based rate limits have a well-known issue with shared public IPs, for example. Technically they are also more resource-intensive than cryptographic approaches (though I don't think that's a big issue with IPv4).

    6. dharmab ◴[] No.43669746{4}[source]
    > then regular IP-based rate limits should work fine right?

    These are also harmful to human users, who are often behind CGNAT and may be sharing a pool of IPs with many thousands of other ISP subscribers.

    7. specialist ◴[] No.43670066[source]
    > Weigh the soul of incoming HTTP requests using proof-of-work to stop AI crawlers

    Based on the comments here, it seems like many people are struggling with the concept.

    Would calling Anubis a "client-side rate limiter" be accurate (enough)?

    replies(1): >>43670813 #
    8. vhcr ◴[] No.43670076[source]
    I used to have an ISP that would load-balance your connection between different providers, which meant that pretty much every single request would use a different IP. I know it's not that common, but it would mean real users would find pages using Anubis unusable.
    9. runxiyu ◴[] No.43670813{3}[source]
    Probably not
    10. ArinaS ◴[] No.43671025[source]
    Is there any particular date/time you'll introduce a no-JS solution?

    And are you going to support older browsers? I tested Anubis with https://www.browserling.com using its (I think) standard configuration at https://git.xeserv.us/xe/anubis-test/src/branch/main/README.... and apparently it doesn't work with Firefox versions before 74 or Chromium versions before 80.

    I wonder if it works with something like Pale Moon.

    replies(1): >>43671163 #
    11. xena ◴[] No.43671163{3}[source]
    It will be sooner if I can get paid enough to be able to quit my day job.