
Anubis Works

(xeiaso.net)
313 points by evacchi | 1 comment
perching_aix No.43668684
> Sadly, you must enable JavaScript to get past this challenge. This is required because AI companies have changed the social contract around how website hosting works. A no-JS solution is a work-in-progress.

Will be interested to hear of that. In the meantime, at least I learned of JShelter.

Edit:

Why not use the passage of time as the limiter? I guess it would still require JS though, unless there's some hack possible with CSS animations, like request an image with certain URL params only after an animation finishes.

This does remind me how all of these additional hoops are making web browsing slow.

Edit #2:

Thinking even more about it, time could be made a hurdle by just... slowly serving incoming requests. No fancy timestamp signing + CSS animations or whatever trickery required.
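
Roughly what I mean, as a crude sketch (Go; the delay value is made up and this is untested against real crawlers), see below how little it takes server-side:

    package main

    import (
        "net/http"
        "time"
    )

    // slowServe is a hypothetical middleware: every request waits a fixed
    // delay before the real handler runs, making time itself the cost.
    func slowServe(delay time.Duration, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(delay)
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        page := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("the actual content"))
        })
        http.ListenAndServe(":8080", slowServe(2*time.Second, page))
    }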

I'm also not sure that time would make at-scale scraping as much more expensive as PoW does. Time is money, sure, but that much? Also, I'm not sold on the UX of it, though it could be mitigated somewhat by doing news-website-style "I'm only serving the first 20% of my content initially" stuff.

So yeah, will be curious to hear the non-JS solution. The easy way out would be a browser extension, but then it's not really non-JS, just JS compartmentalized, isn't it?

Edit #3:

Turning reasoning on for a moment, this whole thing is a bit iffy.

First of all, the goal is that a website operator would be able to control the use of information they disseminate to the general public via their website, such that it won't be used specifically for AI training. In principle, this is nonsensical. The goal of sharing information with the general public (so, people) involves said information eventually traversing through a non-technological medium (air, as light), to reach a non-technological entity (a person). This means that any technological measure will be limited to before that medium, and won't be able to affect said target either. Put differently, I can rote copy your website out into a text editor, or hold up a camera with OCR and scan the screen, if scale is needed.

So in principle we're definitely hosed, but in practice you can try to hold onto the modality of "scraping for AI training" by leveraging the various technological fingerprints of such activity, which is how we get to at-scale PoW. But then this also combats any other kind of at-scale scraping, such as search engines. You could whitelist specific search engines, but then you're engaging in anti-competitive measures, since smaller third party search engines now have to magically get themselves on your list. And even if they do, they might be lying about being just a search engine, because e.g. Google may scrape your website for search, but will 100% use it for AI training then too.

So I don't really see any technological modality that would be able to properly discriminate AI-training-purposed scraping traffic for you to use PoW or other methods against. You may decide to engage in this regardless based on statistical data, and just live with the negative aspects of your efforts, but then it's a bit iffy.

Finally, what about the energy-consumption-shaped elephant in the room? Using PoW for this goes basically directly against the spirit of wanting less energy to be spent on AI and co. That said, this may not be a goal for the author.

The more I think about this, the less sensible and agreeable it is. I don't know man.

replies(3): >>43668835 #>>43668958 #>>43668998 #
1. abetusk No.43668998
Time delay, as you proposed, is easily defeated by concurrent connections. In some sense, you're only costing the bot latency, not throughput.

A bot network can make many connections at once, each waiting out the delay to get the entirety of its (multiple) requests. Every serial delay you put in is a minor inconvenience to a bot network, since it's automated anyway, but a degraded experience for good-faith users.

Time-delay solutions fare even worse for services like posting, account creation, etc., as they're sidestepped by concurrent connections that wait out the delay and then flood the server.
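
To make that concrete: a scraper that opens connections in parallel pays the delay once in wall-clock time, not once per page. A toy sketch (Go; the target URLs and counts are placeholders):

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "sync"
        "time"
    )

    func main() {
        // 100 pages behind an artificial per-request delay.
        urls := make([]string, 100)
        for i := range urls {
            urls[i] = fmt.Sprintf("https://example.com/page/%d", i)
        }

        start := time.Now()
        var wg sync.WaitGroup
        for _, u := range urls {
            wg.Add(1)
            go func(u string) {
                defer wg.Done()
                resp, err := http.Get(u) // each connection waits out the delay independently
                if err != nil {
                    return
                }
                io.Copy(io.Discard, resp.Body)
                resp.Body.Close()
            }(u)
        }
        wg.Wait()
        // Wall-clock time is roughly one delay, not a hundred of them.
        fmt.Println("fetched", len(urls), "pages in", time.Since(start))
    }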

Requiring proof-of-work costs the agent something in terms of resources. The proof-of-work certificate allows for easy verification (in terms of compute) relative to the amount of work it took to find the certificate in the first place.

A small resource tax on agents has minimal effect on everyday use but a compounding effect on bots, as any bot crawl now needs resources that scale linearly with the number of pages it requests. Without proof-of-work, the limiting resource for bots is network bandwidth, as processing page data is effectively free relative to bandwidth costs. By attaching a work/energy cost to each request, compute becomes a bottleneck for bots as well.
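
That asymmetry is the whole trick: finding the certificate is a brute-force search, checking it is one hash. A minimal hashcash-style sketch (Go; this is just the general shape, not Anubis's exact scheme):

    package main

    import (
        "crypto/sha256"
        "encoding/binary"
        "fmt"
        "math/bits"
    )

    // leadingZeroBits counts the leading zero bits of a SHA-256 digest.
    func leadingZeroBits(sum [32]byte) int {
        n := 0
        for _, b := range sum {
            if b == 0 {
                n += 8
                continue
            }
            n += bits.LeadingZeros8(b)
            break
        }
        return n
    }

    // verify is one hash: cheap for the server no matter how long the
    // client spent searching.
    func verify(challenge string, nonce uint64, difficulty int) bool {
        buf := make([]byte, 8)
        binary.BigEndian.PutUint64(buf, nonce)
        sum := sha256.Sum256(append([]byte(challenge), buf...))
        return leadingZeroBits(sum) >= difficulty
    }

    // solve brute-forces a nonce; expected cost is about 2^difficulty hashes.
    func solve(challenge string, difficulty int) uint64 {
        for nonce := uint64(0); ; nonce++ {
            if verify(challenge, nonce, difficulty) {
                return nonce
            }
        }
    }

    func main() {
        const difficulty = 20 // ~a million hashes on average; tune to taste
        challenge := "per-visitor-random-value"
        nonce := solve(challenge, difficulty)
        fmt.Println("nonce:", nonce, "valid:", verify(challenge, nonce, difficulty))
    }

The server only ever runs verify; the crawler has to run solve once per page it wants, which is exactly the linear scaling above.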

As an analogy, consider if sending an email cost $0.01. For most people, all the email they send over a year would easily come to less than $20.00, but for spam bots that blast up to 10k recipients at a time, each blast would now cost $100.00. The tax on individual users is minimal, but it is significant enough that mass spam efforts are strained.

It doesn't prevent spam, or bots, entirely, but the point is to provide some friction that's relatively transparent to end users while mitigating abusive use.