
Anubis Works

(xeiaso.net)
313 points by evacchi | 9 comments
gyomu No.43668594
If you’re confused about what this is - it’s to prevent AI scraping.

> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums

https://anubis.techaro.lol/docs/design/how-anubis-works
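
In other words, the client has to grind out a nonce whose SHA-256 hash of the server's challenge meets a difficulty target, while the server can check the answer with a single hash. A toy Python sketch of that general idea (the function names and difficulty here are illustrative, not Anubis's actual code):

    import hashlib
    import itertools

    def solve_challenge(challenge: str, difficulty: int) -> int:
        """Client side: find a nonce such that SHA-256(challenge + nonce)
        starts with `difficulty` hex zeros. This is the busy-work the
        browser does in JavaScript."""
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
            if digest.startswith("0" * difficulty):
                return nonce

    def verify(challenge: str, nonce: int, difficulty: int) -> bool:
        """Server side: checking the submitted answer costs a single hash."""
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        return digest.startswith("0" * difficulty)

    # The client burns CPU to find the nonce; the server verifies cheaply
    # and remembers the result so the work isn't redone on every request.
    nonce = solve_challenge("server-issued-random-string", difficulty=4)
    assert verify("server-issued-random-string", nonce, difficulty=4)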

This is pretty cool, I have a project or two that might benefit from it.

replies(2): >>43669511 #>>43671745 #
x3haloed No.43669511
I’ve been wondering to myself for many years now whether the web is for humans or machines. I personally can’t think of a good reason to specifically try to gate bots when it comes to serving content. Trying to post content or trigger actions could obviously be problematic under many circumstances.

But I find that when it comes to simple serving of content, human vs. bot is not usually what you’re trying to filter or block on. As long as a given client is not abusing your systems, then why do you care if the client is a human?

replies(8): >>43669544 #>>43669558 #>>43669572 #>>43670108 #>>43670208 #>>43670880 #>>43671272 #>>43676454 #
t-writescode No.43669544
> I personally can’t think of a good reason to specifically try to gate bots

There have been numerous posts on HN about people getting slammed, to the tune of many, many dollars and terabytes of data, by bots, especially LLM scrapers, burning bandwidth and driving up server-running costs.

replies(1): >>43669560 #
1. ronsor No.43669560
I'm genuinely skeptical that those are all real LLM scrapers. For one, a lot of content is in CommonCrawl and AI companies don't want to redo all that work when they can get some WARC files from AWS.

I'm largely suspecting that these are mostly other bots pretending to be LLM scrapers. Does anyone even check if the bots' IP ranges belong to the AI companies?

replies(4): >>43669584 #>>43669780 #>>43669996 #>>43670176 #
2. t-writescode No.43669584
No matter the source, the result is the same, and these proof-of-work systems may be something that can help "the little guy" with their hosting bill.
replies(1): >>43674775 #
3. anonym29 No.43669780
>Does anyone even check if the bots' IP ranges belong to the AI companies?

Sounds like a fun project for an AbuseIPDB contributor. Could look for fake Googlebots, Bingbots, etc., too.
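
The usual verification trick is forward-confirmed reverse DNS, which Google and Bing document for their crawlers: reverse-resolve the IP, check the domain, then resolve that hostname forward and make sure it maps back to the same IP. A rough Python sketch (error handling kept minimal):

    import socket

    def is_real_googlebot(ip: str) -> bool:
        """Forward-confirmed reverse DNS: a genuine Googlebot IP reverse-resolves
        to a *.googlebot.com / *.google.com hostname, and that hostname resolves
        back to the same IP."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
        except OSError:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
        except OSError:
            return False
        return ip in forward_ips

    # Anything with a Googlebot user agent that fails this check is a good
    # candidate for an AbuseIPDB report.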

4. userbinator No.43669996
I also suspect those working on "anti-bot" solutions may have a hand in this.

What better way to show the effectiveness of your solution than to help create the problem in the first place?

replies(1): >>43672784 #
5. 20after4 No.43670176
For a long time there have been spammers scraping in search of email addresses to spam, and there are all kinds of scraper bots with unknown purposes. It's the aggregate of all of them hitting your server, potentially several at the same time, that does the damage.

When I worked at Wikimedia (so ending ~4 years ago) we had several incidents of bots getting lost in a maze of links within our source repository browser (Phabricator), which could account for >50% of the load on some pretty powerful Phabricator servers (something like 96 cores and 512 GB of RAM). This happened despite those URLs being excluded via robots.txt and despite some rudimentary request throttling. The scrapers used lots of different IPs simultaneously and did not seem to respect any kind of sane rate limit. If Googlebot and one or two other scrapers hit at the same time, it was enough to cause an outage or at least seriously degrade performance.

Eventually we got better at rate limiting and put more URLs behind authentication, but it wasn't an ideal situation and would have been quite difficult to deal with had we been much more resource-constrained or less technically capable.
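
For the curious, the kind of per-client throttling described here can be as simple as a token bucket keyed by IP. A toy Python sketch of that idea (real deployments usually do this at the proxy layer, and scrapers that rotate IPs defeat the naive per-IP version, which is part of what made this so painful):

    import time

    class TokenBucket:
        """Allow roughly `rate` requests/second per key (e.g. client IP),
        with bursts up to `burst`."""
        def __init__(self, rate: float, burst: float):
            self.rate, self.burst = rate, burst
            self.buckets = {}  # key -> (tokens, last_timestamp)

        def allow(self, key: str) -> bool:
            now = time.monotonic()
            tokens, last = self.buckets.get(key, (self.burst, now))
            # Refill based on elapsed time, capped at the burst size.
            tokens = min(self.burst, tokens + (now - last) * self.rate)
            if tokens >= 1:
                self.buckets[key] = (tokens - 1, now)
                return True
            self.buckets[key] = (tokens, now)
            return False

    limiter = TokenBucket(rate=2, burst=10)   # ~2 req/s sustained per IP
    # In a request handler: if not limiter.allow(client_ip): return 429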

6. zaphar No.43672784
Why? When there are hundreds of hopeful AI/LLM scrapers more than willing to do that work for you, what possible reason would you have to do it yourself? Typical, common human behavior is perfectly capable of explaining this. There's no reason to reach for some underhanded conspiracy theory when simple incompetence and greed are more than adequate to explain it.
replies(1): >>43676087 #
7. ronsor No.43674775
If a bot claims to be from an AI company but isn't coming from that company's IP ranges, then it's lying and its activity is plain abuse. In that case, you shouldn't serve it a proof-of-work challenge; you should block it entirely.
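
Several AI companies do publish IP ranges for their crawlers (OpenAI does for GPTBot, for instance), so the check itself is cheap. A sketch of the idea in Python, using placeholder CIDRs rather than any vendor's real published list:

    import ipaddress

    # Placeholder CIDRs (TEST-NET ranges), not real published lists -- in
    # practice you'd load the ranges each vendor publishes and refresh them.
    PUBLISHED_CRAWLER_RANGES = {
        "GPTBot": ["192.0.2.0/24", "198.51.100.0/24"],
    }

    def claim_matches_ip(crawler_name: str, client_ip: str) -> bool:
        """True if the client's IP falls inside the ranges published for the
        crawler it claims to be; otherwise treat the traffic as plain abuse."""
        ip = ipaddress.ip_address(client_ip)
        return any(ip in ipaddress.ip_network(cidr)
                   for cidr in PUBLISHED_CRAWLER_RANGES.get(crawler_name, []))

    # e.g. block outright instead of serving a proof-of-work challenge:
    if not claim_matches_ip("GPTBot", "203.0.113.7"):
        pass  # return 403 / drop the connection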
replies(1): >>43675023 #
8. thunderfork No.43675023
Blocking abusive actors can be far from trivial. The proof-of-work system reduces the amount of effort that has to be spent identifying and blocking bad actors.
9. userbinator No.43676087
CF hosts websites that sell DDoS services.

Google really wants everyone to use its spyware-embedded browser.

There are tons of other "anti-bot" solutions without those conflicts of interest, yet the ones that become popular all seem to further those goals instead.