←back to thread

Anubis Works

(xeiaso.net)
319 points evacchi | 1 comments | | HN request time: 0s | source
Show context
throwaway150 ◴[] No.43668638[source]
Looks cool. But please help me understand. What's to stop AI companies from solving the challenge, completing the proof of work and scrape websites anyway?
replies(6): >>43668690 #>>43668774 #>>43668823 #>>43668857 #>>43669150 #>>43670014 #
marginalia_nu ◴[] No.43668823[source]
The problem with scrapers in general is the asymmetry of compute resources involved in generating versus requesting a website. You can likely make millions of HTTP requests with the compute required in generating the average response.

If you make it more expensive to request a documents at scale, you make this type of crawling prohibitively expensive. On a small scale it really doesn't matter, but if you're casting an extremely wide net and re-fetching the same documents hundreds of times, yeah it really does matter. Even if you have a big VC budget.

replies(2): >>43669262 #>>43669530 #
Nathanba ◴[] No.43669530[source]
Yes but the scraper only has to solve it once and it gets cached too right? Surely it gets cached, otherwise it would be too annoying for humans on phones too? I guess it depends on whether scrapers are just simple curl clients or full headless browsers but I seriously doubt that Google tier LLM scrapers rely on site content loading statically without js.
replies(3): >>43669867 #>>43669970 #>>43670258 #
ndiddy ◴[] No.43669970[source]
AI companies have started using a technique to evade rate limits where they will have a swarm of tens of thousands of scraper bots using unique residential IPs all accessing your site at once. It's very obvious in aggregate that you're being scraped, but when it's happening, it's very difficult to identify scraper vs. non-scraper traffic. Each time a page is scraped, it just looks like a new user from a residential IP is loading a given page.

Anubis helps combat this because even if the scrapers upgrade to running automated copies of full-featured web browsers that are capable of solving the challenges (which means it costs them a lot more to scrape than it currently does), their server costs would balloon even further because each time they load a page, it requires them to solve a new challenge. This means they use a ton of CPU and their throughput goes way down. Even if they solve a challenge, they can't share the cookie between bots because the IP address of the requestor is used as part of the challenge.

replies(2): >>43670010 #>>43670088 #
vhcr ◴[] No.43670088[source]
Until someone writes the proof of work code for GPUs and it runs 100x faster and cheaper.
replies(2): >>43670796 #>>43671523 #
runxiyu ◴[] No.43670796[source]
Anubis et al. are also looking into alternative algorithms. There seems to be consensus that SHA-256 PoW is not appropriate
replies(1): >>43671337 #
1. genewitch ◴[] No.43671337{3}[source]
There's lots of other ones but you want hashes that use lots of RAM, stuff like scrypt used to be the go-to but I am sure there are better, now.