
Anubis Works

(xeiaso.net)
313 points by evacchi | 18 comments
throwaway150 ◴[] No.43668638[source]
Looks cool. But please help me understand: what's to stop AI companies from solving the challenge, completing the proof of work, and scraping websites anyway?
replies(6): >>43668690 #>>43668774 #>>43668823 #>>43668857 #>>43669150 #>>43670014 #
1. marginalia_nu ◴[] No.43668823[source]
The problem with scrapers in general is the asymmetry of compute between generating a website and requesting one. You can likely make millions of HTTP requests with the compute required to generate the average response.

If you make it more expensive to request documents at scale, you make this type of crawling prohibitively expensive. On a small scale it really doesn't matter, but if you're casting an extremely wide net and re-fetching the same documents hundreds of times, yeah, it really does matter. Even if you have a big VC budget.
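
As a rough sketch of that asymmetry (the hash, difficulty, and challenge format here are illustrative, not necessarily Anubis' exact scheme): the client has to brute-force a nonce, while the server verifies it with a single hash.

    // Toy SHA-256 proof-of-work: the client burns ~2^difficulty hashes
    // finding a nonce; the server spends one hash checking it.
    package main

    import (
        "crypto/sha256"
        "encoding/binary"
        "fmt"
        "math/bits"
    )

    func leadingZeroBits(sum [32]byte) int {
        n := 0
        for _, b := range sum {
            if b != 0 {
                return n + bits.LeadingZeros8(b)
            }
            n += 8
        }
        return n
    }

    // solve brute-forces a nonce whose hash has `difficulty` leading zero bits.
    func solve(challenge []byte, difficulty int) uint64 {
        buf := make([]byte, len(challenge)+8)
        copy(buf, challenge)
        for nonce := uint64(0); ; nonce++ {
            binary.LittleEndian.PutUint64(buf[len(challenge):], nonce)
            if leadingZeroBits(sha256.Sum256(buf)) >= difficulty {
                return nonce
            }
        }
    }

    func main() {
        // Expected ~65k hashes for the client; exactly one for the server to verify.
        nonce := solve([]byte("example-challenge"), 16)
        fmt.Println("solved with nonce", nonce)
    }

Tuning the difficulty is what sets the price per request.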

replies(2): >>43669262 #>>43669530 #
2. charcircuit ◴[] No.43669262[source]
If you make it prohibitively expensive, almost no regular user will want to wait for it.
replies(2): >>43669428 #>>43669586 #
3. bobmcnamara ◴[] No.43669428[source]
Exponential backoff!
4. Nathanba ◴[] No.43669530[source]
Yes, but the scraper only has to solve it once and it gets cached too, right? Surely it gets cached, otherwise it would be too annoying for humans on phones too? I guess it depends on whether scrapers are just simple curl clients or full headless browsers, but I seriously doubt that Google-tier LLM scrapers rely on site content loading statically without JS.
replies(3): >>43669867 #>>43669970 #>>43670258 #
5. xboxnolifes ◴[] No.43669586[source]
Regular users usually aren't hopping through 10 pages per second. A regular user's rate is usually around a hundredth of that.
replies(1): >>43669917 #
6. Hakkin ◴[] No.43669867[source]
It sets a cookie with a JWT verifying you completed the proof-of-work, along with metadata about the origin of the request; the cookie is valid for a week. This is as far as Anubis goes: once you have this cookie you can do whatever you want on the site. For now it seems like enough to stop a decent portion of web crawlers.

You can do more underneath Anubis using the JWT as a sort of session token, though, like rate limiting on a per-proof-of-work basis: if a client using token X makes more than Y requests in a period of time, invalidate the token and force them to generate a new one. This would force them to either crawl slowly or use many times more resources to crawl your content.
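
As a rough sketch of that idea (the token key, limit, and invalidation hook here are illustrative, not Anubis' actual API):

    // Count requests per PoW token; once a token exceeds its budget,
    // the caller invalidates the JWT and forces a fresh proof of work.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type tokenLimiter struct {
        mu     sync.Mutex
        start  map[string]time.Time // window start per token
        counts map[string]int       // requests seen in the current window
        limit  int
        window time.Duration
    }

    func newTokenLimiter(limit int, window time.Duration) *tokenLimiter {
        return &tokenLimiter{
            start:  make(map[string]time.Time),
            counts: make(map[string]int),
            limit:  limit,
            window: window,
        }
    }

    // Allow returns false once the token has used up its budget for the window.
    func (l *tokenLimiter) Allow(tokenID string) bool {
        l.mu.Lock()
        defer l.mu.Unlock()
        now := time.Now()
        if s, ok := l.start[tokenID]; !ok || now.Sub(s) > l.window {
            l.start[tokenID] = now
            l.counts[tokenID] = 0
        }
        l.counts[tokenID]++
        return l.counts[tokenID] <= l.limit
    }

    func main() {
        lim := newTokenLimiter(100, time.Minute) // e.g. 100 requests/minute per token
        fmt.Println(lim.Allow("some-jwt-id"))    // true until the budget runs out
    }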

7. pabs3 ◴[] No.43669917{3}[source]
I tend to get blocked by HN when opening lots of comment pages in tabs with Ctrl+click.
replies(1): >>43670507 #
8. ndiddy ◴[] No.43669970[source]
AI companies have started using a technique to evade rate limits where they will have a swarm of tens of thousands of scraper bots using unique residential IPs all accessing your site at once. It's very obvious in aggregate that you're being scraped, but when it's happening, it's very difficult to identify scraper vs. non-scraper traffic. Each time a page is scraped, it just looks like a new user from a residential IP is loading a given page.

Anubis helps combat this because even if the scrapers upgrade to running automated copies of full-featured web browsers that are capable of solving the challenges (which means it costs them a lot more to scrape than it currently does), their server costs would balloon even further because each time they load a page, it requires them to solve a new challenge. This means they use a ton of CPU and their throughput goes way down. Even if they solve a challenge, they can't share the cookie between bots because the IP address of the requestor is used as part of the challenge.
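
A rough sketch of how binding the challenge to the requester's IP can work (the secret, the HMAC construction, and the extra inputs here are assumptions for illustration, not necessarily what Anubis does):

    // Derive the challenge from the client IP (plus other request metadata),
    // so a challenge solved by one bot is useless to bots on other IPs.
    package main

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    var serverSecret = []byte("rotate-this-secret") // illustrative per-site secret

    func challengeFor(clientIP, userAgent string) string {
        mac := hmac.New(sha256.New, serverSecret)
        mac.Write([]byte(clientIP))
        mac.Write([]byte(userAgent))
        return hex.EncodeToString(mac.Sum(nil))
    }

    func main() {
        // Two bots behind different residential IPs get different challenges,
        // so sharing one solved cookie across the swarm buys them nothing.
        fmt.Println(challengeFor("203.0.113.7", "Mozilla/5.0"))
        fmt.Println(challengeFor("198.51.100.23", "Mozilla/5.0"))
    }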

replies(2): >>43670010 #>>43670088 #
9. Nathanba ◴[] No.43670010{3}[source]
Tens of thousands of scraper bots for a single site? Is that really the case? I would have assumed that maybe 3-5 bots send, let's say, 20 requests per second in parallel to scrape. Sure, they might eventually start trying different IPs and bots if their others are timing out, but ultimately it's still the same end result: all they will realize is that they have to increase the timeout and use headless browsers to cache results, and the entire protection is gone. But yes, I think for big bot farms it will be a somewhat annoying cost increase to do this. This should really be combined with the Cloudflare captcha to make it even more effective.
replies(2): >>43670207 #>>43671130 #
10. vhcr ◴[] No.43670088{3}[source]
Until someone writes the proof of work code for GPUs and it runs 100x faster and cheaper.
replies(2): >>43670796 #>>43671523 #
11. Hasnep ◴[] No.43670207{4}[source]
If you're sending 20 requests per second from one IP address you'll hit rate limits quickly, that's why they're using botnets to DDoS these websites.
12. FridgeSeal ◴[] No.43670258[source]
It seems like a good chunk of the issue with these modern LLM scrapers is that they do _none_ of the normal "sane" things: caching content, respecting rate limits, using sitemaps, bothering to track explore depth properly, etc.
13. xboxnolifes ◴[] No.43670507{4}[source]
Yes, HN has a fairly strict slowdown policy for commenting. But that's irrelevant to the context here.
replies(1): >>43672457 #
14. runxiyu ◴[] No.43670796{4}[source]
Anubis et al. are also looking into alternative algorithms. There seems to be a consensus that SHA-256 PoW is not appropriate.
replies(1): >>43671337 #
15. marginalia_nu ◴[] No.43671130{4}[source]
A lot of the worst offenders seem to be routing the traffic through a residential botnet, which means that the traffic really does come from a huge number of different origins. It's really janky and often the same resources are fetched multiple times.

Saving and re-using the JWT cookie isn't that helpful either, since you can effectively rate limit using the cookie as identity; to reach the request rates you see now, they'd still need to solve hundreds or thousands of challenges per domain.

16. genewitch ◴[] No.43671337{5}[source]
There are lots of other ones, but you want hashes that use lots of RAM. Stuff like scrypt used to be the go-to, but I'm sure there are better options now.
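
A sketch of what a memory-hard check could look like, using scrypt from golang.org/x/crypto; the parameters and the leading-zero test are illustrative, not something Anubis has settled on:

    // Memory-hard PoW check: with N=2^15 and r=8, each scrypt attempt needs
    // roughly 128*N*r bytes (~32 MiB), which blunts cheap GPU brute-forcing.
    package main

    import (
        "encoding/binary"
        "fmt"
        "math/bits"

        "golang.org/x/crypto/scrypt"
    )

    func verify(challenge []byte, nonce uint64, difficulty int) bool {
        salt := make([]byte, 8)
        binary.LittleEndian.PutUint64(salt, nonce)
        key, err := scrypt.Key(challenge, salt, 1<<15, 8, 1, 32)
        if err != nil {
            return false
        }
        zeros := 0
        for _, b := range key {
            if b != 0 {
                zeros += bits.LeadingZeros8(b)
                break
            }
            zeros += 8
        }
        return zeros >= difficulty
    }

    func main() {
        fmt.Println(verify([]byte("example-challenge"), 42, 8))
    }
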
17. marginalia_nu ◴[] No.43671523{4}[source]
A big part of the problem with these scraping operations is how poorly implemented they are. They could get much cheaper gains simply by cleaning up how they operate, not redundantly fetching the same documents hundreds of times, and so on.

Regardless of how they solve the challenges, creating an incentive to be efficient is a victory in itself. GPUs aren't cheap either, especially not if you're renting them via a browser farm.

18. pabs3 ◴[] No.43672457{5}[source]
I meant to say article pages, not comment pages, but ack.