
Anubis Works

(xeiaso.net)
313 points | evacchi
throwaway150
Looks cool. But please help me understand: what's to stop AI companies from solving the challenge, completing the proof of work, and scraping websites anyway?
marginalia_nu
The problem with scrapers in general is the asymmetry of compute involved in generating versus requesting a website. You can likely make millions of HTTP requests with the compute required to generate the average response.

If you make it more expensive to request documents at scale, you make this type of crawling prohibitively expensive. On a small scale it really doesn't matter, but if you're casting an extremely wide net and re-fetching the same documents hundreds of times, yeah it really does matter. Even if you have a big VC budget.
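To make that asymmetry concrete, here is a minimal sketch of a SHA-256 proof-of-work gate in the spirit of what Anubis does (the real thing runs in the visitor's browser; the leading-zero-bits target and difficulty value here are illustrative assumptions, not Anubis's actual parameters). The server spends one hash to verify; the client spends on the order of 2^20 hashes to solve.

    import hashlib
    import itertools
    import secrets

    DIFFICULTY_BITS = 20  # client needs ~2**20 hash attempts on average

    def meets_target(challenge: str, nonce: int) -> bool:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        # True when the first DIFFICULTY_BITS bits of the hash are zero
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

    def solve(challenge: str) -> int:
        # Client side: brute-force nonces until one clears the target (expensive)
        return next(n for n in itertools.count() if meets_target(challenge, n))

    def verify(challenge: str, nonce: int) -> bool:
        # Server side: a single hash, no matter how long the client searched
        return meets_target(challenge, nonce)

    challenge = secrets.token_hex(16)  # issued fresh per visitor/session
    nonce = solve(challenge)           # ~1M hashes for the requester
    assert verify(challenge, nonce)    # one hash for the server

One verification is trivially cheap for the server; one solution costs the client roughly a million hashes, which is imperceptible for a single human pageview but adds up quickly across millions of scraped URLs.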

Nathanba
Yes, but the scraper only has to solve it once and the result gets cached too, right? Surely it gets cached, otherwise it would be too annoying for humans on phones too. I guess it depends on whether scrapers are simple curl clients or full headless browsers, but I seriously doubt that Google-tier LLM scrapers rely on site content loading statically without JS.
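A rough cost model of that caching question: if a pass (as I understand it, Anubis stores one in a cookie once the challenge is solved) is reused across a whole crawl session, the per-request cost of the proof of work collapses; the deterrent mainly bites when the crawler has to re-solve often, e.g. across many IPs, sessions, or expiries. The figures below are assumptions for illustration, not measurements.

    # How much the PoW costs per request, depending on how many requests
    # a single solved challenge is reused for. Numbers are assumed.
    HASHES_PER_SOLVE = 2**20  # matches the difficulty sketched above

    def hashes_per_request(requests_per_solved_challenge: int) -> float:
        return HASHES_PER_SOLVE / requests_per_solved_challenge

    print(hashes_per_request(1))        # naive client re-solving every time: ~1.0e6
    print(hashes_per_request(100_000))  # headless browser reusing the cookie: ~10.5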
FridgeSeal
It seems a good chunk of the issue with these modern LLM scrapers is that they do _none_ of the normal "sane" things: caching content, respecting rate limits, using sitemaps, bothering to track crawl depth properly, etc.
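For contrast, a minimal sketch of that "sane" baseline: revalidate cached content instead of re-downloading it, back off between requests, and bound crawl depth. Names and numbers are made up for illustration; a real crawler would also need URL deduplication and per-host scheduling.

    import time
    import requests

    SESSION = requests.Session()
    ETAGS: dict[str, str] = {}    # url -> last seen ETag (naive response cache)
    CRAWL_DELAY_SECONDS = 2.0     # fixed back-off between requests
    MAX_DEPTH = 3                 # bound on how deep links are followed

    def polite_get(url: str, depth: int = 0) -> str | None:
        if depth > MAX_DEPTH:
            return None
        headers = {"User-Agent": "example-polite-crawler/0.1"}
        if url in ETAGS:
            # Conditional request: the server can answer 304 instead of resending the page
            headers["If-None-Match"] = ETAGS[url]
        resp = SESSION.get(url, headers=headers, timeout=30)
        time.sleep(CRAWL_DELAY_SECONDS)
        if resp.status_code == 304:
            return None               # unchanged since the last crawl, nothing to re-parse
        if etag := resp.headers.get("ETag"):
            ETAGS[url] = etag
        return resp.text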