
770 points | ta988 | 1 comment
andrethegiant No.42553029
CommonCrawl is supposed to help with this: crawl once and host the dataset for any interested party to download out of band. However, the data can be up to a month stale, and it costs $$ to move it out of us-east-1.

I’m working on a centralized crawling platform[1] that aims to mitigate OP’s problem. A caching layer with a ~24h TTL for unauthenticated content (sketched below) would shield websites from redundant bot traffic while still giving AI crawlers reasonably fresh content.

[1] https://crawlspace.dev
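
To illustrate the idea, here is a minimal Python sketch of such a shared cache, assuming a ~24h TTL and in-memory storage (hypothetical, not crawlspace.dev's actual implementation): a fresh copy is served from the cache, so the origin sees at most one fetch per URL per day no matter how many crawlers ask.

    import time
    import requests

    TTL_SECONDS = 24 * 60 * 60                    # ~24h freshness window
    _cache: dict[str, tuple[float, bytes]] = {}   # url -> (fetched_at, body)

    def cached_fetch(url: str) -> bytes:
        """Return the cached copy if younger than the TTL, else refetch."""
        entry = _cache.get(url)
        if entry and time.time() - entry[0] < TTL_SECONDS:
            return entry[1]                       # fresh: no traffic hits the origin
        body = requests.get(url, timeout=30).content
        _cache[url] = (time.time(), body)         # stale or missing: one refetch
        return body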

replies(2): >>42555844 >>42563422
1. alphan0n No.42555844
Laughably, CommonCrawl shows that the author's robots.txt was configured to allow all crawlers the entire time.

https://pastebin.com/VSHMTThJ
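
For what it's worth, an allow-all robots.txt is just "User-agent: *" followed by an empty "Disallow:" line, and the claim is easy to check with Python's standard library (the URL below is a placeholder; CCBot is CommonCrawl's crawler):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")  # substitute the author's site
    rp.read()                                               # fetch and parse robots.txt
    print(rp.can_fetch("CCBot", "/"))                       # True if CommonCrawl may crawl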