CommonCrawl is supposed to help with this, i.e. crawl once and host the dataset for any interested party to download out of band. However, the data can be up to a month stale, and it costs $$ to move it out of us-east-1.
I’m working on a centralized crawling platform[1] that aims to mitigate the problem OP describes. A caching layer with a ~24h TTL for unauthenticated content would shield websites from redundant bot traffic while still providing reasonably fresh content to AI crawlers.
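Roughly, the caching layer behaves like the minimal in-memory sketch below; the names (CrawlCache) and URLs are illustrative only, not the platform's actual API, and a real deployment would sit behind shared storage rather than a per-process dict:

    # Sketch of the ~24h-TTL cache for unauthenticated pages: repeated fetches
    # of the same URL within the TTL are served from the cache instead of
    # hitting the origin site again.
    import time
    import urllib.request

    CACHE_TTL_SECONDS = 24 * 60 * 60  # ~24h TTL for unauthenticated content


    class CrawlCache:
        """In-memory cache keyed by URL (illustrative; not the real backend)."""

        def __init__(self, ttl: float = CACHE_TTL_SECONDS):
            self.ttl = ttl
            self._store: dict[str, tuple[float, bytes]] = {}

        def get(self, url: str) -> bytes | None:
            entry = self._store.get(url)
            if entry is None:
                return None
            fetched_at, body = entry
            if time.time() - fetched_at > self.ttl:
                # Stale entry: drop it so the next request refreshes from origin.
                del self._store[url]
                return None
            return body

        def fetch(self, url: str) -> bytes:
            cached = self.get(url)
            if cached is not None:
                return cached  # shields the origin from a redundant crawl
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
            self._store[url] = (time.time(), body)
            return body


    if __name__ == "__main__":
        cache = CrawlCache()
        first = cache.fetch("https://example.com/")   # hits the origin once
        second = cache.fetch("https://example.com/")  # served from cache within 24h
        print(len(first), len(second))

The point is that many AI crawlers asking for the same unauthenticated page within a day translate into a single origin fetch.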
replies(2):