
770 points | ta988 | 1 comment
andrethegiant No.42553029
CommonCrawl is supposed to help with this: crawl once and host the dataset for any interested party to download out of band. However, the data can be up to a month stale, and it costs $$ to move it out of us-east-1.

I’m working on a centralized crawling platform[1] that aims to mitigate OP’s problem. A caching layer with a ~24h TTL for unauthenticated content (sketched below) would shield websites from redundant bot traffic while still giving AI crawlers reasonably fresh content.

[1] https://crawlspace.dev
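
To illustrate the idea, here is a minimal Python sketch of such a shared cache, assuming a ~24h TTL and in-memory storage (hypothetical, not crawlspace.dev's actual implementation): a fresh copy is served from the cache, so the origin sees at most one fetch per URL per day no matter how many crawlers ask.

    import time
    import requests

    TTL_SECONDS = 24 * 60 * 60                    # ~24h freshness window
    _cache: dict[str, tuple[float, bytes]] = {}   # url -> (fetched_at, body)

    def cached_fetch(url: str) -> bytes:
        """Return the cached copy if younger than the TTL, else refetch."""
        entry = _cache.get(url)
        if entry and time.time() - entry[0] < TTL_SECONDS:
            return entry[1]                       # fresh: no traffic hits the origin
        body = requests.get(url, timeout=30).content
        _cache[url] = (time.time(), body)         # stale or missing: one refetch
        return body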

replies(2): >>42555844 >>42563422
1. alphan0n No.42555844
Laughably, CommonCrawl shows that the author's robots.txt was configured to allow all crawlers the entire time.

https://pastebin.com/VSHMTThJ
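
For what it's worth, an allow-all robots.txt is just "User-agent: *" followed by an empty "Disallow:" line, and the claim is easy to check with Python's standard library (the URL below is a placeholder; CCBot is CommonCrawl's crawler):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")  # substitute the author's site
    rp.read()                                               # fetch and parse robots.txt
    print(rp.can_fetch("CCBot", "/"))                       # True if CommonCrawl may crawl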