
514 points andy99 | 2 comments
k__ ◴[] No.44536047[source]
"respecting web crawling opt-outs during data acquisition produces virtually no performance degradation"

Great to read that!
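
For anyone wondering what honoring an opt-out looks like in practice, here is a minimal sketch. It assumes the opt-outs in question are expressed through robots.txt, which the quote does not spell out. The robots.txt content and the crawler name "ExampleBot" are made up for illustration; only the standard-library `urllib.robotparser` module is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is opted out entirely,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        for url in ("https://example.com/page", "https://example.com/private/x"):
            print(agent, url, allowed(ROBOTS_TXT, agent, url))
```

A crawler that respects opt-outs would run a check like this before fetching and simply skip any URL for which it returns False.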

replies(3): >>44536377 #>>44538760 #>>44539036 #
1. JKCalhoun ◴[] No.44538760[source]
Is there not yet a source where the web has already been scraped and boiled down to just the text? It would seem someone would have created such a thing to save LLM training efforts from having to reinvent the wheel.

I understand the web is a dynamic thing, but it would still seem to be useful on some level.

replies(1): >>44540972 #
2. CaptainFever ◴[] No.44540972[source]
Common Crawl, maybe?
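
Common Crawl publishes WET files, which hold the extracted plain text of each page in WARC-style records. A rough sketch of reading such records with only the standard library follows. The `make_record` helper and the sample URLs are hypothetical and exist only to build in-memory test data; real WET files are gzip-compressed and would need to be opened with `gzip.open` first.

```python
import io


def iter_wet_records(stream):
    """Yield (target_uri, text) for each 'conversion' record in a WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        # Read "Name: value" header lines until the empty line that ends them.
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break
            name, _, value = raw.partition(b":")
            headers[name.strip().lower()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        body = stream.read(int(headers.get(b"content-length", b"0")))
        if headers.get(b"warc-type") == b"conversion":
            uri = headers.get(b"warc-target-uri", b"").decode("utf-8", "replace")
            yield uri, body.decode("utf-8", "replace")


def make_record(uri, text):
    """Hypothetical helper: build one minimal WET-style record for testing."""
    body = text.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: conversion\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


if __name__ == "__main__":
    sample = make_record("https://example.com/a", "Hello web") + make_record(
        "https://example.com/b", "Second page"
    )
    for uri, text in iter_wet_records(io.BytesIO(sample)):
        print(uri, "->", text)
```

For anything beyond a quick look, a dedicated WARC library is probably a safer choice than a hand-rolled parser like this one.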