Is there not yet a source where the web has already been scraped and souped down to just the text? It seems someone would have created such a thing so that every LLM training effort doesn't have to reinvent the wheel.
I understand the web is a dynamic thing, but it still seems such a source would be useful on some level.