Not only that, but training on a dataset that includes LLM-generated content is known to degrade model quality; the phenomenon is usually called "model collapse" (I believe the paper I'm thinking of is "The Curse of Recursion: Training on Generated Data Makes Models Forget"). Essentially, the internet is now chock full of LLM output, so any model trained on a fresh crawl is going to end up quite a bit worse than it could have been, simply because the dataset has been "poisoned" by preexisting LLMs: rare, tail-of-the-distribution examples get underrepresented a little more with each generation of synthetic data. I bet OpenAI's only real advantage is having a dataset that was gathered before LLM use was widespread.
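The tail-shrinking intuition is easy to demo with a toy simulation. This is a sketch, not the actual experiment from any paper: each "generation" fits a Gaussian to samples drawn from the previous generation's fitted Gaussian, standing in for training a model on the previous model's output. The function name and parameters are made up for illustration.

```python
import random
import statistics

def simulate_collapse(generations=50, n_samples=20, seed=0):
    """Toy 'model collapse' demo: repeatedly refit a Gaussian to samples
    drawn from the previous fit. Rare tail values get under-sampled each
    round, so the fitted sigma tends to drift downward over generations
    instead of staying at the true value of 1.0."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the "human data" distribution
    sigmas = [sigma]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)    # refit on purely synthetic data
        sigma = statistics.stdev(samples)
        sigmas.append(sigma)
    return sigmas

if __name__ == "__main__":
    sigmas = simulate_collapse()
    print(f"sigma: start={sigmas[0]:.3f} end={sigmas[-1]:.3f}")
```

With a small sample size per generation the estimated spread follows a multiplicative random walk with a downward bias, which is the toy version of a model gradually forgetting the tails of its original training distribution.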