
56 points by trott | 2 comments
1. bambax ◴[] No.40715513[source]
The paradox is that the amount of data available for LLM training is going down, not up, because earlier models made ample use of copyrighted works that later models won't have access to.
replies(1): >>40717153 #
2. LoganDark ◴[] No.40717153[source]
Not only that, but training on a dataset that includes LLM-generated content has been shown to reduce model quality. I remember there being a paper on it but I can't seem to find it now. Essentially, because the internet is now chock full of LLM garbage, any model you train on it is going to end up quite a bit worse than it could have been, simply because the dataset has been "poisoned" by preexisting LLMs. I bet OpenAI's only real advantage is having a dataset that was gathered before LLM use was widespread.
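The effect described here is usually called "model collapse" (the paper the commenter is probably recalling is Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget"): a model trained on a previous model's output loses the tails of the original data distribution, and the loss compounds across generations. Below is a minimal toy sketch of the mechanism, not the paper's actual setup; it assumes the "model" is just a fitted Gaussian, and truncating at 2 sigma stands in for generative models under-sampling rare content.

    # Toy "model collapse": each generation is fit only to the previous
    # generation's output, with tail samples under-represented.
    import numpy as np

    rng = np.random.default_rng(42)

    # Generation 0: "human-written" data, a standard normal distribution.
    data = rng.normal(loc=0.0, scale=1.0, size=50_000)

    for generation in range(8):
        mu, sigma = data.mean(), data.std()
        print(f"gen {generation}: mean={mu:+.3f}, std={sigma:.3f}")

        # "Train" the next model on the current data, then generate from it,
        # keeping only outputs within 2 sigma of the mean -- a stand-in for
        # generative models under-representing rare/tail content.
        samples = rng.normal(loc=mu, scale=sigma, size=200_000)
        data = samples[np.abs(samples - mu) < 2 * sigma][:50_000]

Each generation's spread shrinks by roughly 12%, so after a handful of rounds the "model" has forgotten most of the variety in the original data, which is the sense in which an LLM-polluted crawl is a worse training set than a pre-LLM one.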