> There were projects to try to match it, but generally they operated by fine-tuning small (70B) llama models on a bunch of GPT-3 generated texts (synthetic data, which can result in degeneration when AI outputs are fed back into AI training inputs).
That parenthetical doesn't quite work for me.
If synthetic data always degraded performance, AI labs wouldn't use synthetic data. They use it because it helps them train better models.
There's a paper showing that if you very deliberately train a model on its own output in a loop, you can get worse performance. That's not what AI labs using synthetic data actually do.
That paper gets a lot of attention because the schadenfreude of models destroying themselves by eating their own tails is irresistible.
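For intuition, here's a toy sketch of that closed loop (my own illustration, not the paper's actual experiment): fit a simple Gaussian to data, sample fresh "training data" from the fit, refit, and repeat. The fitted variance drifts toward zero, which is the collapse failure mode in miniature. The model choice, sample size, and generation count here are all arbitrary, picked just to make the drift visible.

```python
# Toy sketch of recursive self-training collapse: each generation is
# trained ONLY on samples from the previous generation's fitted model.
import numpy as np

rng = np.random.default_rng(0)
n = 50                             # samples per generation (small on purpose)
data = rng.normal(0.0, 1.0, n)     # generation 0: "real" data from N(0, 1)

for gen in range(1, 501):
    mu_hat = data.mean()           # "train" the model on current data
    sigma_hat = data.std()
    data = rng.normal(mu_hat, sigma_hat, n)  # next generation: synthetic only
    if gen % 100 == 0:
        print(f"gen {gen:3d}: mu={mu_hat:+.3f} sigma={sigma_hat:.3f}")
```

Run it and sigma shrinks toward zero: the tails get clipped a little each round, and with no fresh real data there's nothing to restore them. What labs actually do (filtering, grading, and mixing synthetic data with real data) breaks exactly that loop.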