262 points rain1 | 4 comments
simonw ◴[] No.44443287[source]
> There were projects to try to match it, but generally they operated by fine tuning things like small (70B) llama models on a bunch of GPT-3 generated texts (synthetic data - which can result in degeneration when AI outputs are fed back into AI training inputs).

That parenthetical doesn't quite work for me.

If synthetic data always degraded performance, AI labs wouldn't use synthetic data. They use it because it helps them train better models.

There's a paper showing that if you very deliberately train a model on its own output in a loop, you can get worse performance. That's not what AI labs using synthetic data actually do.
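
As a toy stand-in for that setup (not the paper's actual experiment; the numbers are arbitrary), you can watch a fitted distribution drift when each "generation" is trained only on the previous generation's samples:

    # Toy recursive-training loop: each generation is a Gaussian fitted to
    # samples drawn from the previous fit. With finite samples the fitted
    # spread drifts downward and information about the original data is lost.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0                     # the "real" data distribution
    for generation in range(50):
        samples = rng.normal(mu, sigma, 30)  # train only on the previous model's output
        mu, sigma = samples.mean(), samples.std()
        if generation % 10 == 0:
            print(generation, round(sigma, 3))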

That paper gets a lot of attention because the schadenfreude of models destroying themselves through eating their own tails is irresistible.

replies(1): >>44443471 #
1. rybosome ◴[] No.44443471[source]
Agreed, especially in this context of training a smaller model on a larger model's outputs. Distillation is generally accepted as an effective technique.

This is exactly what I did in a previous role, fine-tuning Llama and Mistral models on a mix of human and GPT-4 data for a domain-specific task. Adding (good) synthetic data definitely increased the output quality for our tasks.
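
For a concrete (and heavily simplified) picture of that mix, here is a sketch assuming Hugging Face datasets; the examples and field names are invented placeholders, not our actual data:

    # Sketch: combine human-written and GPT-4-generated examples into one
    # supervised fine-tuning set. The contents below are placeholders.
    from datasets import Dataset, concatenate_datasets

    human = Dataset.from_list([
        {"prompt": "Summarise this ticket: ...", "completion": "Human-written summary ..."},
    ])
    synthetic = Dataset.from_list([
        {"prompt": "Summarise this ticket: ...", "completion": "GPT-4-generated summary ..."},
    ])

    # The human/synthetic ratio is a tuning knob; the human data stays in the mix
    # rather than training on synthetic text alone.
    train_set = concatenate_datasets([human, synthetic]).shuffle(seed=0)
    # train_set then feeds a standard SFT loop over a Llama or Mistral checkpoint.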

replies(1): >>44444658 #
2. rain1 ◴[] No.44444658[source]
Yes, but purely in terms of entropy, you can't make a model better than GPT-4 by training it on GPT-4 outputs. The limit you would converge towards is GPT-4.
replies(2): >>44445071 #>>44445255 #
3. ◴[] No.44445071[source]
4. simonw ◴[] No.44445255[source]
A better way to think about synthetic data is to consider code. With code you can have an LLM generate code with tests, then confirm that the code compiles and the tests pass. Now you have semi-verified new code you can add to your training data, and training on that will help you get better results for code even though it was generated by a "less good" LLM.
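
Roughly, as a sketch (the helper names and the sample pair are made up, and it assumes pytest is available):

    # Generate code plus tests (hard-coded here as a stand-in for an LLM call),
    # run the tests in a scratch directory, and keep only samples that pass.
    import os, subprocess, tempfile

    def passes_tests(code: str, tests: str) -> bool:
        with tempfile.TemporaryDirectory() as d:
            with open(os.path.join(d, "candidate.py"), "w") as f:
                f.write(code)
            with open(os.path.join(d, "test_candidate.py"), "w") as f:
                f.write(tests)
            return subprocess.run(["pytest", "-q", d], capture_output=True).returncode == 0

    # Pretend these pairs came from the "less good" model.
    samples = [{
        "code": "def add(a, b):\n    return a + b\n",
        "tests": "from candidate import add\n\ndef test_add():\n    assert add(2, 2) == 4\n",
    }]

    training_data = [s for s in samples if passes_tests(s["code"], s["tests"])]
    print(len(training_data), "verified samples")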