We do this in fine-tuning all the time: see reverse prompting, etc.
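To illustrate the reverse-prompting idea: given an existing good document or answer, you ask a model to invent the instruction that would have produced it, and the resulting pairs become fine-tuning data. A rough sketch, where llm() is just a placeholder for whatever generation call you use:

    # Rough sketch of reverse prompting / instruction backtranslation.
    # `llm` is a hypothetical stand-in for any text-generation call.
    def llm(prompt: str) -> str: ...

    def reverse_prompt(document: str) -> dict:
        instruction = llm(
            "Write the instruction a user could have given "
            "to produce the following text:\n\n" + document
        )
        # (instruction, document) becomes a synthetic fine-tuning pair
        return {"instruction": instruction, "response": document}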
You can create inputs for DPO/ORPO synthetically, which is a huge win, since previously this would require gigantic investments: https://arxiv.org/abs/2402.10379
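A minimal sketch of how such synthetic preference pairs can be built (llm and judge_score are hypothetical stand-ins for a generator and an LLM-as-judge scorer; the prompt/chosen/rejected layout is the common one used by DPO-style trainers, not something taken from the paper):

    # Sketch: build preference pairs without human annotators.
    def llm(prompt: str) -> str: ...                       # placeholder generator
    def judge_score(prompt: str, answer: str) -> float: ...  # placeholder judge

    def make_preference_pair(prompt: str) -> dict:
        a, b = llm(prompt), llm(prompt)
        if judge_score(prompt, a) >= judge_score(prompt, b):
            chosen, rejected = a, b
        else:
            chosen, rejected = b, a
        return {"prompt": prompt, "chosen": chosen, "rejected": rejected}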
There's also the Gemma 2 paper, which has advanced the SOTA in distillation. On a side note, there are many reasons for that (vocab_size and the good 9B/27B sizes among them), but IMHO it's currently the best model for e.g. Ukrainian. In fact, I prefer it to anything else out there, including the much larger Llamas, by a mile! The model is a triumph of synthetic datasets. https://arxiv.org/abs/2408.00118
Also see the Princeton paper on SimPO, which is how the 9B Gemma was supercharged recently. https://arxiv.org/abs/2405.14734
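For reference, the SimPO objective itself is small: the implicit reward is the length-normalized average log-probability of a response, with a target margin, so no reference model is needed. A sketch (hyperparameter values here are illustrative, not the paper's tuned ones):

    import torch.nn.functional as F

    def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
                   beta=2.0, gamma=1.0):
        # Implicit reward = beta * average per-token log-prob (length-normalized)
        r_chosen = beta * logp_chosen / len_chosen
        r_rejected = beta * logp_rejected / len_rejected
        # Bradley-Terry style loss with a target reward margin gamma
        return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()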
"In particular, we focus our efforts on knowledge distillation (Hinton et al., 2015), which replaces the one-hot vector seen at each token with the distribution of potential next tokens computed from a large model. [...] Concretely, we use a large language model as a teacher to train small models, namely 2B and 9B models, on a quantity of tokens that is more than 50× the compute-optimal quantity predicted by the theory (Hoffmann et al., 2022)."
Which says that they have already extracted the knowledge from the data with a larger model, and are reusing it for the smaller model. What I meant, applied to this scenario, is that the new models trained with the distillation approach are never going to be better than the model that generated the distribution. Of course you can get better with a change of architecture.
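Concretely, the "distribution instead of one-hot" part of the quote boils down to swapping the usual cross-entropy target for the teacher's softmax output, e.g. (a generic Hinton-style sketch, not Gemma's exact recipe):

    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, temperature=1.0):
        # The teacher's full next-token distribution replaces the one-hot label
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
        student_logp = F.log_softmax(student_logits / temperature, dim=-1)
        # The student is trained to match the teacher distribution (forward KL)
        return F.kl_div(student_logp, teacher_probs, reduction="batchmean")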
So I could rephrase my previous comment as: you cannot extract new information from synthetic data that cannot already be found in the original training data.
But you can use synthetic data to regularize, stabilize performance, transfer knowledge from one dataset/model to another, etc.
Thanks again for your very appreciated references!
1. LLM is trained on everything
2. LLM classifies everything in the training corpus as high / low quality
3. New (or same) LLM (re)trains on only the high-quality documents
I've read that most web data is somewhere between somewhat and absolutely useless, e.g. pages of stock quotes, and it seems easy for something like GPT-3 to classify that. Classifying it would take what... one extra epoch's worth of computation? And it would save much more computation downstream by shrinking the size of the training set.
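Step 2 could be as simple as something like this, where llm is again a hypothetical stand-in for whatever model does the judging (the rating scale and threshold are arbitrary):

    # Sketch of step 2: keep only documents the model rates as high quality.
    def llm(prompt: str) -> str: ...  # placeholder for the judging model

    def is_high_quality(document: str, threshold: int = 4) -> bool:
        reply = llm(
            "Rate the quality of the following document for LLM pretraining "
            "on a scale of 1-5. Reply with just the number.\n\n" + document
        )
        try:
            return int(reply.strip()) >= threshold
        except ValueError:
            return False  # unparseable rating -> drop the document

    # Given an iterable `corpus` of documents:
    # filtered_corpus = [d for d in corpus if is_high_quality(d)]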