We do this in fine-tuning all the time: see reverse prompting, etc.
You can create inputs for DPO/ORPO synthetically, which is a huge one: previously this would have required gigantic investments. https://arxiv.org/abs/2402.10379
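A minimal sketch of what synthetic preference-pair generation for DPO can look like: sample two completions at different temperatures, let a judge model pick the winner, and record a (chosen, rejected) pair. `generate` and `judge` here are hypothetical stand-ins (canned strings and a toy length heuristic), not any real API; a real pipeline would swap in calls to your policy and judge models.

```python
def generate(prompt: str, temperature: float) -> str:
    # Stand-in for sampling a completion from the policy model.
    canned = {
        0.2: f"A concise, correct answer to: {prompt}",
        1.0: f"A rambling, lower-quality answer to: {prompt}",
    }
    return canned[temperature]

def judge(prompt: str, a: str, b: str) -> str:
    # Stand-in for an LLM judge ranking the two completions.
    # Toy proxy: prefer the shorter answer.
    return a if len(a) <= len(b) else b

def make_preference_pair(prompt: str) -> dict:
    # Sample two completions, have the judge rank them, and emit a
    # DPO-style record with explicit chosen/rejected fields.
    a = generate(prompt, temperature=0.2)
    b = generate(prompt, temperature=1.0)
    chosen = judge(prompt, a, b)
    rejected = b if chosen is a else a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = make_preference_pair("Explain gradient checkpointing.")
```

The point is that every field of the preference dataset is produced by models, not human annotators, which is what collapses the cost.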
There's also the Gemma 2 paper, which advanced the SOTA in distillation. On a side note, there are many reasons for its quality, but between the vocab size and the well-chosen 9B/27B sizes, IMHO it's currently the best model for, e.g., Ukrainian. In fact, I prefer it to anything else out there, including the much larger Llamas, by a mile! The model is a triumph of synthetic datasets. https://arxiv.org/abs/2408.00118
Also see the Princeton paper on SimPO, which is how they supercharged the 9B Gemma recently. https://arxiv.org/abs/2405.14734
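For context, the core of SimPO (as I understand the paper) is a reference-model-free objective: the implicit reward is the length-normalized log-probability of a response, and the loss pushes the chosen response above the rejected one by a target margin. A toy per-pair version, with illustrative `beta`/`gamma` values that are assumptions, not the paper's tuned settings:

```python
import math

def simpo_loss(logp_chosen: float, len_chosen: int,
               logp_rejected: float, len_rejected: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    # Length-normalized implicit rewards; no reference model needed.
    r_w = beta * logp_chosen / len_chosen
    r_l = beta * logp_rejected / len_rejected
    # Bradley-Terry style loss with target margin gamma:
    # -log sigmoid(r_w - r_l - gamma)
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l - gamma))))

# Loss is small when the chosen response clearly beats the rejected one...
loss_good = simpo_loss(-10.0, 10, -30.0, 10)
# ...and large when the ranking is inverted.
loss_bad = simpo_loss(-30.0, 10, -10.0, 10)
```

The length normalization is the interesting design choice: it removes the usual bias toward longer responses that raw sequence log-probs introduce.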
1. An LLM is trained on everything.
2. The LLM classifies everything in the training corpus as high or low quality.
3. A new (or the same) LLM (re)trains on only the high-quality documents.
I've read that most web data ranges from somewhat to absolutely useless, e.g. pages of stock quotes. It seems easy for something like GPT-3 to classify that, and classifying it would take what, one extra epoch's worth of computation? It would then save much more computation downstream by shrinking the size of the training set.