We do this in fine-tuning all the time: see reverse prompting, etc.
You can create inputs for DPO/ORPO synthetically, which is a huge one: previously this would have required gigantic investments. https://arxiv.org/abs/2402.10379
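A minimal sketch of what synthetic preference-pair generation for DPO can look like: sample two completions at different temperatures, let a judge model pick the winner, and record a (chosen, rejected) pair. `generate` and `judge` here are hypothetical stand-ins (canned strings and a toy length heuristic), not any real API; a real pipeline would swap in calls to your policy and judge models.

```python
def generate(prompt: str, temperature: float) -> str:
    # Stand-in for sampling a completion from the policy model.
    canned = {
        0.2: f"A concise, correct answer to: {prompt}",
        1.0: f"A rambling, lower-quality answer to: {prompt}",
    }
    return canned[temperature]

def judge(prompt: str, a: str, b: str) -> str:
    # Stand-in for an LLM judge ranking the two completions.
    # Toy proxy: prefer the shorter answer.
    return a if len(a) <= len(b) else b

def make_preference_pair(prompt: str) -> dict:
    # Sample two completions, have the judge rank them, and emit a
    # DPO-style record with explicit chosen/rejected fields.
    a = generate(prompt, temperature=0.2)
    b = generate(prompt, temperature=1.0)
    chosen = judge(prompt, a, b)
    rejected = b if chosen is a else a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = make_preference_pair("Explain gradient checkpointing.")
```

The point is that every field of the preference dataset is produced by models, not human annotators, which is what collapses the cost.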
There's also the Gemma 2 paper, which advanced the SOTA in distillation. On a side note, there are many reasons for its quality, but between the vocab size and the well-chosen 9B/27B sizes, IMHO it's currently the best model for, e.g., Ukrainian. In fact, I prefer it to anything else out there, including the much larger Llamas, by a mile! The model is a triumph of synthetic datasets. https://arxiv.org/abs/2408.00118
Also see the Princeton paper on SimPO, which is how they supercharged the 9B Gemma recently. https://arxiv.org/abs/2405.14734
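For context, the core of SimPO (as I understand the paper) is a reference-model-free objective: the implicit reward is the length-normalized log-probability of a response, and the loss pushes the chosen response above the rejected one by a target margin. A toy per-pair version, with illustrative `beta`/`gamma` values that are assumptions, not the paper's tuned settings:

```python
import math

def simpo_loss(logp_chosen: float, len_chosen: int,
               logp_rejected: float, len_rejected: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    # Length-normalized implicit rewards; no reference model needed.
    r_w = beta * logp_chosen / len_chosen
    r_l = beta * logp_rejected / len_rejected
    # Bradley-Terry style loss with target margin gamma:
    # -log sigmoid(r_w - r_l - gamma)
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l - gamma))))

# Loss is small when the chosen response clearly beats the rejected one...
loss_good = simpo_loss(-10.0, 10, -30.0, 10)
# ...and large when the ranking is inverted.
loss_bad = simpo_loss(-30.0, 10, -10.0, 10)
```

The length normalization is the interesting design choice: it removes the usual bias toward longer responses that raw sequence log-probs introduce.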
1. An LLM is trained on everything.
2. The LLM classifies everything in the training corpus as high or low quality.
3. A new (or the same) LLM (re)trains on only the high-quality documents.
I've read that most web data ranges from somewhat to absolutely useless, e.g. pages of stock quotes. It seems easy for something like GPT-3 to classify that, and classifying it would take what, one extra epoch's worth of computation? It would then save much more computation downstream by shrinking the size of the training set.