We do this in fine-tuning all the time: see reverse prompting, etc.
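To illustrate the reverse-prompting idea: given an existing good document or answer, you ask a model to invent the instruction that would have produced it, and the resulting pairs become fine-tuning data. A rough sketch, where llm() is just a placeholder for whatever generation call you use:

    # Rough sketch of reverse prompting / instruction backtranslation.
    # `llm` is a hypothetical stand-in for any text-generation call.
    def llm(prompt: str) -> str: ...

    def reverse_prompt(document: str) -> dict:
        instruction = llm(
            "Write the instruction a user could have given "
            "to produce the following text:\n\n" + document
        )
        # (instruction, document) becomes a synthetic fine-tuning pair
        return {"instruction": instruction, "response": document}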
You can create inputs for DPO/ORPO synthetically, which is a huge win, since previously this would require gigantic investments: https://arxiv.org/abs/2402.10379
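A minimal sketch of how such synthetic preference pairs can be built (llm and judge_score are hypothetical stand-ins for a generator and an LLM-as-judge scorer; the prompt/chosen/rejected layout is the common one used by DPO-style trainers, not something taken from the paper):

    # Sketch: build preference pairs without human annotators.
    def llm(prompt: str) -> str: ...                       # placeholder generator
    def judge_score(prompt: str, answer: str) -> float: ...  # placeholder judge

    def make_preference_pair(prompt: str) -> dict:
        a, b = llm(prompt), llm(prompt)
        if judge_score(prompt, a) >= judge_score(prompt, b):
            chosen, rejected = a, b
        else:
            chosen, rejected = b, a
        return {"prompt": prompt, "chosen": chosen, "rejected": rejected}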
There's also the Gemma 2 paper, which has advanced the SOTA in distillation. On a side note, there are many reasons for that (vocab_size and the good 9B/27B sizes among them), but IMHO it's currently the best model for e.g. Ukrainian. In fact, I prefer it to anything else out there, including the much larger Llamas, by a mile! The model is a triumph of synthetic datasets. https://arxiv.org/abs/2408.00118
Also see the Princeton paper on SimPO, which is how the 9B Gemma was supercharged recently. https://arxiv.org/abs/2405.14734
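For reference, the SimPO objective itself is small: the implicit reward is the length-normalized average log-probability of a response, with a target margin, so no reference model is needed. A sketch (hyperparameter values here are illustrative, not the paper's tuned ones):

    import torch.nn.functional as F

    def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
                   beta=2.0, gamma=1.0):
        # Implicit reward = beta * average per-token log-prob (length-normalized)
        r_chosen = beta * logp_chosen / len_chosen
        r_rejected = beta * logp_rejected / len_rejected
        # Bradley-Terry style loss with a target reward margin gamma
        return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()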
"In particular, we focus our efforts on knowledge distillation (Hinton et al., 2015), which replaces the one-hot vector seen at each token with the distribution of potential next tokens computed from a large model. [...] Concretely, we use a large language model as a teacher to train small models, namely 2B and 9B models, on a quantity of tokens that is more than 50× the compute-optimal quantity predicted by the theory (Hoffmann et al., 2022)."
Which says that they have already extracted the knowledge from the data with a larger model, and are reusing it for the smaller model. What I meant, applied to this scenario, is that the new models trained with the distillation approach are never going to be better than the model that generated the distribution. Of course you can get better with a change of architecture.
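Concretely, the "distribution instead of one-hot" part of the quote boils down to swapping the usual cross-entropy target for the teacher's softmax output, e.g. (a generic Hinton-style sketch, not Gemma's exact recipe):

    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, temperature=1.0):
        # The teacher's full next-token distribution replaces the one-hot label
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
        student_logp = F.log_softmax(student_logits / temperature, dim=-1)
        # The student is trained to match the teacher distribution (forward KL)
        return F.kl_div(student_logp, teacher_probs, reduction="batchmean")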
So I could rephrase my previous comment as: you cannot extract new information from synthetic data that cannot already be found in the original training data.
But you can use synthetic data to regularize, stabilize performance, transfer knowledge from one dataset/model to another, etc.
Thanks again for your very appreciated references!
1. LLM is trained on everything
2. LLM classifies everything in the training corpus as high / low quality
3. New (or same) LLM (re)trains on only the high-quality documents
I've read that most web data is somewhere between somewhat and absolutely useless, e.g. pages of stock quotes, and it seems easy for something like GPT-3 to classify that. Classifying it would take what... one extra epoch's worth of computation? And it would save much more computation downstream by shrinking the size of the training set.
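Step 2 could be as simple as something like this, where llm is again a hypothetical stand-in for whatever model does the judging (the rating scale and threshold are arbitrary):

    # Sketch of step 2: keep only documents the model rates as high quality.
    def llm(prompt: str) -> str: ...  # placeholder for the judging model

    def is_high_quality(document: str, threshold: int = 4) -> bool:
        reply = llm(
            "Rate the quality of the following document for LLM pretraining "
            "on a scale of 1-5. Reply with just the number.\n\n" + document
        )
        try:
            return int(reply.strip()) >= threshold
        except ValueError:
            return False  # unparseable rating -> drop the document

    # Given an iterable `corpus` of documents:
    # filtered_corpus = [d for d in corpus if is_high_quality(d)]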