
251 points by slyall | 2 comments
kleiba No.42061089
> “Pre-ImageNet, people did not believe in data,” Li said in a September interview at the Computer History Museum. “Everyone was working on completely different paradigms in AI with a tiny bit of data.”

That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.

replies(6): >>42061617 #>>42061818 #>>42061987 #>>42063019 #>>42063076 #>>42064875 #
littlestymaar No.42061987
In 2019, GPT-2 1.5B was trained on ~10B tokens.

Last week Hugging Face released SmolLM v2 1.7B, trained on 11T tokens: three orders of magnitude more training data for roughly the same parameter count and almost the same architecture.

So even in 2019, by today's standards, people were working with a tiny amount of data.
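
A quick sanity check on that ratio, as a sketch (the token counts are the ones cited above):

    import math

    gpt2_tokens    = 10e9   # ~10B tokens (GPT-2, 2019)
    smollm2_tokens = 11e12  # 11T tokens (SmolLM v2, 2024)

    ratio = smollm2_tokens / gpt2_tokens
    print(f"{ratio:.0f}x the data, ~{math.log10(ratio):.1f} orders of magnitude")
    # -> 1100x the data, ~3.0 orders of magnitude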

replies(1): >>42063083 #
1. kleiba No.42063083
True. But my point is that the quote "people didn't believe in data" is not true. Back in 2019, when GPT-2 was trained, the reason they didn't use the trillions of tokens that are standard today was not that they "didn't believe in data" - they totally would have, had it been technically feasible (as in: if they had had that much data plus the necessary compute).

The same has always been true. There has never been a stance along the lines of "ah, let's not collect more data - it's not worth it!". The limiting factors have always been other things, typically a lack of resources.

replies(1): >>42066238 #
2. littlestymaar No.42066238
> they totally would have, had it been technically feasible

TinyLlama[1] was built by a single individual last year, training a 1.1B model on 3T tokens with just 16 A100-40G GPUs in 90 days. That scale of compute was definitely within reach of any funded org in 2019.
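
As a rough plausibility check, here is the common 6·N·D FLOPs estimate for transformer training applied to that run; the A100 peak-throughput and utilization figures are my assumptions, not numbers from the TinyLlama project:

    # Back-of-envelope: training FLOPs ~= 6 * params * tokens.
    N = 1.1e9   # parameters
    D = 3e12    # training tokens
    total_flops = 6 * N * D  # ~2.0e22 FLOPs

    # Assumptions: A100-40G peak ~312 TFLOPS (bf16), ~40% utilization.
    gpus = 16
    effective_flops_per_sec = gpus * 312e12 * 0.40

    days = total_flops / effective_flops_per_sec / 86400
    print(f"~{days:.0f} days on {gpus} A100s")
    # -> ~115 days, the same ballpark as the 90 days reported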

In 2022, DeepMind released the Chinchilla paper on the compute-optimal amount of data for training a given model; for a 1B model, the optimal value was determined to be 20B tokens, which again is nearly three orders of magnitude below the current state of the art for the same class of model.
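
The paper's headline heuristic works out to roughly 20 training tokens per parameter; a minimal sketch of that rule of thumb (the 20:1 ratio is from the paper, the rest is illustration):

    # Chinchilla rule of thumb: compute-optimal token count ~= 20 * parameter count.
    def chinchilla_tokens(n_params: float) -> float:
        return 20 * n_params

    for n_params in (1e9, 7e9, 70e9):
        print(f"{n_params / 1e9:.0f}B params -> {chinchilla_tokens(n_params) / 1e12:.2f}T tokens")
    # -> 1B -> 0.02T (20B), 7B -> 0.14T (140B), 70B -> 1.40T tokens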

Until very recently (around the first Llama paper, IIRC, when people noticed that the 7B model showed no sign of saturation even during its already very long training), the ML community vastly underestimated the amount of training data needed to make an LLM perform to its potential.

[1]: https://github.com/jzhang38/TinyLlama