That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.
Most methods research went into ways of building beliefs about a domain into models as biases, so that they could be more accurate in practice with less data. (This describes a lot of PGM work.) This was partly because there was still a tug of war between the CS and traditional statistics communities over ML, and the latter were trained to be obsessive about model specification.
One result was that the models that were practical for production inference were often trained to the point of diminishing returns on their specific tasks. Engineers deploying ML weren't wishing for more training instances, but better data at inference time. Models that could perform more general tasks -- like differentiating 90k object classes rather than just a few -- were barely even on most people's radar.
Perhaps folks at Google or FB at the time have a different perspective. One of the reasons I went ABD in my program was that it felt like industry had access to richer data streams than academia. Fei-Fei Li's insistence on building an academic computer science career around giant data sets really was ingenious, and even subversive.
Last week Hugging Face released SmolLM v2 1.7B, trained on 11T tokens: 3 orders of magnitude more training data for the same number of parameters, with almost the same architecture.
So even back in 2019 we can say we were working with a tiny amount of data compared to what is routine now.
I definitely remember that bias about neural nets, to the point of my first grad ML class having us recreate proofs that you should never need more than two hidden layers (one can pick up the thread at [1]). Of all the ideas clunking around in the AI toolbox at the time, I don't really have background on why people felt the need to kill NN with fire.
[1] https://en.wikipedia.org/wiki/Universal_approximation_theore...
The same has always been true. There has never been a stance along the lines of "ah, let's not collect more data - it's not worth it!". It's always been other reasons, typically the lack of resources.
I started with ML in 1994. I was in a small, poor lab, so we didn't have state-of-the-art hardware. On the other hand, I think my experience is fairly representative. We worked on SPARC workstations with data sets that were stored in flat files and had thousands or sometimes tens of thousands of instances. We had problems keeping our data sets on the machines and often archived them to tape.
Data came from very deliberate acquisition processes. For example, I remember going to a field exercise with a particular device and directing its use over a period of days in order to collect the data that would be needed for a machine learning project.
Sometime in the 2000s, data started to be generated and collected as "exhaust" from various processes. People and organisations became instrumented in the sense that their daily activities were necessarily captured digitally. For a time this data was latent; people didn't really think about using it in the way that we think about it now. But by about 2010 it was obvious that not only was this data available, but we also had the processing and data systems to use it effectively.
My storage hierarchy goes: 1) one storage drive; 2) one server maxed out with the biggest storage drives available; 3) one rack filled with servers from 2; 4) one data center filled with racks from 3.
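As a rough sketch, that hierarchy can be written as a lookup from data set size to tier. Every capacity below (drive size, drives per server, servers per rack) is a hypothetical placeholder, and the `storage_tier` name is made up for illustration; substitute your own hardware numbers.

```python
# All capacities are hypothetical placeholders, not figures from this thread.
DRIVE_TB = 20            # assumed size of the biggest available drive, in TB
DRIVES_PER_SERVER = 24   # assumed number of drive bays in one server
SERVERS_PER_RACK = 20    # assumed number of such servers in one rack

# Tiers in increasing order of capacity (TB).
TIERS = [
    ("single drive", DRIVE_TB),
    ("single server", DRIVE_TB * DRIVES_PER_SERVER),
    ("single rack", DRIVE_TB * DRIVES_PER_SERVER * SERVERS_PER_RACK),
]


def storage_tier(dataset_tb: float) -> str:
    """Return the smallest tier whose capacity can hold the data set."""
    for name, capacity_tb in TIERS:
        if dataset_tb <= capacity_tb:
            return name
    # Anything larger than one rack falls through to the top tier.
    return "data center"


for size_tb in (10, 100, 5_000, 50_000):
    print(f"{size_tb} TB -> {storage_tier(size_tb)}")
```

The point of the tiers is not the exact capacities but that each boundary crossed tends to force a different infrastructure design.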
My only surprise is how long it took to get to ImageNet, but in retrospect, I appreciate that a number of conditions had to be met (much more data, much better algorithms, much faster computers). I also didn't recognize just how poor MLPs were at sequence modelling compared to RNNs and transformers.
TinyLlama[1] was made by an individual on their own last year, training a 1.1B model on 3T tokens with just 16 A100-40G GPUs in 90 days. It was definitely within reach of any funded org in 2019.
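For a sense of scale, here is a back-of-envelope sketch that uses only the figures quoted above (3T tokens, 16 GPUs, 90 days). It assumes the GPUs ran continuously for the whole period, which is an idealization, and the variable names are mine.

```python
# Figures quoted in the comment above.
TOKENS = 3e12        # 3T training tokens
NUM_GPUS = 16        # A100-40G GPUs
DAYS = 90            # wall-clock training time

SECONDS_PER_DAY = 24 * 60 * 60

# Total budget in GPU-days and GPU-seconds, assuming no downtime.
gpu_days = NUM_GPUS * DAYS
gpu_seconds = gpu_days * SECONDS_PER_DAY

# Average throughput each GPU must sustain to finish in time.
tokens_per_gpu_second = TOKENS / gpu_seconds

print(f"{gpu_days} GPU-days")
print(f"about {tokens_per_gpu_second:,.0f} tokens per second per GPU")
```

That works out to 1,440 GPU-days, or roughly 24k tokens per second per GPU sustained over the run.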
In 2022 (IIRC), Google released the Chinchilla paper about the compute-optimal amount of data to train a given model. For a 1B model, the value was determined to be 20B tokens, which again is 3 orders of magnitude below the current state of the art for the same class of model.
Until very recently (the first LLaMA paper IIRC, when people noticed that the 7B model showed no sign of saturation during its already very long training), the ML community vastly underestimated the amount of training data that was needed to make an LLM perform at its potential.
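To make the arithmetic concrete, here is a rough sketch. The 20-tokens-per-parameter ratio is simply inferred from the figures quoted above (20B tokens for a 1B model), the 11T figure is the SmolLM v2 number mentioned elsewhere in this thread, and the function names are made up for illustration.

```python
import math

# Assumption: ratio inferred from the figures in this thread
# (20B tokens for a 1B-parameter model).
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: float) -> float:
    """Rule-of-thumb compute-optimal token budget for n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def orders_of_magnitude(actual: float, reference: float) -> float:
    """Base-10 log of how many times larger `actual` is than `reference`."""
    return math.log10(actual / reference)


optimal_1b = compute_optimal_tokens(1e9)   # 20B tokens for a 1B model
smollm_tokens = 11e12                      # 11T tokens, as quoted for SmolLM v2 1.7B
gap = orders_of_magnitude(smollm_tokens, optimal_1b)

print(f"compute-optimal budget: {optimal_1b:.0e} tokens")
print(f"gap to 11T tokens: about {gap:.1f} orders of magnitude")
```

Strictly, the ratio is 550x, so "3 orders of magnitude" is a slight round-up, but the point stands.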
It's a terrible measurement because it's an irrelevant detail about how the data is stored. And if your data is stored in a proprietary cloud, no one actually knows that detail except the people who work there on that team.
So while someone could say they used a 10 TiB data set, or 10T parameters, how many "racks" of AWS S3 that is, is not known outside of Amazon.
I haven't invested the time to take the loss function from our paper and implement it in a modern framework, but IIUC, I wouldn't need to provide the derivatives manually. That would be a satisfying outcome (indicating I had wasted a lot of effort learning math that simply wasn't necessary, because somebody had automated it better than I could do manually, in a way I can understand more easily).
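A toy illustration of the idea, in plain Python. This is not the loss from the paper (which isn't reproduced here) and not any real framework's API: it is a minimal forward-mode automatic differentiation sketch using dual numbers, whereas modern frameworks typically use reverse mode and support far more operations. The `Dual` class, `derivative` helper, and `squared_error` loss are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A number carrying its value and its derivative w.r.t. one input."""
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return _lift(other) - self

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def derivative(f, x):
    """Evaluate df/dx at x by seeding the input's derivative with 1."""
    return f(Dual(float(x), 1.0)).deriv


def squared_error(w):
    # Hypothetical loss: (3w - 6)^2, written with no hand-derived gradient.
    residual = 3 * w - 6
    return residual * residual


grad_at_1 = derivative(squared_error, 1.0)
print(grad_at_1)  # analytic derivative is 6 * (3w - 6), so -18.0 at w = 1
```

The loss is written as ordinary arithmetic; the derivative falls out of the overloaded operators rather than from anything worked out by hand.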
And whether your data can fit on a single server, single rack, or many racks will drastically affect how you design the infrastructure.