That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.
That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.
I started with ML in 1994, I was in a small poor lab - so we didn't have state of the art hardware. On the other hand I think my experience is fairly representative. We worked with data sets on spark workstations that were stored in flat files and had thousands or sometimes tens of thousands of instances. We had problems keeping our data sets on the machines and often archived them to tape.
Data came from very deliberate acquisition processes. For example I remember going to a field exercise with a particular device and directing it's use over a period of days in order to collect the data that would be needed for a machine learning project.
Sometime in the 2000's data started to be generated and collected as "exhaust" from various processes. People and organisations became instrumented in the sense that their daily activities were necessarily captured digitally. For a time this data was latent, people didn't really think about using it in the way that we think about it now, but by about 2010 it was obvious that not only was this data available but we had the processing and data systems to use it effectively.