The deep learning boom caught almost everyone by surprise

(www.understandingai.org)

1. kleiba ◴[06 Nov 24 12:23 UTC] No.42061089[source]▶

>>42057139 (OP) #

> “Pre-ImageNet, people did not believe in data,” Li said in a September interview at the Computer History Museum. “Everyone was working on completely different paradigms in AI with a tiny bit of data.”

That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.

replies(7): >>42061617 #>>42061818 #>>42061987 #>>42063019 #>>42063076 #>>42064875 #>>42189460 #

2. FrustratedMonky ◴[06 Nov 24 13:01 UTC] No.42061617[source]▶

>>42061089 (TP) #

Not really. This is referring back to the 80's. People weren't even doing 'ML'. And back then people were more focused on teasing out 'laws' in as few data points as possible. The focus was more on formulas and symbols, and finding relationships between individual data points. Not the broad patterns we take for granted today.

replies(2): >>42062250 #>>42063993 #

3. evrydayhustling ◴[06 Nov 24 13:17 UTC] No.42061818[source]▶

>>42061089 (TP) #

Not baloney. The culture around data in 2005-2010 -- at least / especially in academia -- was night and day to where it is today. It's not that people didn't understand that more data enabled richer + more accurate models, but that they accepted data constraints as a part of the problem setup.

Most methods research went into ways of building beliefs about a domain into models as biases, so that they could be more accurate in practice with less data. (This describes a lot of PGM work). This was partly because there was still a tug of war between CS and traditional statistics communities on ML, and the latter were trained to be obsessive about model specification.

One result was that the models that were practical for production inference were often trained to the point of diminishing returns on their specific tasks. Engineers deploying ML weren't wishing for more training instances, but better data at inference time. Models that could perform more general tasks -- like differentiating 90k object classes rather than just a few -- were barely even on most people's radar.

Perhaps folks at Google or FB at the time have a different perspective. One of the reasons I went ABD in my program was that it felt industry had access to richer data streams than academia. Fei-Fei Li's insistence on building an academic computer science career around giant data sets really was ingenious, and even subversive.

replies(3): >>42062715 #>>42063187 #>>42075640 #

4. littlestymaar ◴[06 Nov 24 13:31 UTC] No.42061987[source]▶

>>42061089 (TP) #

In 2019, GPT-2 1.5B was trained on ~10B tokens.

Last week Hugging Face released SmolLM v2 1.7B trained on 11T tokens, 3 orders of magnitude more training data for roughly the same number of parameters with almost the same architecture.

So even back in 2019 we can say we were working with a tiny amount of data compared to what is routine now.
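To make the comparison concrete, here is a minimal sketch that works out tokens per parameter from the figures quoted above. The parameter and token counts are the approximate numbers from this comment, not values checked against the model cards, and the dictionary layout and function name are just for illustration.

```python
import math

# Approximate figures as quoted in the comment above (assumptions, not verified).
models = {
    "GPT-2 1.5B (2019)": {"params": 1.5e9, "tokens": 10e9},
    "SmolLM v2 1.7B (2024)": {"params": 1.7e9, "tokens": 11e12},
}


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


for name, m in models.items():
    print(f"{name}: {tokens_per_param(**m):,.1f} tokens per parameter")

# How much more training data the newer model saw, as a ratio and in orders of magnitude.
data_ratio = models["SmolLM v2 1.7B (2024)"]["tokens"] / models["GPT-2 1.5B (2019)"]["tokens"]
orders_of_magnitude = math.log10(data_ratio)
print(f"data ratio: {data_ratio:,.0f}x (~{orders_of_magnitude:.1f} orders of magnitude)")
```

With these inputs the parameter counts differ by about 13%, while the token counts differ by a factor of roughly a thousand.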

replies(1): >>42063083 #

5. criddell ◴[06 Nov 24 13:50 UTC] No.42062250[source]▶

>>42061617 #

I would say using backpropagation to train multi-layer neural networks would qualify as ML and we were definitely doing that in the 80's.

replies(1): >>42062594 #

6. UltraSane ◴[06 Nov 24 14:14 UTC] No.42062594{3}[source]▶

>>42062250 #

Just with tiny amounts of data.

replies(1): >>42062627 #

7. jensgk ◴[06 Nov 24 14:17 UTC] No.42062627{4}[source]▶

>>42062594 #

Compared to today. We thought we used large amounts of data at the time.

replies(1): >>42062803 #

8. bsenftner ◴[06 Nov 24 14:24 UTC] No.42062715[source]▶

>>42061818 #

The culture was and is skeptical in biased manners. Between '04 and '08 I worked with a group that had trained neural nets for 3D reconstruction of human heads. They were using it for prenatal diagnostics and a facial recognition pre-processor, and I was using it for creating digital doubles in VFX film making. By '08 I'd developed a system suitable for use in mobile advertising, creating ads with people in them, and 3D games with your likeness as the player. VCs thought we were frauds, and their tech advisors told them our tech was an old discredited technique that could not do what we claimed. We spoke to every VC, some of whom literally kicked us out. Finally, after years of "no" that same AlexNet success begins to change minds, but now they want the tech to create porn. At that point, after years of "no" I was making children's educational media, there was no way I was gonna do porn. Plus, president of my co was a woman, famous for creating children's media. Yeah, the culture was different then, not too long ago.

replies(2): >>42062832 #>>42066509 #

9. UltraSane ◴[06 Nov 24 14:29 UTC] No.42062803{5}[source]▶

>>42062627 #

"We thought we used large amounts of data at the time."

Really? Did it take at least an entire rack to store?

replies(1): >>42063257 #

10. evrydayhustling ◴[06 Nov 24 14:31 UTC] No.42062832{3}[source]▶

>>42062715 #

Wow, so early for generative -- although I assume you were generating parameters that got mapped to mesh positions, rather than generating pixels?

I definitely remember that bias about neural nets, to the point of my first grad ML class having us recreate proofs that you should never need more than two hidden layers (one can pick up the thread at [1]). Of all the ideas clunking around in the AI toolbox at the time, I don't really have background on why people felt the need to kill NN with fire.

[1] https://en.wikipedia.org/wiki/Universal_approximation_theore...

replies(1): >>42064437 #

11. kleiba ◴[06 Nov 24 14:43 UTC] No.42063019[source]▶

>>42061089 (TP) #

Answering to people arguing against my comment: you guys do not seem to take into account that the technical circumstances were totally different thirty, twenty or even ten years ago! People would have liked to train with more data, and there was a big interest in combining heterogeneous datasets to achieve exactly that. But one major problem was the compute! There weren't any pretrained models that you specialized in one way or the other - you always retrained from scratch. I mean, even today, who's got the capability to train a multibillion GPT from scratch? And I don't mean retraining a tried and trusted architecture+dataset once, no, I mean as a research project trying to optimize your setup towards a certain goal.

12. kccqzy ◴[06 Nov 24 14:46 UTC] No.42063076[source]▶

>>42061089 (TP) #

Pre-ImageNet was like pre-2010. Doing ML with massive data really wasn't in vogue back then.

replies(1): >>42064389 #

13. kleiba ◴[06 Nov 24 14:46 UTC] No.42063083[source]▶

>>42061987 #

True. But my point is that the quote "people didn't believe in data" is not true. Back in 2019, when GPT-2 was trained, the reason they didn't use the 3T of today was not because they "didn't believe in data" - they totally would have had it been technically feasible (as in: they had that much data + the necessary compute).

The same has always been true. There has never been a stance along the lines of "ah, let's not collect more data - it's not worth it!". It's always been other reasons, typically the lack of resources.

replies(1): >>42066238 #

14. tucnak ◴[06 Nov 24 14:52 UTC] No.42063187[source]▶

>>42061818 #

> they accepted data constraints as a part of the problem setup.

I've never heard this be put so succinctly! Thank you

15. jensgk ◴[06 Nov 24 14:56 UTC] No.42063257{6}[source]▶

>>42062803 #

We didn't measure data size that way. At some point in the future someone would find this dialog, and think that we don't have large amounts of data now, because we are not using entire solar systems for storage.

replies(1): >>42065235 #

16. mistrial9 ◴[06 Nov 24 15:42 UTC] No.42063993[source]▶

>>42061617 #

mid-90s had neural nets, even a few popular science kinds of books on it. The common hardware was so much less capable then.

replies(2): >>42064954 #>>42074497 #

17. mistrial9 ◴[06 Nov 24 16:06 UTC] No.42064389[source]▶

>>42063076 #

except in Ivory Towers of Google + Facebook

replies(1): >>42066977 #

18. bsenftner ◴[06 Nov 24 16:10 UTC] No.42064437{4}[source]▶

>>42062832 #

It was annotated face images and 3D scans of heads trained to map one to the other. After a threshold in the size of the training data, good to great results from a single photo could be had to generate the mesh 3D positions, and then again to map the photo onto the mesh surface. Do that with multiple frames, and one is firmly in the Uncanny Valley.

19. sgt101 ◴[06 Nov 24 16:37 UTC] No.42064875[source]▶

>>42061089 (TP) #

It's not quite so - we couldn't handle it, and we didn't have it, so it was a bit of a non-question.

I started with ML in 1994, I was in a small poor lab - so we didn't have state of the art hardware. On the other hand I think my experience is fairly representative. We worked with data sets on SPARC workstations that were stored in flat files and had thousands or sometimes tens of thousands of instances. We had problems keeping our data sets on the machines and often archived them to tape.

Data came from very deliberate acquisition processes. For example I remember going to a field exercise with a particular device and directing its use over a period of days in order to collect the data that would be needed for a machine learning project.

Sometime in the 2000's data started to be generated and collected as "exhaust" from various processes. People and organisations became instrumented in the sense that their daily activities were necessarily captured digitally. For a time this data was latent, people didn't really think about using it in the way that we think about it now, but by about 2010 it was obvious that not only was this data available but we had the processing and data systems to use it effectively.

20. sgt101 ◴[06 Nov 24 16:40 UTC] No.42064954{3}[source]▶

>>42063993 #

mid-60's had neural nets.

mid-90's had LeCun telling everyone that big neural nets were the future.

replies(1): >>42065537 #

21. UltraSane ◴[06 Nov 24 16:57 UTC] No.42065235{7}[source]▶

>>42063257 #

Why can't you use a rack as a unit of storage at the time? Were 19" server racks not in common use yet? The storage capacity of a rack will grow over time.

my storage hierarchy goes 1) 1 storage drive 2) 1 server maxed out with the biggest storage drives available 3) 1 rack filled with servers from 2 4) 1 data center filled with racks from 3

replies(1): >>42066284 #

22. dekhn ◴[06 Nov 24 17:14 UTC] No.42065537{4}[source]▶

>>42064954 #

Mid 90s I was working on neural nets and other machine learning, based on gradient descent, with manually computed derivatives, on genomic data (from what I can recall, we had no awareness of LeCun; I didn't find out about his great OCR results until much later). It worked fine and it seemed like a promising area.

My only surprise is how long it took to get to imagenet, but in retrospect, I appreciate that a number of conditions had to be met (much more data, much better algorithms, much faster computers). I also didn't recognize just how poor MLPs were for sequence modelling, compared to RNNs and transformers.

replies(1): >>42069033 #

23. littlestymaar ◴[06 Nov 24 17:52 UTC] No.42066238{3}[source]▶

>>42063083 #

> they totally would have had it been technically feasible

TinyLlama[1] has been made by an individual on their own last year, training a 1.1B model on 3T tokens with just 16 A100-40G GPUs in 90 days. It was definitely within reach of any funded org in 2019.

In 2022 (IIRC), Google released the Chinchilla paper about the compute-optimal amount of data to train a given model. For a 1B model, the value was determined to be 20B tokens, which again is 3 orders of magnitude below the current state of the art for the same class of model.

Until very recently (the first llama paper IIRC, and people noticing that the 7B model showed no sign of saturation during its already very long training) the ML community vastly underestimated the amount of training data that was needed to make an LLM perform at its potential.

[1]: https://github.com/jzhang38/TinyLlama
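The arithmetic behind these claims can be sketched as follows. The 20-tokens-per-parameter constant is simply the ratio implied by "1B model, 20B tokens" above, the TinyLlama numbers are the ones quoted in this comment, and the 11T figure is the SmolLM v2 number from earlier in the thread. All function and variable names are made up for illustration.

```python
# Ratio implied by the comment: a 1B-parameter model paired with 20B tokens.
CHINCHILLA_TOKENS_PER_PARAM = 20


def chinchilla_tokens(params):
    """Compute-optimal token count under the 20-tokens-per-parameter rule of thumb."""
    return CHINCHILLA_TOKENS_PER_PARAM * params


def gpu_hours(num_gpus, days):
    """Total GPU-hours for a run that keeps all GPUs busy around the clock."""
    return num_gpus * days * 24


# TinyLlama as described above: 1.1B parameters, 3T tokens, 16 GPUs for 90 days.
tinyllama_gpu_hours = gpu_hours(16, 90)
tinyllama_overshoot = 3e12 / chinchilla_tokens(1.1e9)

# SmolLM v2 token count (from earlier in the thread) against the 1B-model optimum.
smollm_overshoot = 11e12 / chinchilla_tokens(1e9)

print(f"TinyLlama: {tinyllama_gpu_hours:,} GPU-hours")
print(f"TinyLlama tokens vs. Chinchilla-optimal: {tinyllama_overshoot:.0f}x")
print(f"11T tokens vs. Chinchilla-optimal for 1B params: {smollm_overshoot:.0f}x")
```

Under these assumptions the 11T-token run is a few hundred times past the Chinchilla-optimal point, which is a bit under three full orders of magnitude.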

24. fragmede ◴[06 Nov 24 17:54 UTC] No.42066284{8}[source]▶

>>42065235 #

How big is a rack in VW beetles though?

It's a terrible measurement because it's an irrelevant detail about how the data is stored. If your data is being stored in a proprietary cloud, no one actually knows that detail except for the people that work there on that team.

So while someone could say they used a 10 TiB data set, or 10T parameters, how many "racks" of AWS S3 that is, is not known outside of Amazon.

replies(1): >>42072934 #

25. philipkglass ◴[06 Nov 24 18:06 UTC] No.42066509{3}[source]▶

>>42062715 #

Who's offering VC money for neural network porn technology? As far as I can tell, there is huge organic demand for this but prospective users are mostly cheapskates and the area is rife with reputational problems, app store barriers, payment processor barriers, and regulatory barriers. In practice I have only ever seen investors scared off by hints that a technology/platform would be well matched to adult entertainment.

26. disgruntledphd2 ◴[06 Nov 24 18:31 UTC] No.42066977{3}[source]▶

>>42064389 #

Even then maybe Google but probably not Facebook. Ads used ML but there wasn't that much of it in feed. Like, there were a bunch of CV projects that I saw in 2013 that didn't use NNs. Three years later, otoh you couldn't find a devserver without tripping over an NN along the way.

27. sgt101 ◴[06 Nov 24 20:42 UTC] No.42069033{5}[source]▶

>>42065537 #

I'm so out of things ! What do you mean manually computed derivatives?

replies(2): >>42071400 #>>42072510 #

28. mistrial9 ◴[06 Nov 24 23:46 UTC] No.42071400{6}[source]▶

>>42069033 #

it means that code has to read values from each layer and do some summarizing math, instead of passing layer blocks to a graphics card in one primitive operation implemented on the card.

replies(1): >>42072523 #

29. dekhn ◴[07 Nov 24 02:05 UTC] No.42072510{6}[source]▶

>>42069033 #

I mean we didn't know autodifferentiation was a thing, so we (my advisor, not me) analytically solved our loss function for its partial derivatives. After I wrote up my thesis, I spent a lot of time learning Mathematica and advanced calculus.

I haven't invested the time to take the loss function from our paper and implement in a modern framework, but IIUC, I wouldn't need to provide the derivatives manually. That would be a satisfying outcome (indicating I had wasted a lot of effort learning math that simply wasn't necessary, because somebody had automated it better than I could do manually, in a way I can understand more easily).
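For readers who haven't seen the difference, here is a minimal sketch of automatic differentiation next to a hand-derived gradient. It uses forward-mode dual numbers in plain Python because that fits in a few lines; modern frameworks typically use reverse mode, so this is not how they work internally. The `loss` function is a made-up toy, not the loss from the paper mentioned above, and `Dual`, `exp`, and `grad` are illustrative names.

```python
import math


class Dual:
    """A value plus its derivative, propagated together through arithmetic."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    @staticmethod
    def lift(x):
        # Plain numbers are constants, so their derivative is zero.
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.val - other.val, self.der - other.der)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        # Product rule: (uv)' = u v' + u' v
        other = Dual.lift(other)
        return Dual(self.val * other.val,
                    self.val * other.der + self.der * other.val)

    __rmul__ = __mul__


def exp(x):
    """exp with the chain rule applied to the derivative part."""
    x = Dual.lift(x)
    e = math.exp(x.val)
    return Dual(e, e * x.der)


def loss(w, x=1.5, y=0.25):
    """Hypothetical toy loss: (exp(w * x) - y) ** 2."""
    residual = exp(w * x) - y
    return residual * residual


def loss_grad_by_hand(w, x=1.5, y=0.25):
    """The same derivative worked out on paper: 2 (e^{wx} - y) x e^{wx}."""
    return 2.0 * (math.exp(w * x) - y) * x * math.exp(w * x)


def grad(f):
    """Return a function computing df/dw by seeding the input derivative with 1."""
    def df(w):
        return f(Dual(w, 1.0)).der
    return df


auto = grad(loss)(0.3)
manual = loss_grad_by_hand(0.3)
print(auto, manual)
```

The point of the sketch is that only `loss` has to be written; `loss_grad_by_hand` exists purely to check the result, and it is the part that stops scaling once the loss gets complicated.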

replies(1): >>42078092 #

30. dekhn ◴[07 Nov 24 02:06 UTC] No.42072523{7}[source]▶

>>42071400 #

No. I should have said "determined the partial derivatives of the weights with respect to the variables analytically". We didn't have layers- the whole architecture was a truly crazy combination of dynamic programming with multiple different matrices and a loss function that combined many different types of evidence. AFAICT nobody does any of this any more for finding genes. We just take enormous amounts of genetic data and run an autoencoder or a sequence model over it.

31. UltraSane ◴[07 Nov 24 03:06 UTC] No.42072934{9}[source]▶

>>42066284 #

A 42U 19-inch rack is an industry standard. If you actually work on the physical infrastructure of data centers it is most CERTAINLY NOT an irrelevant detail.

And whether your data can fit on a single server, single rack, or many racks will drastically affect how you design the infrastructure.

replies(1): >>42082900 #

32. robotresearcher ◴[07 Nov 24 07:48 UTC] No.42074497{3}[source]▶

>>42063993 #

I worked on robot control with NNs in the early-mid nineties. Maybe seven neurons and 25 edges. No layers at all. The graph and edge weights determined by a genetic algorithm. Fun.

33. rramadass ◴[07 Nov 24 11:10 UTC] No.42075640[source]▶

>>42061818 #

Very well said !

34. telotortium ◴[07 Nov 24 16:27 UTC] No.42078092{7}[source]▶

>>42072510 #

I can't express the extent to which autodifferentiation was like a revelation to me. I don't work in ML, but in grad school around 2010 I was implementing density functional theory computations in a code that was written in Fortran 77. My particular optimization needs required computing to second derivatives. I had Mathematica to actually calculate the derivatives, but even just the step of mechanically translating the computed derivatives into Fortran 77 code would be a week of tedious work. Worse was rewriting these derivative expressions for numerical stability. The worst was realizing you made a mistake in an expression high in the tree and having to rewrite everything below. The whole process took months for a single model, and that's with chain rule depth that probably could be counted on one hand. I can't imagine deep learning making the kind of progress it has without autodifferentiation - the only saving grace is that neural networks tend to be composed from a large number of copies of identical functions, and you only need to go to first derivatives.

35. fragmede ◴[08 Nov 24 00:45 UTC] No.42082900{10}[source]▶

>>42072934 #

A standard so standard you had to give two of the dimensions so as not to confuse it with something else? Like a 48 U tall data center rack, or a 23" wide telco rack?

Okay, so it is relatively standard these days, but the problem is you can change how many "U" or racks you need for the same amount of storage based on how you want to arrange it, for a given use case which will affect access patterns and how it's wired up. A single server could be a compute box hosting no disks (at which point your dataset at rest won't even fit) or 4U holding 60 SATA drives vertically, at which point you could get 60*32TiB, 1.9 pebibytes for your data in 2024, but it would be a bit slow and have no redundancy. You could fit ten of those in a single rack for 19 pebibytes with no ToR switch, and just run twenty 1-gig Ethernet cables out (two per server) but what would be the point of that, other than a vendor trying to sell you something?

Anyway, so say you're told the dataset is 1 petabyte in 2024, is it on a single server or spread across many; possibly duplicated across multiple racks as well? You want to actually read the data at some point, and properly tuning storage array(s) to keep workers fed and not bottleneck on reading the data off storage may involve some changes to the system layout if you don't have a datacenter fabric with that kind of capacity. Which puts us back at sharding the data in multiple places, at which point even though the data does fit on a single server, it's spread out across a bunch for performance reasons.

Trying to derive server layout from dataset size is like asking about the number of lines of code used. A repo with 1 million LoC is different from one with 1,000, sure, but what can you really get from that?
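The capacity numbers a couple of paragraphs up can be checked with a few lines of arithmetic. This is only a sketch of that one hypothetical layout (60 drives of 32 TiB per 4U server, ten servers per rack, no redundancy); the variable names are invented for illustration.

```python
# Binary and decimal storage units, in bytes.
TIB = 2**40
PIB = 2**50
PB = 10**15

# Hypothetical dense layout described in the comment above.
drives_per_server = 60
drive_size_tib = 32
server_height_u = 4
servers_per_rack = 10

server_bytes = drives_per_server * drive_size_tib * TIB
rack_bytes = servers_per_rack * server_bytes

server_pib = server_bytes / PIB  # raw capacity of one server, in PiB
rack_pib = rack_bytes / PIB      # raw capacity of the rack, in PiB
rack_pb = rack_bytes / PB        # the same capacity in decimal petabytes
rack_units_used = servers_per_rack * server_height_u

print(f"per server: {server_pib} PiB")
print(f"per rack:   {rack_pib} PiB = {rack_pb:.1f} PB in {rack_units_used}U")
```

Note that binary and decimal units drift apart at this scale: the same rack comes out to about 12.6% more when counted in petabytes than in pebibytes.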

36. cubefox ◴[20 Nov 24 00:07 UTC] No.42189460[source]▶

>>42061089 (TP) #

> That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.

The earliest paper I know which says this explicitly is "The Unreasonable Effectiveness of Data" from 2009, only three years before AlexNet:

https://static.googleusercontent.com/media/research.google.c...

It's about machine translation.
