306 points slyall | 19 comments
kleiba ◴[] No.42061089[source]
> “Pre-ImageNet, people did not believe in data,” Li said in a September interview at the Computer History Museum. “Everyone was working on completely different paradigms in AI with a tiny bit of data.”

That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.

replies(7): >>42061617 #>>42061818 #>>42061987 #>>42063019 #>>42063076 #>>42064875 #>>42189460 #
1. FrustratedMonky ◴[] No.42061617[source]
Not really. This is referring back to the 80's. People weren't even doing 'ML'. And back then people were more focused on teasing out 'laws' in as few data points as possible. The focus was more on formulas and symbols, and finding relationships between individual data points. Not the broad patterns we take for granted today.
replies(2): >>42062250 #>>42063993 #
2. criddell ◴[] No.42062250[source]
I would say using backpropagation to train multi-layer neural networks would qualify as ML, and we were definitely doing that in the '80s.
replies(1): >>42062594 #
3. UltraSane ◴[] No.42062594[source]
Just with tiny amounts of data.
replies(1): >>42062627 #
4. jensgk ◴[] No.42062627{3}[source]
Compared to today. We thought we used large amounts of data at the time.
replies(1): >>42062803 #
5. UltraSane ◴[] No.42062803{4}[source]
"We thought we used large amounts of data at the time."

Really? Did it take at least an entire rack to store?

replies(1): >>42063257 #
6. jensgk ◴[] No.42063257{5}[source]
We didn't measure data size that way. At some point in the future someone will find this dialog and think that we don't have large amounts of data now, because we are not using entire solar systems for storage.
replies(1): >>42065235 #
7. mistrial9 ◴[] No.42063993[source]
The mid-90s had neural nets, and even a few popular-science books on them. The common hardware was so much less capable then.
replies(2): >>42064954 #>>42074497 #
8. sgt101 ◴[] No.42064954[source]
mid-60s had neural nets.

mid-90s had LeCun telling everyone that big neural nets were the future.

replies(1): >>42065537 #
9. UltraSane ◴[] No.42065235{6}[source]
Why couldn't you use a rack as a unit of storage at the time? Were 19" server racks not in common use yet? The storage capacity of a rack will grow over time.

My storage hierarchy goes: 1) one storage drive; 2) one server maxed out with the biggest storage drives available; 3) one rack filled with servers from (2); 4) one data center filled with racks from (3).
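A rough sketch of that hierarchy in numbers (the per-tier capacities below are assumptions picked for illustration, not figures quoted in this thread):

    # drive -> server -> rack -> data center, with assumed round-number capacities
    drive_tib = 30          # one large storage drive (assumption)
    drives_per_server = 24  # a storage-dense server (assumption)
    servers_per_rack = 20   # servers per 42U rack (assumption)
    racks_per_dc = 500      # racks in a modest data center (assumption)

    tiers = {
        "1) drive": drive_tib,
        "2) server": drive_tib * drives_per_server,
        "3) rack": drive_tib * drives_per_server * servers_per_rack,
        "4) data center": drive_tib * drives_per_server * servers_per_rack * racks_per_dc,
    }
    for name, tib in tiers.items():
        print(f"{name}: {tib:,} TiB (~{tib / 1024:.2f} PiB)")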

replies(1): >>42066284 #
10. dekhn ◴[] No.42065537{3}[source]
Mid-90s I was working on neural nets and other machine learning, based on gradient descent with manually computed derivatives, on genomic data (from what I can recall, we had no awareness of LeCun; I didn't find out about his great OCR results until much later). It worked fine and it seemed like a promising area.

My only surprise is how long it took to get to ImageNet, but in retrospect, I appreciate that a number of conditions had to be met (much more data, much better algorithms, much faster computers). I also didn't recognize just how poorly suited MLPs were for sequence modelling, compared to RNNs and transformers.

replies(1): >>42069033 #
11. fragmede ◴[] No.42066284{7}[source]
How big is a rack in VW beetles though?

It's a terrible measurement because it's an irrelevant detail about how the data is stored, and one that nobody outside the team running a proprietary cloud actually knows.

So while someone could say they used a 10 TiB dataset, or 10T parameters, how many "racks" of AWS S3 that amounts to is not known outside of Amazon.

replies(1): >>42072934 #
12. sgt101 ◴[] No.42069033{4}[source]
I'm so out of things! What do you mean by manually computed derivatives?
replies(2): >>42071400 #>>42072510 #
13. mistrial9 ◴[] No.42071400{5}[source]
it means that code has to read values from each layer and do some summarizing math, instead of passing layer blocks to a graphics card in one primitive operation implemented on the card.
replies(1): >>42072523 #
14. dekhn ◴[] No.42072510{5}[source]
I mean we didn't know autodifferentiation was a thing, so we (my advisor, not me) analytically solved our loss function for its partial derivatives. After I wrote up my thesis, I spent a lot of time learning Mathematica and advanced calculus.

I haven't invested the time to take the loss function from our paper and implement it in a modern framework, but IIUC, I wouldn't need to provide the derivatives manually. That would be a satisfying outcome (indicating I had wasted a lot of effort learning math that simply wasn't necessary, because somebody had automated it better than I could do manually, in a way I can understand more easily).
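For what it's worth, a minimal sketch of what that looks like in a modern autodiff framework (this is a toy squared-error loss, not the loss from the paper; the point is only that jax.grad supplies the partial derivatives without anyone solving them analytically):

    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        # toy model: linear prediction with squared error (not the paper's loss)
        pred = x @ w
        return jnp.sum((pred - y) ** 2)

    def manual_grad(w, x, y):
        # the same partial derivatives, solved analytically by hand
        return 2.0 * x.T @ (x @ w - y)

    auto_grad = jax.grad(loss)   # autodiff: no hand-derived formula needed

    w = jnp.array([0.5, -1.0])
    x = jnp.array([[1.0, 2.0], [3.0, 4.0]])
    y = jnp.array([1.0, 0.0])

    print(manual_grad(w, x, y))  # matches...
    print(auto_grad(w, x, y))    # ...the autodiff gradient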

replies(1): >>42078092 #
15. dekhn ◴[] No.42072523{6}[source]
No. I should have said "determined the partial derivatives of the weights with respect to the variables analytically". We didn't have layers; the whole architecture was a truly crazy combination of dynamic programming with multiple different matrices and a loss function that combined many different types of evidence. AFAICT nobody does any of this any more for finding genes. We just take enormous amounts of genetic data and run an autoencoder or a sequence model over it.
16. UltraSane ◴[] No.42072934{8}[source]
a 42U 19" inch rack is an industry standard. If you actually work on the physical infrastructure of data centers it is most CERTAINLY NOT an irrelevant detail.

And whether your data can fit on a single server, single rack, or many racks will drastically affect how you design the infrastructure.

replies(1): >>42082900 #
17. robotresearcher ◴[] No.42074497[source]
I worked on robot control with NNs in the early-mid nineties. Maybe seven neurons and 25 edges. No layers at all. The graph and edge weights determined by a genetic algorithm. Fun.
18. telotortium ◴[] No.42078092{6}[source]
I can't express the extent to which autodifferentiation was like a revelation to me. I don't work in ML, but in grad school around 2010 I was implementing density functional theory computations in a code that was written in Fortran 77. My particular optimization needs required computing to second derivatives. I had Mathematica to actually calculate the derivatives, but even just the step of mechanically translating the computed derivatives into Fortran 77 code would be a week of tedious work. Worse was rewriting these derivative expressions for numerical stability. The worst was realizing you made a mistake in an expression high in the tree and having to rewrite everything below. The whole process took months for a single model, and that's with chain rule depth that probably could be counted on one hand. I can't imagine deep learning making the kind of progress it has without autodifferentiation - the only saving grace is that neural networks tend to be composed from large number of copies of identical functions, and you only need to go to first derivatives.
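As a rough illustration of the same point (a stand-in scalar function, nothing to do with the actual DFT code): a modern autodiff framework hands you first and second derivatives directly, with no Mathematica-to-Fortran translation step:

    import jax
    import jax.numpy as jnp

    def energy(x):
        # stand-in scalar objective; the real DFT functional is far more involved
        return jnp.sum(jnp.sin(x) ** 2) + jnp.dot(x, x)

    grad_fn = jax.grad(energy)     # first derivatives
    hess_fn = jax.hessian(energy)  # second derivatives, no hand translation

    x0 = jnp.array([0.1, 0.2, 0.3])
    print(grad_fn(x0))
    print(hess_fn(x0))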
19. fragmede ◴[] No.42082900{9}[source]
A standard so standard you had to give two of the dimensions so as not to confuse it with something else? Like a 48U-tall data center rack, or a 23"-wide telco rack?

Okay, so it is relatively standard these days, but the problem is that you can change how many "U" or racks you need for the same amount of storage based on how you want to arrange it for a given use case, which will affect access patterns and how it's wired up. A single server could be a compute box hosting no disks (at which point your dataset at rest won't even fit), or a 4U chassis holding 60 SATA drives vertically, at which point you could get 60 × 32 TiB, about 1.9 pebibytes, for your data in 2024, but it would be a bit slow and have no redundancy. You could fit ten of those in a single rack for roughly 19 petabytes with no top-of-rack switch, and just run twenty 1-gig Ethernet cables out (two per server), but what would be the point of that, other than a vendor trying to sell you something?
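Spelling out that arithmetic (assuming the 32 TiB drives stated above):

    TIB = 2**40
    PIB = 2**50

    drives_per_server = 60
    drive_tib = 32

    server_bytes = drives_per_server * drive_tib * TIB
    print(server_bytes / PIB)  # ~1.875 PiB per 4U server ("about 1.9 pebibytes")

    servers_per_rack = 10
    print(servers_per_rack * server_bytes / PIB)  # ~18.75 PiB per rack ("~19 petabytes", loosely)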

Anyway, say you're told the dataset is 1 petabyte in 2024: is it on a single server or spread across many, possibly duplicated across multiple racks as well? You want to actually read the data at some point, and properly tuning the storage array(s) to keep workers fed without bottlenecking on reading data off storage may involve changes to the system layout if you don't have a datacenter fabric with that kind of capacity. Which puts us back at sharding the data in multiple places, at which point, even though the data does fit on a single server, it's spread across a bunch of them for performance reasons.

Trying to derive server layout from dataset size is like asking about the number of lines of code used. A repo with 1 million LoC is different from one with 1,000, sure, but what can you really get from that?