235 points by tosh | 1 comment
xanderlewis ◴[] No.40214349[source]
> Stripped of anything else, neural networks are compositions of differentiable primitives

I’m a sucker for statements like this. It almost feels philosophical, and makes the whole subject so much more comprehensible in only a single sentence.

I think François Chollet says something similar in his book on deep learning: one shouldn’t fall into the trap of anthropomorphising and mystifying models based on the ‘neural’ name; deep learning is simply the application of sequences of operations that are nonlinear (and hence capable of encoding arbitrary complexity) but nonetheless differentiable and so efficiently optimisable.
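
To make that concrete, here is a minimal sketch in PyTorch (my own toy example; the layer sizes are arbitrary) of a network as literally a composition of differentiable primitives, with autograd chaining the derivatives through the whole thing:

    import torch

    # f(x) = W2 @ relu(W1 @ x + b1) + b2: nothing but a composition
    # of differentiable primitives (matmul, add, relu).
    W1 = torch.randn(16, 8, requires_grad=True)
    b1 = torch.zeros(16, requires_grad=True)
    W2 = torch.randn(4, 16, requires_grad=True)
    b2 = torch.zeros(4, requires_grad=True)

    x = torch.randn(8)
    h = torch.relu(W1 @ x + b1)  # nonlinear, yet differentiable
    y = W2 @ h + b2

    # Every primitive has a derivative, so the chain rule yields a
    # gradient for the whole composition, which is what makes it
    # efficiently optimisable by gradient descent.
    loss = y.pow(2).sum()
    loss.backward()
    print(W1.grad.shape)  # torch.Size([16, 8])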

replies(12): >>40214569 #>>40214829 #>>40215168 #>>40215198 #>>40215245 #>>40215592 #>>40215628 #>>40216343 #>>40216719 #>>40216975 #>>40219489 #>>40219752 #
gessha ◴[] No.40215198[source]
It is soothing to the mind because it conveys that the subject is understandable, but it doesn’t take away from the complexity. You still have to read through the math and the PyTorch code, debug nonsensical CUDA errors, comb through the data, etc.
replies(1): >>40215394 #
whimsicalism ◴[] No.40215394[source]
the complexity is in the values learned from the optimization. even the pytorch code for a simple transformer is not that complex; attention is a simple mechanism, etc.
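
to illustrate: scaled dot-product attention, single head, no masking or dropout, in a few lines of pytorch (a sketch from memory, not any particular library's implementation):

    import math
    import torch

    def attention(q, k, v):
        # q, k, v: (seq_len, d). scores[i, j] says how much
        # position i attends to position j.
        scores = q @ k.T / math.sqrt(q.shape[-1])
        weights = torch.softmax(scores, dim=-1)
        return weights @ v  # each output is a weighted average of v

    q = k = v = torch.randn(5, 64)
    out = attention(q, k, v)  # (5, 64)
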
replies(1): >>40215998 #
gessha ◴[] No.40215998[source]
Complexity also comes from the number of papers that work out how different elements of a network work and how to intuitively change them.

Why do we use conv operators, why do we use attention operators, when do we use one over the other? What augmentations do you use, how big of a dataset do you need, how do you collect the dataset, etc etc etc

replies(1): >>40216078 #
whimsicalism ◴[] No.40216078{3}[source]
idk, just using attention and massive web crawls gets you pretty far. a lot of the rest is more product-style decisions about what personality you want your LM to take on.

I fundamentally don't think this technology is that complex.

replies(1): >>40219323 #
gessha ◴[] No.40219323{4}[source]
No? In his recent tutorial, Karpathy showed just how much complexity there is in the tokenizer.
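
Even a stripped-down sketch of the BPE merge loop at the heart of it (my toy version; real tokenizers add byte fallback, regex pre-splitting, special tokens, training over huge corpora, etc.) hints at how much is going on:

    from collections import Counter

    def bpe_merge_step(ids):
        # One step of byte-pair encoding: find the most frequent
        # adjacent pair and replace it everywhere with a new token id.
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            return ids, None
        best = max(pairs, key=pairs.get)
        new_id = max(ids) + 1
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        return merged, (best, new_id)

    ids = list(b"aaabdaaabac")  # start from raw bytes
    ids, merge = bpe_merge_step(ids)  # repeat until the vocab is full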

This technology has been years in the making, with many small advances pushing the performance ever so slightly. There have been theoretical and engineering advances that contributed to where we are today. And we need many more to get the technology to an actually usable level instead of the current word spaghetti that we get.

Also, the post is generally about neural networks and not just LMs.

When making design decisions about an ML system, you shouldn’t just choose the attention hammer and hammer away. There are a lot of design constraints you need to consider, which is why I made the original reply.

replies(1): >>40225160 #
whimsicalism ◴[] No.40225160{5}[source]
Are there micro-optimizations that eke out small advancements? Yes, absolutely - the modern tokenizer is a good example of that.

Is the core of the technology that complex? No. You could get very far with a naive tokenizer that just tokenized by words and replaced unknown words with <unk>. This is extremely simple to implement, and I've trained transformers like this. It (of course) makes a perplexity difference, but the core of the technology is unchanged and quite simple. Most of the complexity is in the hardware, not the software innovations.
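
Something along these lines (a sketch of what I mean by naive, not the exact code I trained with):

    from collections import Counter

    def build_vocab(corpus, max_size=50_000):
        # Keep the most frequent words; everything else maps to <unk>.
        counts = Counter(w for line in corpus for w in line.split())
        words = [w for w, _ in counts.most_common(max_size - 1)]
        return {w: i for i, w in enumerate(["<unk>"] + words)}

    def encode(text, vocab):
        # Unknown words collapse to <unk>; costs perplexity, but the
        # rest of the transformer pipeline is untouched.
        return [vocab.get(w, vocab["<unk>"]) for w in text.split()]

    vocab = build_vocab(["the cat sat", "the dog sat"])
    print(encode("the cat barked", vocab))  # 'barked' -> 0 (<unk>)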

> And we need many more to get the technology to an actually usable level instead of the current word spaghetti that we get.

I think the current technology is usable.

> you shouldn’t just choose the attention hammer and hammer away

It's a good first choice of hammer, tbph.