
235 points tosh | 10 comments
xanderlewis ◴[] No.40214349[source]
> Stripped of anything else, neural networks are compositions of differentiable primitives

I’m a sucker for statements like this. It almost feels philosophical, and makes the whole subject so much more comprehensible in only a single sentence.

I think François Chollet says something similar in his book on deep learning: one shouldn’t fall into the trap of anthropomorphising and mystifying models based on the ‘neural’ name; deep learning is simply the application of sequences of operations that are nonlinear (and hence capable of encoding arbitrary complexity) but nonetheless differentiable and so efficiently optimisable.
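
To make that concrete, here’s a toy sketch of my own (not from the book or the article) of a tiny two-layer network written as nothing but composed differentiable primitives, assuming PyTorch for the autodiff:

    import torch

    x  = torch.randn(4)
    W1 = torch.randn(8, 4, requires_grad=True)
    b1 = torch.zeros(8, requires_grad=True)
    W2 = torch.randn(1, 8, requires_grad=True)

    # the whole "network" is just composed differentiable primitives:
    # affine map -> tanh -> affine map
    y = W2 @ torch.tanh(W1 @ x + b1)
    loss = (y - 1.0).pow(2).sum()

    loss.backward()        # chain rule through every primitive
    print(W1.grad.shape)   # d(loss)/d(W1), ready for an optimizer step

Every piece of that composition is differentiable, so the chain rule hands you the whole gradient; that’s the entire trick.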

replies(12): >>40214569 #>>40214829 #>>40215168 #>>40215198 #>>40215245 #>>40215592 #>>40215628 #>>40216343 #>>40216719 #>>40216975 #>>40219489 #>>40219752 #
1. captainclam ◴[] No.40214829[source]
Ugh, exactly, it's so cool. I've been a deep learning practitioner for ~3 years now, and I feel like this notion has really been impressed upon me only recently.

I've spent an awful lot of mental energy trying to conceive of how these things work, when really it comes down to "does increasing this parameter improve the performance on this task? Yes? Move the dial up a bit. No? Down a bit..." x 1e9.

And the cool part is that this yields such rich, interesting, sometimes even useful structures!

I like to think of this cognitive primitive as the analogue to the idea that thermodynamics is just the sum of particles bumping into each other. At the end of the day, that really is just it, but the collective behavior is something else entirely.
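
To make the dial picture concrete, here’s roughly what it looks like taken literally (a toy numpy sketch of my own on a least-squares problem, not how any framework actually trains):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 3))          # toy data
    y = rng.normal(size=32)
    w = np.zeros(3)                       # three "dials"

    def loss(w):
        return np.mean((X @ w - y) ** 2)

    for _ in range(1000):
        i = rng.integers(3)               # pick a dial
        trial = w.copy()
        trial[i] += 0.01 * rng.choice([-1, 1])   # nudge it up or down
        if loss(trial) < loss(w):         # keep the nudge only if it helps
            w = trial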

replies(3): >>40214973 #>>40215025 #>>40218156 #
2. xanderlewis ◴[] No.40214973[source]
> At the end of the day, that really is just it, but the collective behavior is something else entirely.

Exactly. It’s not to say that neat descriptions like this are the end of the story (or even the beginning of it). If they were, there would be no need for this entire field of study.

But they are cool, and can give you a really clear conceptualisation of something that can appear more like a sum of disjoint observations and ad hoc tricks than a discipline based on a few deep principles.

3. JackFr ◴[] No.40215025[source]
NAND gates by themselves are kind of dull, but it's pretty cool what you can do with a billion of them.
4. kadushka ◴[] No.40218156[source]
> it comes down to "does increasing this parameter improve the performance on this task? Yes? Move the dial up a bit. No? Down a bit..." x 1e9

This is not how gradient-based NN optimization works. What you described is called "random weight perturbation" (RWP), a variant of evolutionary algorithms. It does not scale to networks larger than a few thousand parameters, because each trial nudge costs a full loss evaluation and only tells you about that one change.

NNs are optimized by directly computing a gradient, which tells us which direction to move every weight to reduce the loss on the current batch of training data. There's no nudging a weight up or down and seeing if it worked; we always know which direction to go.

SGD and RWP are two completely different approaches to learning optimal NN weights.
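
For a toy least-squares problem, the gradient version looks something like this (a rough numpy sketch, with the gradient written out by hand rather than by a framework):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 3))
    y = rng.normal(size=32)
    w = np.zeros(3)

    for _ in range(100):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # d(loss)/dw, computed directly
        w -= 0.1 * grad                         # move every weight at once, no trials

One gradient evaluation updates all the weights simultaneously, which is why this scales to billions of parameters while per-weight trial and error doesn't.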

replies(2): >>40218281 #>>40219050 #
5. xanderlewis ◴[] No.40218281[source]
I don’t think the author literally meant tweaking the parameters and seeing what happens; it’s probably an analogy meant to give a sense of how the gradient indicates in which direction, and by how much, each parameter should be tweaked. Basically, substitute ‘the gradient of the loss with respect to this parameter is positive’ for ‘increasing this parameter decreases performance’, and vice versa, and it becomes correct.
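
Spelled out on a single weight (a toy example of my own, not anything from the article):

    # toy loss for one weight: L(w) = (w - 3)**2
    w = 5.0
    grad = 2 * (w - 3)     # +4 here: increasing w would increase the loss
    w = w - 0.1 * grad     # so the update nudges w down, toward 3
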
replies(1): >>40218390 #
6. p1esk ◴[] No.40218390{3}[source]
That substitution is the main difference between SGD and RWP.

It’s like describing bubble sort when you meant to describe quick sort. Would not fly on an ML 101 exam, or in an ML job interview.

replies(2): >>40218743 #>>40219369 #
7. xanderlewis ◴[] No.40218743{4}[source]
It’s not like that at all. You couldn’t accidentally sound like you’re describing quick sort when describing bubble sort, or vice versa. I can’t think of any substitution of a few words that would do that.

The meaning of the gradient is perfectly adequately described by the author. They weren’t describing an algorithm for computing it.

8. captainclam ◴[] No.40219050[source]
I guess you could say I don't know RWP from Adam! :D

My original comment wasn't meant to accurately explain gradient optimization; I was just expressing a sentiment, not especially aimed at experts and not especially requiring details.

Though I'm afraid I subjected you to the same "cringe" I experience when pop sci/tech articles describe deep learning optimization as "the algorithm" being "rewarded" or "punished," haha.

replies(1): >>40219538 #
9. a_random_canuck ◴[] No.40219369{4}[source]
I don’t think anyone is trying to pass an exam here, but just to give an understandable overview to a general audience.
10. kadushka ◴[] No.40219538{3}[source]
No worries, we're all friends here!

It's just that you happened to accidentally describe the idea behind RWP, which is a gradient-free optimization method, so I thought I should point it out.