←back to thread

235 points tosh | 2 comments | | HN request time: 0.506s | source
Show context
xanderlewis ◴[] No.40214349[source]
> Stripped of anything else, neural networks are compositions of differentiable primitives

I’m a sucker for statements like this. It almost feels philosophical, and makes the whole subject so much more comprehensible in only a single sentence.

I think François Chollet says something similar in his book on deep learning: one shouldn’t fall into the trap of anthropomorphising and mysticising models based on the ‘neural’ name; deep learning is simply the application of sequences of operations that are nonlinear (and hence capable of encoding arbitrary complexity) but nonetheless differentiable and so efficiently optimisable.

replies(12): >>40214569 #>>40214829 #>>40215168 #>>40215198 #>>40215245 #>>40215592 #>>40215628 #>>40216343 #>>40216719 #>>40216975 #>>40219489 #>>40219752 #
captainclam ◴[] No.40214829[source]
Ugh, exactly, it's so cool. I've been a deep learning practitioner for ~3 years now, and I feel like this notion has really been impressed upon me only recently.

I've spent an awful lot of mental energy trying to conceive of how these things work, when really it comes down to "does increasing this parameter improve the performance on this task? Yes? Move the dial up a bit. No? Down a bit..." x 1e9.

And the cool part is that this yields such rich, interesting, sometimes even useful, structures!

I like to think of this cognitive primitive as the analogue to the idea that thermodynamics is just the sum of particles bumping into each other. At the end of the day, that really is just it, but the collective behavior is something else entirely.

replies(3): >>40214973 #>>40215025 #>>40218156 #
kadushka ◴[] No.40218156[source]
it comes down to "does increasing this parameter improve the performance on this task? Yes? Move the dial up a bit. No? Down a bit..." x 1e9

This is not how gradient based NN optimization works. What you described is called "random weight perturbation", a variant of evolutionary algorithms. It does not scale to networks larger than a few thousand parameters for obvious reasons.

NNs are optimized by directly computing a gradient which tells us the direction to go to to reduce the loss on the current batch of training data. There's no trying up or down and seeing if it worked - we always know which direction to go.

SGD and RWP are two completely different approaches to learning optimal NN weights.

replies(2): >>40218281 #>>40219050 #
1. captainclam ◴[] No.40219050[source]
I guess you could say I don't know RWP from Adam! :D

My og comment wasn't to accurately explain gradient optimization, I was just expressing a sentiment not especially aimed at experts and not especially requiring details.

Though I'm afraid I subjected you to the same "cringe" I experience when I read pop sci/tech articles describe deep learning optimization as "the algorithm" being "rewarded" or "punished," haha.

replies(1): >>40219538 #
2. kadushka ◴[] No.40219538[source]
No worries, we're all friends here!

it's just you happened to accidentally describe the idea behind RWP, which is a gradient-free optimization method, so I thought I should point it out.