235 points by tosh | 1 comment
xanderlewis ◴[] No.40214349[source]
> Stripped of anything else, neural networks are compositions of differentiable primitives

I’m a sucker for statements like this. It almost feels philosophical, and makes the whole subject so much more comprehensible in only a single sentence.

I think François Chollet says something similar in his book on deep learning: one shouldn’t fall into the trap of anthropomorphising and mystifying models based on the ‘neural’ name; deep learning is simply the application of sequences of operations that are nonlinear (and hence capable of encoding arbitrary complexity) but nonetheless differentiable and so efficiently optimisable.
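
To make that concrete, here is a minimal sketch in PyTorch (my own toy example; the layer sizes are arbitrary) of a network as literally a composition of differentiable primitives, with autograd chaining the derivatives through the whole thing:

    import torch

    # f(x) = W2 @ relu(W1 @ x + b1) + b2: nothing but a composition
    # of differentiable primitives (matmul, add, relu).
    W1 = torch.randn(16, 8, requires_grad=True)
    b1 = torch.zeros(16, requires_grad=True)
    W2 = torch.randn(4, 16, requires_grad=True)
    b2 = torch.zeros(4, requires_grad=True)

    x = torch.randn(8)
    h = torch.relu(W1 @ x + b1)  # nonlinear, yet differentiable
    y = W2 @ h + b2

    # Every primitive has a derivative, so the chain rule yields a
    # gradient for the whole composition, which is what makes it
    # efficiently optimisable by gradient descent.
    loss = y.pow(2).sum()
    loss.backward()
    print(W1.grad.shape)  # torch.Size([16, 8])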

replies(12): >>40214569 #>>40214829 #>>40215168 #>>40215198 #>>40215245 #>>40215592 #>>40215628 #>>40216343 #>>40216719 #>>40216975 #>>40219489 #>>40219752 #
gessha ◴[] No.40215198[source]
It is soothing to the mind because it conveys that the subject is understandable, but it doesn’t take away from the complexity. You still have to read through the math and the PyTorch code, debug nonsensical CUDA errors, comb through the data, etc.
replies(1): >>40215394 #
whimsicalism ◴[] No.40215394[source]
the complexity is in the values learned from the optimization. even the pytorch code for a simple transformer is not that complex; attention is a simple mechanism, etc.
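
to illustrate: scaled dot-product attention, single head, no masking or dropout, in a few lines of pytorch (a sketch from memory, not any particular library's implementation):

    import math
    import torch

    def attention(q, k, v):
        # q, k, v: (seq_len, d). scores[i, j] says how much
        # position i attends to position j.
        scores = q @ k.T / math.sqrt(q.shape[-1])
        weights = torch.softmax(scores, dim=-1)
        return weights @ v  # each output is a weighted average of v

    q = k = v = torch.randn(5, 64)
    out = attention(q, k, v)  # (5, 64)
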
replies(1): >>40215998 #
gessha ◴[] No.40215998[source]
Complexity also comes from the number of papers that work out how different elements of a network work and how to intuitively change them.

Why do we use conv operators, why do we use attention operators, when do we use one over the other? What augmentations do you use, how big of a dataset do you need, how do you collect the dataset, etc etc etc

replies(1): >>40216078 #
whimsicalism ◴[] No.40216078{3}[source]
idk, just using attention and massive web crawls gets you pretty far. a lot of the rest is more product-style decisions about what personality you want your LM to take on.

I fundamentally don't think this technology is that complex.

replies(1): >>40219323 #
gessha ◴[] No.40219323{4}[source]
No? In his recent tutorial, Karpathy showed just how much complexity there is in the tokenizer.
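
Even a stripped-down sketch of the BPE merge loop at the heart of it (my toy version; real tokenizers add byte fallback, regex pre-splitting, special tokens, training over huge corpora, etc.) hints at how much is going on:

    from collections import Counter

    def bpe_merge_step(ids):
        # One step of byte-pair encoding: find the most frequent
        # adjacent pair and replace it everywhere with a new token id.
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            return ids, None
        best = max(pairs, key=pairs.get)
        new_id = max(ids) + 1
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        return merged, (best, new_id)

    ids = list(b"aaabdaaabac")  # start from raw bytes
    ids, merge = bpe_merge_step(ids)  # repeat until the vocab is full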

This technology has been years in the making, with many small advances pushing the performance ever so slightly. There have been theoretical and engineering advances that contributed to where we are today. And we need many more to get the technology to an actually usable level instead of the current word spaghetti that we get.

Also, the post is generally about neural networks and not just LMs.

When making design decisions about an ML system, you shouldn’t just choose the attention hammer and hammer away. There are a lot of design constraints you need to consider, which is why I made the original reply.

replies(1): >>40225160 #
whimsicalism ◴[] No.40225160{5}[source]
Are there micro-optimizations that eke out small advancements? Yes, absolutely - the modern tokenizer is a good example of that.

Is the core of the technology that complex? No. You could get very far with a naive tokenizer that just tokenized by words and replaced unknown words with <unk>. This is extremely simple to implement, and I've trained transformers like this. It (of course) makes a perplexity difference, but the core of the technology is unchanged and quite simple. Most of the complexity is in the hardware, not the software innovations.
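
Something along these lines (a sketch of what I mean by naive, not the exact code I trained with):

    from collections import Counter

    def build_vocab(corpus, max_size=50_000):
        # Keep the most frequent words; everything else maps to <unk>.
        counts = Counter(w for line in corpus for w in line.split())
        words = [w for w, _ in counts.most_common(max_size - 1)]
        return {w: i for i, w in enumerate(["<unk>"] + words)}

    def encode(text, vocab):
        # Unknown words collapse to <unk>; costs perplexity, but the
        # rest of the transformer pipeline is untouched.
        return [vocab.get(w, vocab["<unk>"]) for w in text.split()]

    vocab = build_vocab(["the cat sat", "the dog sat"])
    print(encode("the cat barked", vocab))  # 'barked' -> 0 (<unk>)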

> And we need many more to get the technology to an actually usable level instead of the current word spaghetti that we get.

I think the current technology is usable.

> you shouldn’t just choose the attention hammer and hammer away

It's a good first choice of hammer, tbph.