I’m a sucker for statements like this. It almost feels philosophical, and makes the whole subject so much more comprehensible in only a single sentence.
I think François Chollet says something similar in his book on deep learning: one shouldn’t fall into the trap of anthropomorphising and mysticising models based on the ‘neural’ name; deep learning is simply the application of sequences of operations that are nonlinear (and hence capable of encoding arbitrary complexity) but nonetheless differentiable and so efficiently optimisable.
The actual primitive functions in this case would be things like the weighted sums of activations in the previous layer to get the activation of a given layer, and the actual ‘activation functions’ (traditionally something like a sigmoid function; these days a ReLU) associated with each layer.
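To make that concrete, here's a toy sketch of one such primitive (a single dense layer; the names and numbers are mine, just for illustration):

```python
import numpy as np

def relu(x):
    # The nonlinear 'activation function', applied elementwise
    return np.maximum(0.0, x)

def dense_layer(prev_activations, weights, bias):
    # Weighted sum of the previous layer's activations, then the nonlinearity
    return relu(weights @ prev_activations + bias)

# A 'deep' network is just a composition of layers like this one
a0 = np.array([1.0, -2.0, 0.5])
W1 = np.eye(3)       # toy parameters
b1 = np.zeros(3)
a1 = dense_layer(a0, W1, b1)   # negative activations get clamped to 0
```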
‘Primitives’ is also sometimes used as a synonym for antiderivatives, but I don’t think that’s what it means here.
Edit: it just occurred to me from a comment below that you might have meant to ask what the ‘differentiable’ part means. See https://en.wikipedia.org/wiki/Differentiable_function.
I've spent an awful lot of mental energy trying to conceive of how these things work, when really it comes down to "does increasing this parameter improve the performance on this task? Yes? Move the dial up a bit. No? Down a bit..." x 1e9.
And the cool part is that this yields such rich, interesting, sometimes even useful, structures!
I like to think of this cognitive primitive as the analogue to the idea that thermodynamics is just the sum of particles bumping into each other. At the end of the day, that really is just it, but the collective behavior is something else entirely.
Exactly. It’s not to say that neat descriptions like this are the end of the story (or even the beginning of it). If they were, there would be no need for this entire field of study.
But they are cool, and can give you a really clear conceptualisation of something that can appear more like a sum of disjoint observations and ad hoc tricks than a discipline based on a few deep principles.
He also wrote What the Tortoise Said to Achilles (1895) in which the paradoxes of Zeno are discussed.
So it's more correct to say that GEB and this article were originally inspired by Lewis Carroll's work.
[1] I wrote a short article for my university magazine a long time ago. Some interesting references at the end https://abd.tiddlyspot.com/#%5B%5BMathematical%20Adventures%...
Only recently have I begun to appreciate that the simplicity of the operation, applied to large enough matrices, may still capture enough of the nature of intelligence and sentience. In the end we can be broken down into (relatively) simple chemical reactions, and it is the massive scale of these reactions that creates real intelligence and sentience.
The success of deep learning is basically attributable to composable (expressive), differentiable (learnable) functions. The "deep" moniker alludes to the compositionality.
> I’m a sucker for statements like this. It almost feels philosophical, and makes the whole subject so much more comprehensible in only a single sentence.
And I hate inaccurate statements like this. It pretends to be rigorously mathematical, but really it just propagates erroneous information, and makes the whole article so much more amateurish in only a single sentence.
The simple relu is continuous but not differentiable at 0, and its derivative is discontinuous at 0.
If you want to have a war of petty pedantry, let’s go: the derivative of ReLU can’t be discontinuous at zero, as you say, because continuity (or indeed discontinuity) of a function at x requires the function to have a value at x (which is the negation of what your first statement correctly claims).
Quickly skimming the draft pdf at https://arxiv.org/pdf/2404.17625 I can grok it instantly, because it's written in familiar academic language instead of gobbledygook. Anyone with an undergrad math education in engineering, computer science, etc or a self-taught equivalent understanding of differential equations should be able to read it easily. It does a really good job of connecting esoteric terms like tensors with arrays, gradients with partial derivatives, Jacobians with gradients and backpropagation with gradient descent in forward/reverse mode automatic differentiation. Which helps the reader to grasp the fundamentals instead of being distracted by the implementation details of TensorFlow, CUDA, etc. Some notable excerpts:
Introduction (page 4):
By viewing neural networks as simply compositions of differentiable primitives we can ask two basic questions (Figure F.1.3): first, what data types can we handle as inputs or outputs? And second, what sort of primitives can we use? Differentiability is a strong requirement that does not allow us to work directly with many standard data types, such as characters or integers, which are fundamentally discrete and hence discontinuous. By contrast, we will see that differentiable models can work easily with more complex data represented as large arrays (what we will call tensors) of numbers, such as images, which can be manipulated algebraically by basic compositions of linear and nonlinear transformations.
Chapter 2.2 Gradients and Jacobians (page 23): [just read this section - it connects partial derivatives, gradients, Jacobians and Taylor’s theorem - wow!]
Chapter 4.1.5 Some computational considerations (page 59): In general, we will always prefer algorithms that scale linearly both in the feature dimension c and in the batch size n, since super-linear algorithms will become quickly impractical (e.g., a batch of 32 RGB images of size 1024×1024 has c ≈ 1e7). We can avoid a quadratic complexity in the equation of the gradient by computing the multiplications in the correct order, i.e., computing the matrix-vector product Xw first. Hence, pure gradient descent is linear in both c and n, but only if proper care is taken in the implementation: generalizing this idea is the fundamental insight for the development of reverse-mode automatic differentiation, a.k.a. back-propagation (Section 6.3).
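The ordering trick in that excerpt is easy to demonstrate: the two evaluation orders give the same vector, but forming the c×c matrix first costs O(n·c²) while doing the matrix-vector product first costs O(n·c). A quick numpy sketch (my own, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 1000, 50                      # batch size, feature dimension
X = rng.standard_normal((n, c))
w = rng.standard_normal(c)

# Forming X.T @ X first materialises a c-by-c matrix: O(n * c^2)
slow = (X.T @ X) @ w

# Computing the matrix-vector product X @ w first stays O(n * c)
fast = X.T @ (X @ w)
```

Same result, very different scaling as c grows; that reordering is exactly the seed of reverse-mode AD.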
Chapter 6 Automatic differentiation (page 87): We consider the problem of efficiently computing gradients of generic computational graphs, such as those induced by optimizing a scalar loss function on a fully-connected neural network, a task called automatic differentiation (AD) [BPRS18]. You can think of a computational graph as the set of atomic operations (which we call primitives) obtained by running the program itself. We will consider sequential graphs for brevity, but everything can be easily extended to more sophisticated, acyclic computational graphs.
The problem may seem trivial, since the chain rule of Jacobians (Section 2.2, (E.2.22)) tells us that the gradient of function composition is simply the matrix product of the corresponding Jacobian matrices. However, efficiently implementing this is the key challenge, and the resulting algorithm (reverse-mode AD or backpropagation) is a cornerstone of neural networks and differentiable programming in general [GW08, BR24]. Understanding it is also key to understanding the design (and the differences) of most frameworks for implementing and training such programs (such as TensorFlow or PyTorch or JAX). A brief history of the algorithm can be found in [Gri12].
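For a purely sequential graph the whole algorithm fits in a few lines. This is my own toy sketch (scalar-valued, invented primitive names), not the book's code: record each primitive's input on the forward pass, then multiply local derivatives in reverse order per the chain rule:

```python
import math

# Each primitive carries a forward rule and a local-derivative rule
primitives = {
    'square': (lambda x: x * x, lambda x: 2 * x),
    'exp':    (math.exp,        math.exp),
    'scale3': (lambda x: 3 * x, lambda x: 3.0),
}

def grad_sequential(ops, x):
    # Forward pass: evaluate and record the input to every primitive
    inputs = []
    for name in ops:
        f, _ = primitives[name]
        inputs.append(x)
        x = f(x)
    # Backward pass: chain rule, multiplying local derivatives in reverse
    g = 1.0
    for name, xin in zip(reversed(ops), reversed(inputs)):
        _, df = primitives[name]
        g *= df(xin)
    return x, g

# f(x) = 3 * x^2, so f'(x) = 6x; at x = 2 the gradient is 12
value, grad = grad_sequential(['square', 'scale3'], 2.0)
```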
Edit: I changed Chapter 2.2.3 Jacobians (page 27) to Chapter 2.2 Gradients and Jacobians (page 23) for better context.

And yet, artificial neural networks ARE an approximation of how biological neurons work. It is worth noting that they came out of neurobiology and not some math department - well at least in the forward direction, I'm not sure who came up with the training algorithms (probably the math folks). Should they be considered mystical? No. I would also posit that biological neurons are more efficient and probably have better learning algorithms than artificial ones today.
I'm confused as to why some people seem to shun the biological equivalence of these things. In a recent thread here I learned that physical synaptic weights (in our brains) are at least partly stored in DNA or its methylation. If that isn't fascinating I'm not sure what is. Or is it more along the lines of intelligence can be reduced to a large number of simple things, and biology has given us an interesting physical implementation?
discontinuity of a function at x does not, according to the usual definition of 'continuity', require the function to have a value at x; indeed, functions that fail to have a value at x are necessarily discontinuous there, precisely because (as you say) they are not continuous there. https://en.wikipedia.org/wiki/Continuous_function#Definition...
there are other definitions of 'discontinuous' in use, but i can't think of one that would give the result you claim
artificial neural networks are an approximation of biological neural networks in the same way that a submarine is an approximation of a fish
Why do we use conv operators, why do we use attention operators, when do we use one over the other? What augmentations do you use, how big of a dataset do you need, how do you collect the dataset, etc etc etc
I fundamentally don't think this technology is that complex.
Sure. But what part of this very short statement, worded entirely in natural language, made you think it was a technical, formal statement? I think you're just taking an opportunity to flex your knowledge of basic calculus, deliberately attributing intent to the author that isn't there in order to look clever.
Regarding a function being discontinuous at a point outside its domain: if you take a completely naive view of what ‘discontinuous’ means, then I suppose you can say so. But discontinuity is just the logical negation of continuity. Observe:
To say that f: X -> Y (in this context, a real-valued function of real numbers) is continuous means precisely

∀p∈X ∀ε>0 ∃δ>0 ∀x∈X: |x - p| < δ ⇒ |f(x) - f(p)| < ε

and so its negation looks like

∃p∈X ¬(…)

that is, there is a point p in X, the domain of f, where continuity fails.
For example, you wouldn’t talk about a function defined on the integers being discontinuous at pi, would you? That would just be weird.
To prove the point further, observe that the set of discontinuities (according to your definition) of any given function would actually include every number… in fact every mathematical object in the universe — which would make it not even a set in ZFC. So it’s absurd.
Even more reasons to believe functions can only be discontinuous at points of their domain: a function is said to be discontinuous if it has at least one discontinuity. By your definition, every function is discontinuous.
…anyway, I said we were going to be petty. I’m trying to demonstrate this is a waste of time by wasting my own time.
The original idea of approximating something like a neuron using a weighted sum (which is a fairly obvious idea, given the initial discovery that neurons become ‘activated’ and they do so in proportion to how much the neurons they are connected to are) did come from thinking about biological brains, but the mathematical building blocks are incredibly simple and are hundreds of years old, if not thousands.
You can certainly do function composition in lambda calculus: in fact, the act of composition itself is a higher order function (takes functions and returns a function) and you can certainly express it formally with lambda terms and such. It’s not really got anything to do with any particular language or model of computation though.
This is a difference of degree not of kind, because neural networks are Turing complete. Whatever additional complexity the neuron has can itself be modelled as a neural network.
Edit: meaning, that if the greater complexity of a biological neuron is relevant to its information processing component, then that just increases the number of artificial neural network neurons needed to describe it, it does not need any computation of a different kind.
One also shouldn't fall into the dual trap of assuming that just because one understands how a model works, it cannot have any bearing on the ever-mysterious operation of the brain.
Though other things fit this description which are not deep learning. Like (shameless plug) my recent paper here https://ieeexplore.ieee.org/document/10497907
I guess my question, is what are the primitive functions here doing?
The sign of a truly good conversation?
Maybe your question boils down to asking something more general like: what’s the difference between functions to a computer scientist (or a programmer) and functions to a mathematician? That is, are ‘functions’ in C (or lambda calculus), say, the same ‘functions’ we talk about in calculus?
The answer to that is: in this case, because these are quite simple functions (sums and products and compositions thereof) they’re the same. In general, they’re a bit different. The difference is basically the difference between functional programming and ‘traditional’ programming. If you have state/‘side effects’ of functions, then your function won’t be a function in the sense of mathematics; if the return value of your function depends entirely on the input and doesn’t return different values depending on whatever else is happening in the program, then it will be.
Since you’re asking about lambda calculus in particular, the answer is that they’re the same because lambda calculus doesn’t have state. It’s ‘purely functional’ in that sense.
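The state/purity distinction is easy to see in a couple of lines (my own example; a hypothetical counter, not anything from the thread):

```python
counter = 0

def impure_add(a, b):
    # Reads and mutates state outside its parameters:
    # NOT a function in the mathematical sense
    global counter
    counter += 1
    return a + b + counter

def pure_add(a, b):
    # Output determined entirely by the inputs
    return a + b

assert impure_add(1, 2) != impure_add(1, 2)   # same inputs, different outputs
assert pure_add(1, 2) == pure_add(1, 2)
```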
>I guess my question, is what are the primitive functions here doing?
I’m not really sure what you mean. They’re doing what functions always do. Every computer program is abstractly a (partial) function.
Does that help, or have I misunderstood?
When I think of functions in a traditional mathematical sense, I think about transformations of numbers: x->2x, x->2x^2, etc. I completely understand composition of functions here (e.g. substituting x->2x into x->2x^2 to get x->2(2x)^2), but it's unclear how these transformations relate to computation. For a regression problem, I can totally understand how finding the right composition of functions can lead to better approximations. So I am wondering: in an LLM architecture, what computations do these functions actually represent? I assume it has something to do with what path to take through the neural layers. I probably just need to take the time to study it deeper.
>If you have state/‘side effects’ of functions, then your function won’t be a function in the sense of mathematics; if the return value of your function depends entirely on the input and doesn’t return different values depending on whatever else is happening in the program, then it will be.
Totally understood from the perspective of functions in, say, Java. Though fundamentally I don't think there is a distinction between functions in computer science and mathematics. The program as a whole is effectively a function. The "global" state is, from another frame of reference, just the local variables of the encompassing function. If a function modifies variables outside the "function block" (in, say, Java), the "input" to the function isn't just the parameters of the function. Imo, this is more an artifact of how some languages are implemented than a fundamental difference. Python, for example, requires declaring global variables in the function block. Go one step further and require putting global variables into the parameter list and you're pretty close to satisfying this.
The state of a neural network is described entirely by its parameters, which usually consist of a long array (well, a matrix, or a tensor, or whatever…) of floating point numbers. What is being optimised when a network is trained is these parameters and nothing else.

When you evaluate a neural network on some input (often called performing ‘inference’), that is when the functions we’re talking about are used. You start with the input vector, you apply all of those functions in order, and you get the output vector of the network. The training process also uses these functions, because to train a network you have to perform evaluation repeatedly in between tweaking those parameters to make it better approximate the desired output for each input.

Importantly, the functions do not change. They are constant; it’s the parameters that change. The functions are the architecture, not the thing being learned. Essentially what the parameters represent is how likely each neuron is to be activated (have a high value) if others in the previous layer are. So you can think of the parameters as encoding strengths of connections between each pair of neurons in consecutive layers. Thinking about ‘what path to take through the neural layers’ is way too sophisticated; it’s not doing anything like that.
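A toy sketch of that separation (my own illustration, with made-up parameters): the forward function is fixed code, and only the parameter array would change during training.

```python
import numpy as np

def forward(params, x):
    # The architecture: a fixed composition of primitives.
    # This code never changes during training.
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, W1 @ x + b1)   # layer 1: weighted sum + ReLU
    return W2 @ h + b2                 # layer 2: weighted sum

# Two different parameter settings, one and the same architecture
x = np.array([1.0, 2.0])
params_a = (np.eye(2),     np.zeros(2), np.ones((1, 2)), np.zeros(1))
params_b = (2 * np.eye(2), np.zeros(2), np.ones((1, 2)), np.zeros(1))

# Training only nudges the numbers in params; 'forward' stays constant
out_a = forward(params_a, x)
out_b = forward(params_b, x)
```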
> Though fundamentally I don't think there is distinction between functions in computer science and mathematics. The program as a whole is effectively a function.
You’re pretty much right about that, but there are two important problems/nitpicks:
(1) We can’t prove (in general) that a given program will halt and evaluate to something (rather than just looping forever) on a given input, so the ‘entire program’ is instead what’s called a partial function. This means that it’s still a function on its domain — but we can’t know what its precise domain is. Given an input, it may or may not produce an output. If it does, though, it’s well defined because it’s a deterministic process.
(2) You’re right to qualify that it’s the whole program that is (possibly) a function. If you take a function from some program that depends on some state in that same program, then clearly that function won’t be a proper ‘mathematical’ function. Sure, if you incorporate that extra state as one of your inputs, it might be, but that’s a different function. You have to remember that in mathematics, unlike in programming, a function consists essentially of three pieces of data: a domain, a codomain, and a ‘rule’. If you want to be set-theoretic and formal about it, this rule is just a subset of the cartesian product of its domain and codomain (it’s a set of pairs of the form (x, f(x))). If you change either of these sets, it’s technically a different function and there are good reasons for distinguishing between these. So it’s not right to say that mathematical functions and functions in a computer program are exactly the same.
(1) Neural networks are Turing complete, and hence can do anything brains can. [debatable anyway; We don’t know this to be the case since brains might be doing more than computation. Ask a philosopher or a cognitive scientist. Or Roger Penrose.]
(2) Neural networks were very loosely inspired by the idea that the human brain is made up of interconnected nodes that ‘activate’ in proportion to how other related nodes do.
I don’t think that’s nearly enough to say that they’re equivalent. For (1), we don’t yet know (and we’re not even close), and anyway: if you consider all Turing complete systems to be equivalent to the point of it being a waste of time to talk about their differences then you can say goodbye to quite a lot of work in theoretical computer science. For (2): so what? Lots of things are inspired by other things. It doesn’t make them in any sense equivalent, especially if the analogy is as weak as it is in this case. No neuroscientist thinks that a weighted sum is an adequate (or even remotely accurate) model of a real biological neuron. They operate on completely different principles, as we now know much better than when such things were first dreamed up.
Perhaps fundamentally they are not, but its also true that just writing more and more random assembly code isn't going to lead to an LLM.
This is not how gradient based NN optimization works. What you described is called "random weight perturbation", a variant of evolutionary algorithms. It does not scale to networks larger than a few thousand parameters for obvious reasons.
NNs are optimized by directly computing a gradient which tells us the direction to go to to reduce the loss on the current batch of training data. There's no trying up or down and seeing if it worked - we always know which direction to go.
SGD and RWP are two completely different approaches to learning optimal NN weights.
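A minimal 1-D contrast between the two (my own toy loss with its minimum at w = 3; obviously not representative of real training):

```python
import random

def loss(w):
    # Toy quadratic loss, minimised at w = 3
    return (w - 3) ** 2

def grad(w):
    # Analytic gradient: gradient descent already knows which way to move
    return 2 * (w - 3)

def sgd_step(w, lr=0.1):
    # Gradient descent: step directly downhill, no guessing
    return w - lr * grad(w)

def rwp_step(w, scale=0.1):
    # Random weight perturbation: guess a move, keep it only if it helps
    trial = w + random.uniform(-scale, scale)
    return trial if loss(trial) < loss(w) else w

w = 0.0
for _ in range(100):
    w = sgd_step(w)
# w is now essentially at the optimum; RWP would need far more trials,
# and the gap explodes with the number of parameters
```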
As for equivalency, that depends on how that's defined. Real neurons would not feature any more computational power than Turing machines or artificial neural networks, but I never said it would be a waste of time to talk about their differences. I merely pointed out that the artificial neural network model is still sufficient, even if real neurons have more complexity.
> No neuroscientist thinks that a weighted sum is an adequate (or even remotely accurate) model of a real biological neuron
Fortunately that's not what I said. If the neuron indeed has more relevant complexity, then it wouldn't be one weighted sum = one biological neuron, but one biological neuron = a network of weighted sums, since such a network can model any function.
If you’re interested in pure computational ‘power’, then if the brain is nothing more than a Turing machine (which, as you agree, it might not be), fine. You can call them ‘equivalent’. It’s just not very meaningful.
What’s interesting about neural nets has nothing to do with what they can compute; indeed they can compute anything any other Turing machine can, and nothing more. What’s interesting is how they do it, since they can ‘learn’ and hence allow us to produce solutions to hard problems without any explicit programming or traditional analysis of the problem.
> that would overturn quite a bit of physics
Our physics is currently woefully incomplete, so… yes. That would be welcome.
The difference here is that it's just more obvious how to do this in one case than the other.
My point was only that 1) neural networks are sufficient, even if real neurons have additional complexity, and 2) whatever that additional complexity, artificial neural networks can learn to reproduce it.
You might think that it doesn't matter because ReLU is, e.g., non-differentiable "only at one point".
Gradient based methods (what you find in pytorch) generally rely on the idea that gradients should taper to 0 in the proximity of a local optimum. This is not the case for non-differentiable functions, and in fact gradients can be made to be arbitrarily large even very close to the optimum.
As you may imagine, it is not hard to construct examples where simple gradient methods that do not properly take these facts into account fail to converge. These examples are not exotic.
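The simplest such example is f(x) = |x| (my own illustration, using the fixed-step gradient method the comment describes): the gradient magnitude is 1 arbitrarily close to the optimum at 0, so a fixed step size never settles.

```python
def f(x):
    return abs(x)

def df(x):
    # Derivative of |x| away from 0: always +/-1, never tapering toward 0
    return 1.0 if x > 0 else -1.0

# Arbitrarily close to the optimum, the gradient magnitude is still 1
assert abs(df(1e-9)) == 1.0

# A fixed-step gradient method therefore oscillates around 0 forever
x, lr = 0.05, 0.1
for _ in range(10):
    x = x - lr * df(x)
# x is bouncing between +0.05 and -0.05, not converging
```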
>Essentially what the parameters represent is how likely each neuron is to be activated (have a high value) if others in the previous layer are. So you can think of the parameters as encoding strengths of connections between each pair of neurons in consecutive layers. Thinking about ‘what path to take through the neural layers’ is way too sophisticated — it’s not doing anything like that.
I'm a little confused. The discussion thus far has been about how neural networks are essentially just compositions of functions, but you are now saying that the function is static and only the parameters change.
But that aside, if these parameters change which neurons are activated, and this activation affects which neurons are activated in the next layer, are these parameters effectively not changing the path taken through the layers?
>Sure, if you incorporate that extra state as one of your inputs, it might be, but that’s a different function.
So say we have this program:

"let c = 2; function 3sum(a, b) { return a + b + c; } let d = 3sum(3, 4);"

I believe you are saying that if we had constructed this instead as

"function 3sum(a, b, c) { return a + b + c; } let d = 3sum(3, 4, 2);"

then this is a different function.

Certainly, these are different in a sense, but at a fundamental level, when you compile this all down and run it, there is an equivalence in the transformation that is happening. That is, the two functions equivalently take some input state A (composed of a, b, c) and return the same output state B, while applying the same intermediary steps (add a to b, add c to the result of (add a to b)). Really, in the first case, where c is defined outside the scope of the function block, the interpreter is effectively producing the function 3sum(x, y, c), as it has to, at some point, one way or another, inject c into a + b + c.
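Here's that sketch as runnable Python (renamed to sum3, since an identifier can't start with a digit): one version reads c from the enclosing scope, the other takes it as a parameter, and both compute the same transformation of the state (a, b, c).

```python
c = 2

def sum3_closure(a, b):
    # Reads c from the enclosing scope: its output depends on
    # more than its declared parameters
    return a + b + c

def sum3_explicit(a, b, c):
    # Same computation, with the outside state passed in explicitly
    return a + b + c

# Both perform the same transformation of the state (a, b, c)
assert sum3_closure(3, 4) == sum3_explicit(3, 4, 2)
```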
Similarly, I won't argue that the current formal definition of functions in mathematics is exactly that of functions as they're generally defined in programming.
Rather, what I'm saying is that there is an equivalent way to think about and study functions that applies equally to both fields. That is, a function is simply a transformation from A to B, where A and B can be anything, whether that is bits, numbers, or any other construction in any system. The only primitive distinction to make here is whether A and B are the same thing or different.
Just as modeling and running a single neuron takes x transistors configured in a very specific way, for example, it may take y neurons arranged in some very specific, unknown way to model something that has extra properties.
And it's not clear either whether neurons are fundamentally the correct approach for reaching this higher-level construction, as opposed to some other kind of node.
The meaning of the gradient is perfectly adequately described by the author. They weren’t describing an algorithm for computing it.
My og comment wasn't to accurately explain gradient optimization, I was just expressing a sentiment not especially aimed at experts and not especially requiring details.
Though I'm afraid I subjected you to the same "cringe" I experience when I read pop sci/tech articles describe deep learning optimization as "the algorithm" being "rewarded" or "punished," haha.
This technology has been years in the making with many small advances pushing the performance ever so slightly. There’s been theoretical and engineering advances that contributed to where we are today. And we need many more to get the technology to an actually usable level instead of the current word spaghetti that we get.
Also, the post is generally about neural networks and not just LMs.
When making design decisions about an ML system you shouldn’t just choose the attention hammer and hammer away. There’s a lot of design constraints you need to consider which is why I made the original reply.
It's kind of like saying, "Stripped of anything else, works of literature are compositions of words"
even in the case of a single discontinuity in the derivative, like in relu', you lose the intermediate value theorem and everything that follows from it; it's not an inconsequential or marginally relevant fact
For a non-vapid/non-vacuous definition of 'approximation' this is not true at all. It is well understood that (i) back-propagation is biologically infeasible in the brain, and (ii) "output 'voltage' is a transformed weighted average of the input 'voltage'" is not how neurons operate. (ii) is in the 'not even wrong' category.
Neurons operate in terms of spikes and frequency and quiescence of spiking. If you are interested any undergrad text in neurobiology will help correct the wrong notions.
Nope.
Neurons in our brain operate fundamentally differently. They work by transient spikes, and information is carried not by the intensity of the spike voltage but by the frequency of spiking. This is a fundamentally different phenomenon from ANNs, where the output (voltage) is a squash-transformed aggregate of the input values (voltages).
Only if you limit yourself to "sums of weighted inputs, sent through a 1D activation function".
However, the parent said "differentiable primitives": these days people have built networks that contain differentiable ray-tracers, differentiable physics simulations, etc. Those seem like crazy ideas if we limit ourselves to the "neural" analogy; but are quite natural for a "composition of differentiable primitives" approach.
My conclusion is we tend to overestimate our understanding and the power of our inventions.
I started self-studying programming some time ago, then pivoted to AI/ML and (understandably) ended up mostly studying math. These resources are a boon to folks like me.
Edit to add: I was mostly trying to push back on the implication that Disney owns Alice in Wonderland (and Peter Pan, Winnie the Pooh, etc). Now I re-read the original comment, they did specify "Disney-based", so maybe I'm over-reacting!
Is the core of the technology that complex? No. You could get very far with a naive tokenizer that just tokenized by words and replaced unknown words with <unk>. This is extremely simple to implement and I've trained transformers like this. It (of course) makes a perplexity difference but the core of the technology is not changed and is quite simple. Most of the complexity is in the hardware, not the software innovations.
> And we need many more to get the technology to an actually usable level instead of the current word spaghetti that we get.
I think the current technology is useable.
> you shouldn’t just choose the attention hammer and hammer away
It's a good first choice of hammer, tbph.
I thought they worked like accumulators where the spike "energy" accumulates until the output "fires". If that's the case then the artificial NNs are still an approximation of that process. I agree that this is a significant difference, but the mathematical version is still a rough approximation inspired by the biological one.
I stand by my position that having a mathematical proof of computational universality is a significant difference that separates today from all prior eras that sought to understand the brain through contemporaneous technology.
That’s not what I’m talking about. This is a basic analysis topic:
https://en.m.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_th...
At least mid 1800s for a proof. 1700s also explored Fourier series.
> stand by my position
And you’re still ignoring the cybernetics, and perceptrons movement I keep referring to which was more than 100 years ago, and informed by Turing.
It's the same basic flaw: requiring continuous functions. Not all functions are continuous, therefore this is not sufficient.
> And you’re still ignoring the cybernetics, and perceptrons movement I keep referring to which was more than 100 years ago, and informed by Turing.
What about them? As long as they're universal, they can all simulate brains. Anything after Church and Turing is just window dressing. Notice how none of these new ideas claimed to change what could in principle be computed, only how much easier or more natural this paradigm might be for simulating or creating brains.
It’s also a different reason than Taylor series which uses differentiability.
You do not understand this subject. Please read before repeating this: https://en.m.wikipedia.org/wiki/Universal_approximation_theo...
> what about them
Then you seem to have lost the subject of the thread.
There are ANN models that model these spike trains (that's what those 'avalanches' are called), and these do work similarly to real neurons, but they are not part of the deep neural network wave of popularity [0,1]. Besides, backpropagation is not what goes on in the brain; it's known to be biologically infeasible.
So all in all the traditional ANNs are nothing like real neural networks. That's ok, aeroplanes do not fly like birds, but they do still 'fly'.