
235 points | tosh | 6 comments
xanderlewis:
> Stripped of anything else, neural networks are compositions of differentiable primitives

I’m a sucker for statements like this. It almost feels philosophical, and makes the whole subject so much more comprehensible in only a single sentence.

I think François Chollet says something similar in his book on deep learning: one shouldn’t fall into the trap of anthropomorphising and mystifying models based on the ‘neural’ name; deep learning is simply the application of sequences of operations that are nonlinear (and hence capable of encoding arbitrary complexity) but nonetheless differentiable and so efficiently optimisable.

andoando:
What does "differentiable primitives" mean here?
xanderlewis:
I think it’s referring to ‘primitive functions’ in the sense that they’re the building blocks of more complicated functions. If f and g are differentiable, f+g, fg, f/g (as long as g is never zero)… and so on are differentiable too. Importantly, f composed with g is also differentiable, and so since the output of the whole network as a function of its input is a composition of these ‘primitives’ it’s differentiable too.

The actual primitive functions in this case would be things like the weighted sums of activations in the previous layer to get the activation of a given layer, and the actual ‘activation functions’ (traditionally something like a sigmoid function; these days a ReLU) associated with each layer.
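To make that concrete, here’s a minimal sketch in plain Python + NumPy (the function names are mine, not from any library) of a tiny network built by composing exactly those primitives: affine maps and elementwise nonlinearities.

    import numpy as np

    def affine(W, b):
        # weighted sum of the previous layer's activations (a differentiable primitive)
        return lambda x: W @ x + b

    def relu(x):
        # elementwise nonlinearity, differentiable almost everywhere
        return np.maximum(0.0, x)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    layer1 = affine(rng.standard_normal((4, 3)), np.zeros(4))
    layer2 = affine(rng.standard_normal((2, 4)), np.zeros(2))

    # the whole network is just sigmoid . layer2 . relu . layer1
    network = lambda x: sigmoid(layer2(relu(layer1(x))))
    print(network(np.array([1.0, -0.5, 2.0])))  # a 2-dimensional output

Since every piece is differentiable, the composition is too, which is what lets gradient-based optimisation work end to end.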

‘Primitives’ is also sometimes used as a synonym for antiderivatives, but I don’t think that’s what it means here.

Edit: it just occurred to me from a comment below that you might have meant to ask what the ‘differentiable’ part means. See https://en.wikipedia.org/wiki/Differentiable_function.

andoando:
Is this function composition essentially lambda calculus then?
xanderlewis:
Composition here just means what it does for any two functions: the value of the ‘composition’ of f and g at x is defined to be f applied to g applied to x. In symbols, it’s: (f∘g)(x) := f(g(x)) for each x in the domain of g. It may seem obvious, but the fact that this new thing is also a function (that is, its value is well-defined for every input) is actually a very useful thing indeed and leads to… well, most of mathematics.

You can certainly do function composition in lambda calculus: in fact, the act of composition itself is a higher order function (takes functions and returns a function) and you can certainly express it formally with lambda terms and such. It’s not really got anything to do with any particular language or model of computation though.
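For instance, a minimal sketch in Python (all names are just illustrative):

    def compose(f, g):
        # composition as a higher-order function: takes two functions, returns one
        return lambda x: f(g(x))

    double = lambda x: 2 * x
    square = lambda x: x * x

    h = compose(double, square)  # h(x) = 2 * x**2
    print(h(3))                  # 18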

andoando:
I didn't form my question too well. I understand all that. What I am asking is: are these function compositions equivalent or similar to functions in lambda calculus?

I guess my question is: what are the primitive functions here doing?

xanderlewis:
Well, yes, to the extent that functions are functions are functions (they’re just associations or mappings or whatever you want to call them).

Maybe your question boils down to asking something more general like: what’s the difference between functions to a computer scientist (or a programmer) and functions to a mathematician? That is, are ‘functions’ in C (or lambda calculus), say, the same ‘functions’ we talk about in calculus?

The answer to that is: in this case, because these are quite simple functions (sums and products and compositions thereof) they’re the same. In general, they’re a bit different. The difference is basically the difference between functional programming and ‘traditional’ programming. If you have state/‘side effects’ of functions, then your function won’t be a function in the sense of mathematics; if the return value of your function depends entirely on the input and doesn’t return different values depending on whatever else is happening in the program, then it will be.

Since you’re asking about lambda calculus in particular, the answer is that they’re the same because lambda calculus doesn’t have state. It’s ‘purely functional’ in that sense.
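If it helps, here’s a minimal sketch of that distinction in Python (the names and the mutable dict are invented purely for illustration):

    state = {"offset": 0}

    def impure_add(a, b):
        # not a mathematical function: the result depends on hidden, mutating state
        state["offset"] += 1
        return a + b + state["offset"]

    def pure_add(a, b):
        # a mathematical function: the result is determined entirely by the inputs
        return a + b

    print(impure_add(1, 2), impure_add(1, 2))  # 4 5 (same inputs, different outputs)
    print(pure_add(1, 2), pure_add(1, 2))      # 3 3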

>I guess my question is: what are the primitive functions here doing?

I’m not really sure what you mean. They’re doing what functions always do. Every computer program is abstractly a (partial) function.

Does that help, or have I misunderstood?

andoando:
So when I think of functions in lambda calculus, I think of the S, K, I combinators, which when composed can produce functions like "copy", "add", "remove", "if", etc., which can then do different computations like "copy every other symbol if the symbol is true" or "multiply 5 times then add 2". Since lambda calculus is Turing complete, any computation/program can be composed.
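Concretely, here’s roughly what I have in mind, written as curried Python functions purely for illustration:

    I = lambda x: x                               # identity
    K = lambda x: lambda y: x                     # constant
    S = lambda f: lambda g: lambda x: f(x)(g(x))  # substitution

    # composing them yields other behaviours, e.g. S K K acts like the identity:
    print(S(K)(K)(42))  # 42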

When I think of functions in a traditional mathematical sense, I think about transformations of numbers: x->2x, x->2x^2, etc. I completely understand composition of functions here, e.g. composing x->2x^2 with x->2x to get x->2(2x)^2, but it's unclear how these transformations relate to computation. For a regression problem, I can totally understand how finding the right composition of functions can lead to better approximations. So I am wondering, in an LLM architecture, what computations do these functions actually represent? I assume it has something to do with what path to take through the neural layers. I probably just need to take the time to study it deeper.

>If you have state/‘side effects’ of functions, then your function won’t be a function in the sense of mathematics; if the return value of your function depends entirely on the input and doesn’t return different values depending on whatever else is happening in the program, then it will be.

Totally understood from the perspective of functions in, say, Java. Though fundamentally I don't think there is a distinction between functions in computer science and mathematics. The program as a whole is effectively a function. The "global" state is, from another frame of reference, just local variables of the encompassing function. If a function is modifying variables outside of the "function block" (in, say, Java), the "input" to the function isn't just the parameters of the function. Imo, this is more of an artifact of the implementation of some languages rather than a fundamental difference. Python, for example, requires declaring global variables in the function block. Go one step further and require putting those globals into the parameter list and you're pretty close to satisfying this.
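A minimal Python illustration of that last point (my own toy example):

    counter = 0

    def bump():
        global counter  # assigning to a module-level name must be declared
        counter += 1

    bump()
    print(counter)  # 1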

xanderlewis:
I think you’re actually massively overthinking it.

The state of a neural network is described entirely by its parameters, which usually consist of a long array (well, a matrix, or a tensor, or whatever…) of floating point numbers. What is being optimised when a network is trained is these parameters and nothing else.

When you evaluate a neural network on some input (often called performing ‘inference’), that is when the functions we’re talking about are used. You start with the input vector, and you apply all of those functions in order and you get the output vector of the network. The training process also uses these functions, because to train a network you have to perform evaluation repeatedly in between tweaking those parameters to make it better approximate the desired output for each input. Importantly, the functions do not change. They are constant; it’s the parameters that change. The functions are the architecture — not the thing being learned.

Essentially what the parameters represent is how likely each neuron is to be activated (have a high value) if others in the previous layer are. So you can think of the parameters as encoding strengths of connections between each pair of neurons in consecutive layers. Thinking about ‘what path to take through the neural layers’ is way too sophisticated — it’s not doing anything like that.
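Here’s a minimal sketch of that separation in plain Python + NumPy (nothing here is a real training loop or anyone’s actual code; it only shows that the functions stay fixed while the numbers in params change):

    import numpy as np

    def forward(params, x):
        # the architecture: the same fixed primitives applied in the same order every time
        W1, b1, W2, b2 = params
        h = np.maximum(0.0, W1 @ x + b1)
        return W2 @ h + b2

    rng = np.random.default_rng(0)
    params = [rng.standard_normal((4, 3)), np.zeros(4),
              rng.standard_normal((1, 4)), np.zeros(1)]
    x, y = np.array([1.0, 2.0, 3.0]), np.array([1.0])

    loss = lambda p: float(np.sum((forward(p, x) - y) ** 2))
    print(loss(params))
    params[3] = params[3] + 0.1 * (y - forward(params, x))  # nudge one parameter
    print(loss(params))  # smaller: only the parameters moved; forward() never changed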

> Though fundamentally I don't think there is distinction between functions in computer science and mathematics. The program as a whole is effectively a function.

You’re pretty much right about that, but there are two important problems/nitpicks:

(1) We can’t prove (in general) that a given program will halt and evaluate to something (rather than just looping forever) on a given input, so the ‘entire program’ is instead what’s called a partial function. This means that it’s still a function on its domain — but we can’t know what its precise domain is. Given an input, it may or may not produce an output. If it does, though, it’s well defined because it’s a deterministic process.
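A classic illustration (the Collatz iteration; my choice of example, nothing to do with neural networks specifically):

    def collatz_steps(n):
        # nobody has proved this loop terminates for every positive integer n,
        # so this defines a partial function whose precise domain is unknown
        steps = 0
        while n != 1:
            n = 3 * n + 1 if n % 2 else n // 2
            steps += 1
        return steps

    print(collatz_steps(27))  # 111; but we can't say in advance exactly which inputs halt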

(2) You’re right to qualify that it’s the whole program that is (possibly) a function. If you take a function from some program that depends on some state in that same program, then clearly that function won’t be a proper ‘mathematical’ function. Sure, if you incorporate that extra state as one of your inputs, it might be, but that’s a different function. You have to remember that in mathematics, unlike in programming, a function consists essentially of three pieces of data: a domain, a codomain, and a ‘rule’. If you want to be set-theoretic and formal about it, this rule is just a subset of the cartesian product of its domain and codomain (it’s a set of pairs of the form (x, f(x))). If you change either of these sets, it’s technically a different function and there are good reasons for distinguishing between these. So it’s not right to say that mathematical functions and functions in a computer program are exactly the same.
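A toy way to see that last point (purely illustrative):

    # the rule x -> 2x, recorded as a set of (x, f(x)) pairs, over two different domains
    f = {(x, 2 * x) for x in {0, 1, 2, 3}}
    g = {(x, 2 * x) for x in {0, 1}}
    print(f == g)  # False: same rule, different domain, hence formally different functions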

andoando:
I appreciate your responses; sorry, I hope I don't seem like I'm arguing for the sake of arguing.

>Essentially what the parameters represent is how likely each neuron is to be activated (have a high value) if others in the previous layer are. So you can think of the parameters as encoding strengths of connections between each pair of neurons in consecutive layers. Thinking about ‘what path to take through the neural layers’ is way too sophisticated — it’s not doing anything like that.

I'm a little confused. The discussion thus far has been about how neural networks are essentially just compositions of functions, but you are now saying that the functions are static, and only the parameters change.

But that aside, if these parameters change which neurons are activated, and this activation affects which neurons are activated in the next layer, are these parameters effectively not changing the path taken through the layers?

>Sure, if you incorporate that extra state as one of your inputs, it might be, but that’s a different function.

So say we have this program: "let c = 2; function sum3(a, b) { return a + b + c; } let d = sum3(3, 4);"

I believe you are saying, if we had constructed this instead as

"function(a,b,c) { return a+b+c } let d = 3sum(3,4,2) "

then, this is a different function.

Certainly, these are different in a sense, but at a fundamental level, when you compile this all down and run it, there is an equivalence in the transformation that is happening. That is, the two functions equivalently take some input state A (composed of a, b, c) and return the same output state B, while applying the same intermediary steps (add a to b, then add c to the result). Really, in the first case where c is defined outside the scope of the function block, the interpreter is effectively producing the function sum3(a, b, c), as it has to, at some point, one way or another, inject c into a + b + c.
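In Python terms (just an illustration of the same contrast):

    c = 2

    def sum3_closure(a, b):
        # reads c from the enclosing scope
        return a + b + c

    def sum3_explicit(a, b, c):
        # c is an ordinary parameter
        return a + b + c

    print(sum3_closure(3, 4), sum3_explicit(3, 4, 2))  # 9 9: the same steps run either way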

Similarly, I won't argue that the current formal definitions of functions in mathematics are exactly those of functions as they're generally defined in programming.

Rather, what I am saying is that there is an equivalent way to think about and study functions that applies equally to both fields. That is, a function is simply a transformation from A to B, where A and B can be anything, whether that is bits, numbers, or any other construction in any system. The only primitive distinction to make here is whether A and B are the same thing or different.