235 points by tosh | 2 comments
xanderlewis:
> Stripped of anything else, neural networks are compositions of differentiable primitives

I’m a sucker for statements like this. It almost feels philosophical, and makes the whole subject so much more comprehensible in only a single sentence.

I think François Chollet says something similar in his book on deep learning: one shouldn’t fall into the trap of anthropomorphising and mystifying models based on the ‘neural’ name; deep learning is simply the application of sequences of operations that are nonlinear (and hence capable of encoding arbitrary complexity) but nonetheless differentiable and so efficiently optimisable.
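
To make that framing concrete, here's a tiny NumPy sketch of my own (not from Chollet's book or the paper below; all the names are mine): two linear maps with a tanh in between, trained by plain gradient descent, which only works because every primitive in the composition has a derivative we can chain together.

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(32, 4))          # a batch of 32 inputs with 4 features
  y = rng.normal(size=(32, 1))          # regression targets
  W1 = rng.normal(size=(4, 8)) * 0.1    # parameters of the first primitive (linear)
  W2 = rng.normal(size=(8, 1)) * 0.1    # parameters of the second primitive (linear)
  lr = 1e-2

  for step in range(200):
      # forward: linear -> tanh (nonlinear) -> linear -> squared-error loss
      h = np.tanh(X @ W1)
      pred = h @ W2
      loss = np.mean((pred - y) ** 2)

      # backward: chain rule applied to each primitive, in reverse order
      d_pred = 2 * (pred - y) / len(X)
      d_W2 = h.T @ d_pred
      d_h = d_pred @ W2.T
      d_W1 = X.T @ (d_h * (1 - h ** 2))  # tanh'(z) = 1 - tanh(z)^2

      # gradient descent: nudge every parameter downhill
      W1 -= lr * d_W1
      W2 -= lr * d_W2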

1. zackmorris:
Ya, so far this is the best introduction to neural networks from first principles that I've seen.

Quickly skimming the draft PDF at https://arxiv.org/pdf/2404.17625, I can grok it instantly because it's written in familiar academic language instead of gobbledygook. Anyone with an undergrad math education in engineering, computer science, etc., or a self-taught equivalent understanding of differential equations should be able to read it easily. It does a really good job of connecting esoteric terms like tensors with arrays, gradients with partial derivatives, Jacobians with gradients, and backpropagation with gradient descent in forward/reverse-mode automatic differentiation, which helps the reader grasp the fundamentals instead of being distracted by the implementation details of TensorFlow, CUDA, etc. Some notable excerpts:

Introduction (page 4):

  By viewing neural networks as simply compositions of differentiable primitives we can ask two basic questions (Figure F.1.3): first, what data types can we handle as inputs or outputs? And second, what sort of primitives can we use? Differentiability is a strong requirement that does not allow us to work directly with many standard data types, such as characters or integers, which are fundamentally discrete and hence discontinuous. By contrast, we will see that differentiable models can work easily with more complex data represented as large arrays (what we will call tensors) of numbers, such as images, which can be manipulated algebraically by basic compositions of linear and nonlinear transformations.
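
The point about discrete data tripped me up at first, so here's how I think about it (my own toy example, not code from the book): you can't differentiate through a character, but you can map each character to a row of a real-valued embedding matrix and differentiate with respect to that matrix instead.

  import numpy as np

  vocab = list("abc")                                        # a toy character vocabulary
  E = np.random.default_rng(0).normal(size=(len(vocab), 4))  # real-valued embedding table

  def embed(text):
      # the lookup itself is discrete, but the output is a dense tensor that is
      # differentiable with respect to the entries of E, which is what gets trained
      idx = [vocab.index(ch) for ch in text]
      return E[idx]                                          # shape (len(text), 4)

  x = embed("abca")
  print(x.shape)                                             # (4, 4)
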
Chapter 2.2 Gradients and Jacobians (page 23):

  [just read this section - it connects partial derivatives, gradients, Jacobians and Taylor’s theorem - wow!]
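
For reference, the three objects that section ties together (standard definitions, in my notation rather than the book's):

  % gradient of a scalar f, Jacobian of a vector-valued F, and the first-order
  % Taylor expansion that links them
  \nabla f(x) = \Big(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_c}\Big)^{\top},
  \qquad
  [\mathbf{J}_F(x)]_{ij} = \frac{\partial F_i(x)}{\partial x_j},
  \qquad
  F(x + \delta) \approx F(x) + \mathbf{J}_F(x)\,\delta .
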
Chapter 4.1.5 Some computational considerations (page 59):

  In general, we will always prefer algorithms that scale linearly both in the feature dimension c and in the batch size n, since super-linear algorithms will become quickly impractical (e.g., a batch of 32 RGB images of size 1024×1024 has c ≈ 1e7). We can avoid a quadratic complexity in the equation of the gradient by computing the multiplications in the correct order, i.e., computing the matrix-vector product Xw first. Hence, pure gradient descent is linear in both c and n, but only if proper care is taken in the implementation: generalizing this idea is the fundamental insight for the development of reverse-mode automatic differentiation, a.k.a. back-propagation (Section 6.3).
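
That ordering remark is easy to check yourself. A quick NumPy illustration (my own example, with much smaller sizes than the book's): for a gradient of the form X^T(Xw − y), doing the matrix-vector product Xw first keeps everything linear in both n and c, while forming X^T X first is quadratic in c.

  import numpy as np

  n, c = 1_000, 2_000
  rng = np.random.default_rng(0)
  X = rng.normal(size=(n, c))
  w = rng.normal(size=c)
  y = rng.normal(size=n)

  good = X.T @ (X @ w - y)          # two matrix-vector products: O(n*c) time
  bad = (X.T @ X) @ w - X.T @ y     # builds a c-by-c Gram matrix: O(n*c^2) time, O(c^2) memory

  print(np.allclose(good, bad))     # same result, very different cost
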
Chapter 6 Automatic differentiation (page 87):

  We consider the problem of efficiently computing gradients of generic computational graphs, such as those induced by optimizing a scalar loss function on a fully-connected neural network, a task called automatic differentiation (AD) [BPRS18]. You can think of a computational graph as the set of atomic operations (which we call primitives) obtained by running the program itself. We will consider sequential graphs for brevity, but everything can be easily extended to more sophisticated, acyclic computational graphs.
  
  The problem may seem trivial, since the chain rule of Jacobians (Section 2.2, (E.2.22)) tells us that the gradient of function composition is simply the matrix product of the corresponding Jacobian matrices. However, efficiently implementing this is the key challenge, and the resulting algorithm (reverse-mode AD or backpropagation) is a cornerstone of neural networks and differentiable programming in general [GW08, BR24]. Understanding it is also key to understanding the design (and the differences) of most frameworks for implementing and training such programs (such as TensorFlow or PyTorch or JAX). A brief history of the algorithm can be found in [Gri12].
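
To see why the ordering is the whole game, here's a bare-bones reverse-mode sketch I put together (a toy, nothing like the real machinery in TensorFlow/PyTorch/JAX): each primitive carries a vector-Jacobian product, and the backward pass multiplies them right-to-left so we only ever touch vectors, never full Jacobian matrices.

  import numpy as np

  W = np.random.default_rng(0).normal(size=(3, 3))
  x0 = np.array([1.0, -2.0, 0.5])

  # each primitive: (forward function, vector-Jacobian product of (input, upstream grad))
  primitives = [
      (lambda x: x @ W,          lambda x, g: g @ W.T),                    # linear map
      (lambda x: np.tanh(x),     lambda x, g: g * (1 - np.tanh(x) ** 2)),  # nonlinearity
      (lambda x: np.sum(x ** 2), lambda x, g: g * 2 * x),                  # scalar loss
  ]

  # forward pass through the sequential graph, remembering each primitive's input
  inputs, x = [], x0
  for f, _ in primitives:
      inputs.append(x)
      x = f(x)

  # reverse pass: start from dloss/dloss = 1 and accumulate vjps right-to-left
  g = np.array(1.0)
  for (_, vjp), inp in zip(reversed(primitives), reversed(inputs)):
      g = vjp(inp, g)

  print(g)   # gradient of the scalar output with respect to x0
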
Edit: I changed Chapter 2.2.3 Jacobians (page 27) to Chapter 2.2 Gradients and Jacobians (page 23) for better context.
2. barrenko:
Thank you for the summary, and thanks to the OP/author of the book.

I started self-studying programming some time ago, then pivoted to AI/ML and (understandably) ended up mostly studying math; these resources are a boon to folks like me.