
346 points | swatson741 | 1 comment
drivebyhooting | No.45788473
I have a naive question about backprop and optimizers.

I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.

But with more advanced optimizers the gradient is not really used directly: it gets normalized per weight, blended with momentum, clipped, and so on.
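
Something like this sketch is what I mean; it's an Adam-style update with element-wise clipping, and the names are my own, not any particular library's:

    import numpy as np

    def adam_like_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                       eps=1e-8, clip=1.0):
        # Clip the raw gradient element-wise.
        grad = np.clip(grad, -clip, clip)
        # Momentum: exponential moving average of past gradients.
        m = beta1 * m + (1 - beta1) * grad
        # Per-weight normalization: moving average of squared gradients.
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias correction for the zero-initialized averages (t starts at 1).
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # The step taken is a heavily massaged version of the gradient.
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

The direction actually stepped in can end up quite different from the raw gradient direction, which is what prompts my question.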

So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?

1. imtringued | No.45789234
All first-order methods use the gradient or Jacobian of a function, and calculating first-order derivatives is really cheap.
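
To make "cheap" concrete, take a toy least-squares objective (my own example, nothing from the thread): the exact gradient reuses the residual the loss already needs, so it costs about one extra matrix-vector product.

    import numpy as np

    def loss_and_grad(w, X, y):
        # Objective: 0.5 * ||X w - y||^2
        r = X @ w - y      # one matrix-vector product for the loss...
        loss = 0.5 * (r @ r)
        grad = X.T @ r     # ...and one more for the exact gradient
        return loss, grad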

Non-stochastic (full-batch) gradient descent computes the gradient over the entire dataset on every step. That's no problem outside machine learning, where there often is no dataset at all and the objective has a small, fixed size. In that setting the gradient is exact.
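
A minimal full-batch loop on that toy objective might look like this (step size and iteration count are arbitrary placeholders):

    import numpy as np

    def gradient_descent(X, y, lr=0.01, steps=1000):
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            # Exact gradient: touches every row of X and y on every step.
            grad = X.T @ (X @ w - y)
            w -= lr * grad
        return w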

With stochastic gradient descent you turn gradient descent into an online algorithm that processes a small subset of the dataset at a time. The gradient is obviously no longer exact, but you still have to calculate it.
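
The stochastic version only touches a slice of the data per step, so each gradient is exact for the minibatch but only an estimate of the full-dataset gradient. A sketch with a made-up batch size:

    import numpy as np

    def sgd(X, y, lr=0.01, steps=1000, batch_size=32, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            idx = rng.integers(0, n, size=batch_size)  # random minibatch
            Xb, yb = X[idx], y[idx]
            # Exact gradient of the minibatch loss, noisy estimate of the full loss.
            grad = Xb.T @ (Xb @ w - yb) / batch_size
            w -= lr * grad
        return w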

So "exactness" doesn't seem to be that useful a property for optimization. Also, I can't stress this enough: calculating first-order derivatives is so cheap that there is nothing to gain from approximating them. With reverse mode (backprop) it's roughly 2x the cost of evaluating the function in the first place.
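
And on whether a rough direction would be cheaper to get without calculus: the obvious derivative-free route is finite differences, which needs one extra function evaluation per parameter, so for millions of weights it is vastly more expensive than the roughly 2x of backprop. A sketch:

    import numpy as np

    def finite_difference_grad(f, w, h=1e-6):
        # n + 1 evaluations of f for an n-dimensional w -- hopeless at scale,
        # while reverse-mode autodiff stays at roughly 2x one evaluation of f.
        g = np.zeros_like(w)
        f0 = f(w)
        for i in range(len(w)):
            w_step = w.copy()
            w_step[i] += h
            g[i] = (f(w_step) - f0) / h
        return g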

It's second-order derivatives that you want to approximate using first-order ones; that's how BFGS and Gauss-Newton work.
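
For instance, Gauss-Newton for a least-squares residual r(w) uses only the Jacobian J of r and substitutes J^T J for the Hessian, so no second derivatives are ever computed. A bare-bones sketch, without the damping or line search a real implementation (e.g. Levenberg-Marquardt) would add:

    import numpy as np

    def gauss_newton(residual, jacobian, w, steps=20):
        # residual(w) -> r (m-vector); jacobian(w) -> J (m x n). First-order info only.
        for _ in range(steps):
            r = residual(w)
            J = jacobian(w)
            # J^T J stands in for the Hessian of 0.5 * ||r(w)||^2.
            step = np.linalg.solve(J.T @ J, J.T @ r)
            w = w - step
        return w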