
346 points | swatson741 | 1 comment
drivebyhooting | No.45788473
I have a naive question about backprop and optimizers.

I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.

But with more advanced optimizers the gradient is not really used directly: it gets normalized per weight, blended with momentum, clipped, and so on.
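
Something like this sketch is what I mean; it's an Adam-style update with element-wise clipping, and the names are my own, not any particular library's:

    import numpy as np

    def adam_like_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                       eps=1e-8, clip=1.0):
        # Clip the raw gradient element-wise.
        grad = np.clip(grad, -clip, clip)
        # Momentum: exponential moving average of past gradients.
        m = beta1 * m + (1 - beta1) * grad
        # Per-weight normalization: moving average of squared gradients.
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias correction for the zero-initialized averages (t starts at 1).
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # The step taken is a heavily massaged version of the gradient.
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

The direction actually stepped in can end up quite different from the raw gradient direction, which is what prompts my question.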

So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?

1. imtringued | No.45789234
All first-order methods use the gradient or Jacobian of a function, and calculating first-order derivatives is really cheap.
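
To make "cheap" concrete, take a toy least-squares objective (my own example, nothing from the thread): the exact gradient reuses the residual the loss already needs, so it costs about one extra matrix-vector product.

    import numpy as np

    def loss_and_grad(w, X, y):
        # Objective: 0.5 * ||X w - y||^2
        r = X @ w - y      # one matrix-vector product for the loss...
        loss = 0.5 * (r @ r)
        grad = X.T @ r     # ...and one more for the exact gradient
        return loss, grad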

Non-stochastic (full-batch) gradient descent computes the gradient over the entire dataset on every step. That's no problem outside machine learning, where there often is no dataset at all and the objective has a small, fixed size. In that setting the gradient is exact.
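
A minimal full-batch loop on that toy objective might look like this (step size and iteration count are arbitrary placeholders):

    import numpy as np

    def gradient_descent(X, y, lr=0.01, steps=1000):
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            # Exact gradient: touches every row of X and y on every step.
            grad = X.T @ (X @ w - y)
            w -= lr * grad
        return w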

With stochastic gradient descent you turn gradient descent into an online algorithm that processes a small subset of the dataset at a time. The gradient is obviously no longer exact, but you still have to calculate it.
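
The stochastic version only touches a slice of the data per step, so each gradient is exact for the minibatch but only an estimate of the full-dataset gradient. A sketch with a made-up batch size:

    import numpy as np

    def sgd(X, y, lr=0.01, steps=1000, batch_size=32, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            idx = rng.integers(0, n, size=batch_size)  # random minibatch
            Xb, yb = X[idx], y[idx]
            # Exact gradient of the minibatch loss, noisy estimate of the full loss.
            grad = Xb.T @ (Xb @ w - yb) / batch_size
            w -= lr * grad
        return w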

So "exactness" doesn't seem to be that useful a property for optimization. Also, I can't stress this enough: calculating first-order derivatives is so cheap that there is nothing to gain from approximating them. With reverse mode (backprop) it's roughly 2x the cost of evaluating the function in the first place.
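
And on whether a rough direction would be cheaper to get without calculus: the obvious derivative-free route is finite differences, which needs one extra function evaluation per parameter, so for millions of weights it is vastly more expensive than the roughly 2x of backprop. A sketch:

    import numpy as np

    def finite_difference_grad(f, w, h=1e-6):
        # n + 1 evaluations of f for an n-dimensional w -- hopeless at scale,
        # while reverse-mode autodiff stays at roughly 2x one evaluation of f.
        g = np.zeros_like(w)
        f0 = f(w)
        for i in range(len(w)):
            w_step = w.copy()
            w_step[i] += h
            g[i] = (f(w_step) - f0) / h
        return g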

It's second-order derivatives that you want to approximate using first-order ones; that's how BFGS and Gauss-Newton work.
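
For instance, Gauss-Newton for a least-squares residual r(w) uses only the Jacobian J of r and substitutes J^T J for the Hessian, so no second derivatives are ever computed. A bare-bones sketch, without the damping or line search a real implementation (e.g. Levenberg-Marquardt) would add:

    import numpy as np

    def gauss_newton(residual, jacobian, w, steps=20):
        # residual(w) -> r (m-vector); jacobian(w) -> J (m x n). First-order info only.
        for _ in range(steps):
            r = residual(w)
            J = jacobian(w)
            # J^T J stands in for the Hessian of 0.5 * ||r(w)||^2.
            step = np.linalg.solve(J.T @ J, J.T @ r)
            w = w - step
        return w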