
346 points | swatson741 | 1 comment
drivebyhooting ◴[] No.45788473
I have a naive question about backprop and optimizers.

I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.

But with more advanced optimizers the gradient is not really used directly: it gets normalized per weight, fudged with momentum, clipped, etc.

So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
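(For concreteness, a rough sketch of the kind of gradient massaging being described, as Python/NumPy pseudocode; the names and hyperparameters are illustrative, loosely following an Adam-style update rather than any particular library:)

    import numpy as np

    def sgd_step(w, grad, lr=1e-2):
        # Plain SGD: step directly along the (negative) gradient.
        return w - lr * grad

    def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # Momentum: exponential moving average of the gradient.
        m = b1 * m + (1 - b1) * grad
        # Per-weight normalization: moving average of the squared gradient.
        v = b2 * v + (1 - b2) * grad**2
        # Bias correction for the zero-initialized averages.
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        # The step taken depends on the gradient's recent history,
        # not just its current value.
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

The direction actually stepped is not the raw gradient, even though the raw gradient is what backprop delivers.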

replies(11): >>45788484 #>>45788525 #>>45788559 #>>45788604 #>>45789234 #>>45789345 #>>45789485 #>>45792574 #>>45792998 #>>45796071 #>>45799773 #
1. joe_the_user ◴[] No.45796071
There are a lot of approximation methods involved in training neural networks. But the main thing is that, while learning calculus is challenging, actually calculating the derivative of a function at a point using algorithmic differentiation is extremely fast and exact: nearly as exact as calculating the function's value itself, and inherently more efficient than finite-difference approximations to the derivative. Algorithmic differentiation is nearly "magic".
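A minimal illustration of that exactness, using a toy forward-mode implementation with dual numbers (deep-learning frameworks use reverse mode, i.e. backprop, but the exactness point is the same; the function f is arbitrary):

    class Dual:
        """Forward-mode AD: carry (value, derivative) through every operation."""
        def __init__(self, val, dot):
            self.val, self.dot = val, dot
        def __mul__(self, other):
            return Dual(self.val * other.val,
                        self.val * other.dot + self.dot * other.val)
        def __add__(self, other):
            return Dual(self.val + other.val, self.dot + other.dot)

    def f(x):
        return x * x * x + x          # works on floats and Duals alike

    x0 = 1.7
    exact = f(Dual(x0, 1.0)).dot      # algorithmic differentiation: 3*x0**2 + 1, exact
    h = 1e-5
    approx = (f(x0 + h) - f(x0)) / h  # finite difference: extra evaluation, truncation error
    print(exact, approx)

The AD result is the derivative evaluated exactly (up to floating point) at roughly the cost of the function evaluation itself; the finite-difference result carries an error that depends on the choice of h.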

But remember, that is for taking the derivative at a single data point. What's hard is the average derivative over the entire set of points, and that's where sampling and approximations (SGD etc.) come in.
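A sketch of that split, assuming a made-up least-squares loss: the gradient on whatever examples you pass in is exact, but averaging over the full dataset is the expensive part, so SGD estimates it from a random minibatch:

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100_000, 10)), rng.normal(size=100_000)
    w = np.zeros(10)

    def grad_loss(w, X, y):
        # Exact gradient of mean squared error over the rows given.
        residual = X @ w - y
        return 2 * X.T @ residual / len(y)

    full_grad = grad_loss(w, X, y)                 # exact, but touches every example

    idx = rng.choice(len(y), size=64, replace=False)
    minibatch_grad = grad_loss(w, X[idx], y[idx])  # unbiased estimate from 64 samples

The minibatch gradient is an unbiased but noisy estimate of the full average; that noise, not the derivative calculation itself, is where the approximation lives.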