drivebyhooting ◴[] No.45788473[source]
I have a naive question about backprop and optimizers.

I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.

But with more advanced optimizers the gradient is not really used directly: it gets normalized per weight, blended with momentum, clipped, and so on.

So really, how important is computing the exact gradient with calculus, versus just knowing the general direction to step in? Would that be cheaper to compute than the full derivatives? A rough sketch of what I mean by the gradient being transformed is below.
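
For concreteness, here is an illustrative numpy sketch (a toy loss, not any particular framework's optimizer implementation) of the kinds of transformations I mean:

    import numpy as np

    # Toy loss L(w) = 0.5 * ||w||^2, so the exact gradient is simply w.
    w = np.array([3.0, -2.0, 0.5])
    grad = w.copy()
    lr = 0.1

    # Plain SGD: step directly along the negative gradient.
    w_sgd = w - lr * grad

    # Adam-style step (single iteration, bias correction omitted for brevity):
    # the raw gradient is smoothed with momentum and normalized per weight.
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    m = (1 - beta1) * grad            # first-moment (momentum) estimate
    v = (1 - beta2) * grad**2         # second-moment (per-weight scale) estimate
    w_adam = w - lr * m / (np.sqrt(v) + eps)

    # Gradient clipping: cap the global norm of the gradient before stepping.
    max_norm = 1.0
    clipped = grad * min(1.0, max_norm / (np.linalg.norm(grad) + 1e-12))
    w_clipped = w - lr * clipped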

replies(11): >>45788484 #>>45788525 #>>45788559 #>>45788604 #>>45789234 #>>45789345 #>>45789485 #>>45792574 #>>45792998 #>>45796071 #>>45799773 #
blackbear_ ◴[] No.45788604[source]
Two thoughts:

> how important is computing the exact gradient using calculus

Normally the gradient is computed on a small "minibatch" of examples, so on average over many steps the true gradient is followed, but each individual step never moves exactly along the true gradient. This noisy walk is actually quite beneficial for the final performance of the network (https://arxiv.org/abs/2006.15081, https://arxiv.org/abs/1609.04836), so much so that people started wondering what the best way is to "corrupt" this approximate gradient even further to improve performance (https://arxiv.org/abs/2202.02831, and many other works on SGD noise).
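
As a toy illustration of how a minibatch gradient is noisy but points in roughly the right direction (a made-up least-squares problem; all names and sizes are just for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))              # 1000 examples, 10 features
    w_true = rng.normal(size=10)
    y = X @ w_true + 0.1 * rng.normal(size=1000)
    w = np.zeros(10)

    def grad(Xb, yb, w):
        # Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)^2)
        return Xb.T @ (Xb @ w - yb) / len(yb)

    full = grad(X, y, w)                          # "true" gradient over all data
    idx = rng.choice(1000, size=32, replace=False)
    mini = grad(X[idx], y[idx], w)                # noisy minibatch estimate

    cos = mini @ full / (np.linalg.norm(mini) * np.linalg.norm(full))
    print(cos)  # close to 1 but not exactly: right general direction, noisy step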

> vs just knowing the general direction to step

I can't find the relevant papers now, but I seem to recall that the Hessian eigenvalues of the loss function decay rather quickly, which means that a step in most directions will barely change the loss. That is to say, you have to know which direction to go quite precisely for an SGD-like method to work. People have also been visualizing the loss landscape and the trajectory taken during optimization: https://arxiv.org/pdf/1712.09913 , https://losslandscape.com/
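
To see the effect, here is a small sketch using a quadratic loss whose (diagonal) Hessian has fast-decaying eigenvalues; the spectrum and dimensions are made up purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 1000
    eigvals = 1.0 / (1.0 + np.arange(d))**2   # fast-decaying Hessian spectrum
    w = rng.normal(size=d)

    def loss(w):
        return 0.5 * np.sum(eigvals * w**2)   # L(w) = 0.5 * w^T H w, H diagonal

    grad = eigvals * w
    step = 0.1
    grad_dir = grad / np.linalg.norm(grad)
    rand_dir = rng.normal(size=d)
    rand_dir /= np.linalg.norm(rand_dir)

    print(loss(w) - loss(w - step * grad_dir))  # noticeable decrease
    print(loss(w) - loss(w - step * rand_dir))  # near zero: most directions are flat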

replies(1): >>45788828 #
1. raindeer2 ◴[] No.45788828[source]
The first bit is why it is called stochastic gradient descent: you follow the gradient of a randomly chosen minibatch at each step, which basically makes you "vibrate" down along the gradient.
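
A minimal sketch of that loop on a made-up least-squares problem (all names, sizes, and hyperparameters are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    w_true = rng.normal(size=10)
    y = X @ w_true + 0.1 * rng.normal(size=1000)

    w = np.zeros(10)
    lr, batch_size = 0.05, 32

    for step in range(2000):
        idx = rng.choice(len(y), size=batch_size, replace=False)  # random minibatch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size      # minibatch gradient
        w -= lr * grad   # each step is noisy, but on average it heads downhill

    print(np.linalg.norm(w - w_true))  # small: the noisy walk still converges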