(karpathy.medium.com)

346 points swatson741 | 1 comments | 02 Nov 25 05:20 UTC


drivebyhooting [02 Nov 25 07:21 UTC] No.45788473

I have a naive question about backprop and optimizers.

I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.

But with more advanced optimizers the gradient is not really used directly. It gets per-weight normalization, is smoothed with momentum, clipped, etc.

So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
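To make the question concrete, here is a minimal sketch of how far an optimizer can move away from the raw gradient. It assumes an Adam-style update with bias correction plus clipping by global L2 norm; the function names `clip_by_norm` and `adam_step` and the hyperparameter defaults are illustrative choices, not taken from any particular library.

```python
import math


def clip_by_norm(grad, max_norm):
    """Scale the whole gradient down if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0 or norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def adam_step(params, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update (illustrative sketch). Returns (params, m, v).

    m is the running mean of gradients (momentum), v is the running mean of
    squared gradients (used for per-weight normalization), t is the 1-based
    step count used for bias correction.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Two weights whose gradients differ by four orders of magnitude.
params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
params, m, v = adam_step(params, [100.0, -0.01], m, v, t=1, lr=0.1)
print(params)
```

On the very first step of this sketch, both weights move by roughly `lr` in magnitude, so only the sign of each partial derivative matters. On later steps the running averages bring the history of magnitudes back in, which is one reason the exact values are still worth computing.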


1. macleginn [02 Nov 25 10:45 UTC] No.45789345

>>45788473 #

It is possible to approximate the gradient (the direction to step) without using the analytic formulas: we can perturb each parameter individually by a small amount, recompute the loss, use the resulting changes in loss to decide how to move every parameter, and then repeat. This means, however, that we need on the order of number-of-parameters forward passes for a single optimization step, which is very expensive. With the formulas, we can compute all of these values in one backward pass.
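A small sketch of that cost difference, assuming a two-parameter linear model with mean squared error and a forward-difference estimate. The helper names (`loss`, `analytic_grad`, `finite_diff_grad`) and the toy data are made up for illustration; the hand-derived `analytic_grad` stands in for what a backward pass would produce.

```python
def loss(w, xs, ys):
    """Mean squared error of the linear model y = w[0] * x + w[1]."""
    return sum((w[0] * x + w[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def analytic_grad(w, xs, ys):
    """Exact gradient of the loss, derived by hand with the chain rule."""
    n = len(xs)
    residuals = [w[0] * x + w[1] - y for x, y in zip(xs, ys)]
    dw0 = sum(2 * r * x for r, x in zip(residuals, xs)) / n
    dw1 = sum(2 * r for r in residuals) / n
    return [dw0, dw1]


def finite_diff_grad(f, w, h=1e-6):
    """Forward-difference estimate: one extra evaluation of f per parameter.

    Returns (gradient estimate, number of loss evaluations used).
    """
    base = f(w)
    evaluations = 1
    grad = []
    for i in range(len(w)):
        bumped = list(w)
        bumped[i] += h
        grad.append((f(bumped) - base) / h)
        evaluations += 1
    return grad, evaluations


xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
w = [0.5, 0.0]

exact = analytic_grad(w, xs, ys)
approx, evaluations = finite_diff_grad(lambda p: loss(p, xs, ys), w)
print(exact, approx, evaluations)
```

Here the estimate needs `len(w) + 1` loss evaluations. That is harmless with two weights, but the count grows linearly with model size, whereas the analytic gradient costs roughly one extra pass regardless of the number of parameters.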


Backpropagation is a leaky abstraction (2016)