
346 points | swatson741 | 1 comment
drivebyhooting No.45788473
I have a naive question about backprop and optimizers.

I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.

But with more advanced optimizers the gradient is not really used directly: it gets normalized per weight, blended with momentum, clipped, and so on.
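
Roughly the kind of massaging I mean, as a numpy sketch of an Adam-style update with clipping (the hyperparameters and the gradient array are just illustrative stand-ins):

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, clip=1.0):
        # clip the global gradient norm before anything else
        norm = np.linalg.norm(grad)
        if norm > clip:
            grad = grad * (clip / norm)
        # momentum: running average of the gradient (first moment)
        m = b1 * m + (1 - b1) * grad
        # per-weight normalization: running average of the squared gradient (second moment)
        v = b2 * v + (1 - b2) * grad ** 2
        m_hat = m / (1 - b1 ** t)   # bias correction for early steps
        v_hat = v / (1 - b2 ** t)
        # the actual step: momentum direction, scaled per weight
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

    w = np.zeros(3)
    m = np.zeros(3)
    v = np.zeros(3)
    grad = np.array([10.0, -0.1, 0.001])   # stand-in for what backprop returns
    w, m, v = adam_step(w, grad, m, v, t=1)
    print(w)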

So really, how important is computing the exact gradient via calculus, versus just knowing the general direction to step in? Would that be cheaper to compute than full derivatives?
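
Here is a toy numpy sketch of what I mean by "just the general direction": keeping only the sign of each partial derivative (essentially signSGD) instead of the full gradient. The gradient values are made up for illustration:

    import numpy as np

    def sgd_step(w, grad, lr=0.01):
        # standard SGD: step proportional to the exact gradient
        return w - lr * grad

    def sign_step(w, grad, lr=0.01):
        # "general direction only": keep just the sign of each partial derivative
        return w - lr * np.sign(grad)

    w = np.array([1.0, 1.0])
    grad = np.array([100.0, 0.001])   # stand-in gradient with very different scales
    print(sgd_step(w, grad))    # roughly [0.0, 0.99999]: dominated by the first weight
    print(sign_step(w, grad))   # [0.99, 0.99]: both weights move by the same amount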

1. ainch No.45792998
That's an interesting idea; it sounds similar to the principles behind low-precision models like BitNet (where each weight is +1, -1, or 0).

That said, I know DeepSeek uses fp32 for its gradient updates even though it uses fp8 for inference, and a recent paper shows that RL + LLM training is shakier in bf16 than in fp16. Both of these imply that numerical precision in gradients still matters.
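
As a rough numpy sketch of the pattern I mean: full-precision master weights and gradient updates alongside a low-precision (BitNet-style ternary) copy used for the forward pass. The quantizer here is simplified and the "gradient" is a random stand-in for backprop:

    import numpy as np

    def ternary_quantize(w):
        # simplified BitNet-style quantizer: scale by the mean magnitude,
        # then round and clip each weight to -1, 0, or +1
        scale = np.mean(np.abs(w)) + 1e-8
        return np.clip(np.round(w / scale), -1, 1), scale

    rng = np.random.default_rng(0)
    w_master = rng.standard_normal(8).astype(np.float32)   # full-precision master weights

    for t in range(3):
        w_q, scale = ternary_quantize(w_master)            # low-precision copy for the forward pass
        grad = rng.standard_normal(8).astype(np.float32)   # stand-in for the backprop gradient
        w_master -= 0.01 * grad                            # update applied in full precision

The point being that the update itself still consumes the high-precision gradient, even when the weights the forward pass sees are extremely coarse.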