edit: to clarify, it may be possible to get this loss back and there is reason to be optimistic
Unclear why you think that, since experiments show the opposite.
In general the gradient seems to get too "bumpy" to do good gradient descent at lower levels of precision.
There are some papers showing that making the training loop aware of quantization can help the ultimate quantized performance, but I'm not aware of this being implemented at large scale.
Forgive my lack of proper terminology here.
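I believe the usual term is quantization-aware training: the forward pass round-trips values through the quantized grid so training "feels" the rounding error, while the backward pass treats the rounding as the identity (a "straight-through estimator"). A minimal sketch of that idea, assuming PyTorch; the class name and bit width are just for illustration:

```python
import torch

class FakeQuantize(torch.autograd.Function):
    """Forward: round-trip x through an int8-like grid so training
    sees the quantization error.  Backward: straight-through
    estimator, i.e. pretend the rounding was the identity."""

    @staticmethod
    def forward(ctx, x, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        # Quantize to the grid, then dequantize back to float.
        return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # pass gradient through; none for num_bits

# Hypothetical use on a weight tensor:
w = torch.randn(4, 4, requires_grad=True)
y = FakeQuantize.apply(w).sum()
y.backward()  # w.grad is well-defined despite the non-differentiable rounding
```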
However, you can reduce memory by doing mixed-precision training if you are careful. See section 2.3.1, "Loss Scaling To Preserve Small Gradient Magnitudes", in https://docs.nvidia.com/deeplearning/performance/mixed-preci...
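Roughly, as I read that doc: the loss is multiplied by a large constant before backprop so that small FP16 gradients don't flush to zero, and the gradients are divided back down before the weight update. A minimal sketch using PyTorch's built-in dynamic scaler (the model, loss, optimizer, and loader here are hypothetical placeholders):

```python
import torch

# Hypothetical setup: any model / loss / optimizer would do.
model = torch.nn.Linear(32, 10).cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(8, 32).cuda(), torch.randint(0, 10, (8,)).cuda())]

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # FP16 forward pass
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()      # backprop the *scaled* loss
    scaler.step(optimizer)             # unscales grads; skips step on inf/NaN
    scaler.update()                    # grows/shrinks the scale factor
```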
I'm not quite sure I understand what they are describing in 2.3.1: are they scaling those small gradient magnitudes up to try to "pull" you into those holes faster?
I was thinking that a way to go about it would be to increase the "mesh resolution" near the small hole, which in this case would mean using higher precision in the region local to the hole.
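The closest existing mechanism I know of is static rather than adaptive: you can opt a sensitive block out of the low-precision path so that one region runs in FP32 while the rest of the network runs in FP16. A sketch, assuming PyTorch autocast (the module itself is made up):

```python
import torch

class SensitiveBlock(torch.nn.Module):
    """A block forced to run in FP32 even when the surrounding
    network runs under FP16 autocast."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(512, 512)

    def forward(self, x):
        # Locally raise the "mesh resolution": disable autocast and
        # compute this region in full precision.
        with torch.autocast("cuda", enabled=False):
            return self.linear(x.float())
```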
It might be possible to work around this by estimating the gradient volatility through the n-th order derivatives, but you would then also have to deal with mixed-precision SIMD, which hardware doesn't really support.