> Interestingly, for models up to 1 billion parameters, the loss actually starts to increase again after reaching a minimum. This might be because as training continues, the model develops more complex, non-linear representations that our simple linear probe can’t capture as well. Or it might be that the model starts to overfit to the training data and loses its more general concept of nullability.
Double descent?
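For concreteness, the kind of linear probe the quoted passage describes can be sketched as a logistic regression fit on frozen hidden states. This is a minimal illustration, not the author's actual setup: the dimensionality, data, and "concept direction" below are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16            # hidden-state dimensionality (hypothetical)
n = 500           # number of probed activations (hypothetical)
direction = rng.normal(size=d)          # synthetic "concept" direction
X = rng.normal(size=(n, d))             # stand-in for frozen hidden states
y = (X @ direction > 0).astype(float)   # binary label, e.g. nullable or not

# Plain gradient descent on logistic-regression loss: the probe itself.
w = np.zeros(d)
b = 0.0
lr = 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
    w -= lr * (X.T @ (p - y) / n)
    b -= lr * np.mean(p - y)

acc = np.mean((p > 0.5) == y)
print(f"probe accuracy: {acc:.2f}")
```

If the model's representation of the concept drifted from linear to non-linear over training, this probe's loss would rise even though the information is still present, which is the first explanation the quote offers; overfitting would erase the information itself.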