
168 points | 1wheel | 1 comment
1. parentheses | No.40434771
I wonder how interpretability and training could interact. Some examples:

Imagine taking Claude, tweaking the weights relevant to some topic X, and then fine-tuning it on knowledge related to X. That could result in more neurons being recruited to learn about X.
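To make that concrete, here's a minimal sketch on a toy PyTorch MLP rather than Claude. It assumes (not from the comment above) that some interpretability pass has already produced a per-unit `relevance` score for topic X; the tweak is simply amplifying those units' incoming weights before fine-tuning on X-related data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
)

# Hypothetical output of an interpretability method: per-hidden-unit relevance to X.
relevance = torch.rand(128)
topic_units = relevance > 0.8  # units we believe encode topic X

# Step 1: "tweak weights relevant to X" -- amplify the incoming weights and biases
# of the identified units so they respond more strongly.
with torch.no_grad():
    model[0].weight[topic_units] *= 1.5
    model[0].bias[topic_units] *= 1.5

# Step 2: fine-tune on data related to X (random tensors stand in for a real corpus).
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for step in range(100):
    x = torch.randn(32, 64)       # placeholder for topic-X inputs
    target = torch.randn(32, 64)  # placeholder for topic-X targets
    loss = loss_fn(model(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Hypothesis: the amplified units now dominate the loss on X-related data,
# so gradient descent recruits them (and their neighbours) to represent X.
```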

Imagine performing this during training to amplify or reduce the importance of certain topics: train on a vast corpus, but intervene at various checkpoints so the network's knowledge distribution skews toward the topics you care about. This could be a way to get more performance out of MoE models.
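For the MoE case, one way the checkpoint-time intervention might look (again a toy sketch, with assumed names like `ToyMoE` and `skew_towards`, not anything Claude actually does) is nudging the router's bias so chosen experts see more of a topic's tokens, and therefore more of its gradient, for the rest of training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x):
        weights = F.softmax(self.gate(x), dim=-1)                  # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, dim, n_experts)
        return (outs * weights.unsqueeze(1)).sum(-1)               # (batch, dim)

moe = ToyMoE()

def skew_towards(expert_ids, amount=0.5):
    """Hypothetical checkpoint-time tweak: bias the router towards chosen experts,
    so subsequent training routes more topic-X tokens (and gradient) their way."""
    with torch.no_grad():
        for i in expert_ids:
            moe.gate.bias[i] += amount

# e.g. at a training checkpoint, push experts 0 and 1 to absorb more of topic X:
skew_towards([0, 1])
```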

I am not an expert. Just putting on my generalist hat here. Tell me I'm wrong because I'd be fascinated to hear the reasons.