Reminds me of this paper from a couple of weeks ago that isolated the "refusal vector" behind the model declining to answer certain prompts:
https://news.ycombinator.com/item?id=40242939
I love seeing the work here -- especially the way they identified a vector specifically for bad code. I've been exploring how we can use adversarial training to improve the quality of code generated by our LLMs, so using this technique to get contrasting examples of secure vs. insecure code (to bootstrap the training process) is really exciting.
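For anyone curious what that might look like in practice, here's a rough sketch of the difference-of-means idea from the refusal-vector work, just applied to secure vs. insecure code snippets instead of harmful prompts. The model, layer choice, and snippets below are placeholders I picked for illustration, not anything from the paper:

    # Minimal sketch: extract an "insecure code" direction as the difference of
    # mean hidden-state activations over insecure vs. secure snippets.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "gpt2"  # stand-in; the actual work used larger chat models
    LAYER = 6       # which hidden layer to probe (arbitrary choice here)

    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)
    model.eval()

    secure = ["cursor.execute('SELECT * FROM users WHERE id = ?', (uid,))"]
    insecure = ["cursor.execute('SELECT * FROM users WHERE id = ' + uid)"]

    def mean_hidden(snippets):
        # Average the chosen layer's hidden states over all tokens of all snippets.
        vecs = []
        for s in snippets:
            inputs = tok(s, return_tensors="pt")
            with torch.no_grad():
                out = model(**inputs, output_hidden_states=True)
            vecs.append(out.hidden_states[LAYER][0].mean(dim=0))
        return torch.stack(vecs).mean(dim=0)

    # The candidate "bad code" direction: insecure mean minus secure mean, normalized.
    direction = mean_hidden(insecure) - mean_hidden(secure)
    direction = direction / direction.norm()
    print(direction.shape)

With a direction like that in hand you could, in principle, steer generations toward or away from it to produce the contrasting secure/insecure pairs for training.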
Overall, fascinating stuff!!