Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

This is exceptionally cool. It’s very interesting to see how this can be used to better understand and shape LLM behavior, and I also think it offers an interesting roadmap for anthropology.

If we see LLMs as substantial compressed representations of human knowledge/thought/speech/expression (and, within that, a representation of the world around us), then the dictionary concepts that meaningfully explain this compressed representation should also share structure with human experience.

I don’t mean to take this as canonical; it’s representations all the way down. Still, I can’t help but wonder what the geometry of this dictionary concept space says about us.