
168 points | 1wheel | 1 comment
pdevr No.40429851
So, to summarize:

>Used "dictionary learning"

>Found abstract features

>Found similar/close features using distance

>Tried amplifying and suppressing features

Not trying to be snarky, but this sounds mundane in the ML/LLM world. Then again, significant advances have come from simple concepts. Would love to hear from someone who has been able to try this out.
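The four summarized steps can be sketched with off-the-shelf tools. This is a toy illustration only: scikit-learn's `DictionaryLearning` stands in for the sparse autoencoders the actual research trains, the "activations" are random data, and the amplification factor is arbitrary.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Toy "activations": 200 samples from a 16-dim hidden layer (random stand-in).
acts = rng.normal(size=(200, 16))

# 1. Dictionary learning: an overcomplete basis (32 atoms for 16 dims).
dl = DictionaryLearning(n_components=32, alpha=1.0, max_iter=50, random_state=0)
codes = dl.fit_transform(acts)   # sparse feature activations per sample
atoms = dl.components_           # (32, 16) learned "features"

# 2./3. Find the feature closest to atom 0 by cosine similarity.
sims = cosine_similarity(atoms)
np.fill_diagonal(sims, -np.inf)  # ignore self-similarity
closest = int(np.argmax(sims[0]))

# 4. Amplify one feature's coefficient and map back to activation space.
boosted = codes[0].copy()
boosted[closest] *= 5.0          # "clamp" the feature up (factor is arbitrary)
steered_act = boosted @ atoms    # reconstructed, steered activation
print(steered_act.shape)
```

In the real setup the dictionary is learned by a sparse autoencoder over transformer residual-stream activations, and step 4 is done by patching the steered activation back into the forward pass rather than just reconstructing a vector.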

sjkoelle No.40431136
The interesting advance in the Anthropic/MATS research program is the application of dictionary learning to the "superpositioned" latent representations of transformers to find more "interpretable" features. However, "interpretability" is generally scored via the explainer/interpreter paradigm, which is a bit ad hoc, and true automated circuit discovery (rather than simple concept representation) is still a bit off AFAIK.
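For readers unfamiliar with "superposition": the idea is that a layer can represent more sparse features than it has dimensions by assigning them nearly-orthogonal directions. A purely illustrative numpy sketch (random directions, not the paper's trained setup; all numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_feats = 64, 256                    # 256 features packed into 64 dims

# Random unit directions; high-dim random vectors are nearly orthogonal.
dirs = rng.normal(size=(n_feats, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# A sparse input: only 3 of the 256 features are active at once.
active = rng.choice(n_feats, size=3, replace=False)
x = dirs[active].sum(axis=0)

# Read each feature back out by projecting onto its direction.
scores = dirs @ x
recovered = np.argsort(scores)[-3:]
print(sorted(recovered), sorted(active))  # tend to agree when inputs are sparse enough
```

Dictionary learning over such representations tries to recover those feature directions from activations alone, which is why it surfaces more interpretable units than individual neurons do.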