
168 points | 1wheel | 1 comment
pdevr No.40429851
So, to summarize:

>Used "dictionary learning"

>Found abstract features

>Found similar/close features using distance

>Tried amplifying and suppressing features

Not trying to be snarky, but this sounds mundane in the ML/LLM world. Then again, significant advances have come from simple concepts. Would love to hear from someone who has been able to try this out.
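The four summarized steps can be sketched with off-the-shelf tools. This is a toy illustration only: scikit-learn's `DictionaryLearning` stands in for the sparse autoencoders the actual research trains, the "activations" are random data, and the amplification factor is arbitrary.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Toy "activations": 200 samples from a 16-dim hidden layer (random stand-in).
acts = rng.normal(size=(200, 16))

# 1. Dictionary learning: an overcomplete basis (32 atoms for 16 dims).
dl = DictionaryLearning(n_components=32, alpha=1.0, max_iter=50, random_state=0)
codes = dl.fit_transform(acts)   # sparse feature activations per sample
atoms = dl.components_           # (32, 16) learned "features"

# 2./3. Find the feature closest to atom 0 by cosine similarity.
sims = cosine_similarity(atoms)
np.fill_diagonal(sims, -np.inf)  # ignore self-similarity
closest = int(np.argmax(sims[0]))

# 4. Amplify one feature's coefficient and map back to activation space.
boosted = codes[0].copy()
boosted[closest] *= 5.0          # "clamp" the feature up (factor is arbitrary)
steered_act = boosted @ atoms    # reconstructed, steered activation
print(steered_act.shape)
```

In the real setup the dictionary is learned by a sparse autoencoder over transformer residual-stream activations, and step 4 is done by patching the steered activation back into the forward pass rather than just reconstructing a vector.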

sjkoelle No.40431136
The interesting advance in the Anthropic/MATS research program is the application of dictionary learning to the "superpositioned" latent representations of transformers to find more "interpretable" features. However, "interpretability" is generally scored via the explainer/interpreter paradigm, which is a bit ad hoc, and true automated circuit discovery (rather than simple concept representation) is still a bit off AFAIK.
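For readers unfamiliar with "superposition": the idea is that a layer can represent more sparse features than it has dimensions by assigning them nearly-orthogonal directions. A purely illustrative numpy sketch (random directions, not the paper's trained setup; all numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_feats = 64, 256                    # 256 features packed into 64 dims

# Random unit directions; high-dim random vectors are nearly orthogonal.
dirs = rng.normal(size=(n_feats, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# A sparse input: only 3 of the 256 features are active at once.
active = rng.choice(n_feats, size=3, replace=False)
x = dirs[active].sum(axis=0)

# Read each feature back out by projecting onto its direction.
scores = dirs @ x
recovered = np.argsort(scores)[-3:]
print(sorted(recovered), sorted(active))  # tend to agree when inputs are sparse enough
```

Dictionary learning over such representations tries to recover those feature directions from activations alone, which is why it surfaces more interpretable units than individual neurons do.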