Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

(transformer-circuits.pub)

168 points 1wheel | 1 comments | 21 May 24 15:15 UTC

e63f67dd-065b ◴[22 May 24 02:29 UTC] No.40436757[source]▶

I find Anthropic's work on mech interp fascinating in general. Their initial "Towards Monosemanticity" paper was highly surprising, and so is this one, with its ability to scale to a real production-scale LLM.

My observation, which may be more philosophical than technical: is this process of "decomposing" middle-layer activations with a sparse autoencoder accurately capturing underlying features in the latent space of the network, or are we drawing order from chaos, imposing monosemanticity where there isn't any? Or to put it another way, were the features always there, learnt by training, or are we doing post-hoc rationalisation, where the features exist because that's how we defined the autoencoder's dictionary, and we learn only what we wanted to learn? Are the alien minds of LLMs truly operating in a semantic space similar to ours, or are we reading tea leaves and seeing what we want to see?

Maybe this distinction doesn't even make sense to begin with. Concepts are made by man; if clamping one of these features modifies outputs in a way that is understandable to humans, it doesn't matter whether it's capturing some kind of underlying cluster in the latent space of the model. But I do think it's an interesting idea to ponder.
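For anyone who wants the mechanics spelled out, here is a minimal sketch of what "decomposing" and "clamping" could look like, assuming a plain ReLU sparse autoencoder. All names (`sae_encode`, `sae_decode`, `clamp_feature`) are hypothetical, the weights are random rather than trained, and adding the reconstruction error back in is my own assumption, not something taken from the paper.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # Feature activations: an affine map followed by ReLU, so they are non-negative.
    # (Sparsity would come from a penalty during training, which is not shown here.)
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    # Reconstruction: a weighted sum of dictionary directions (the rows of W_dec).
    return f @ W_dec + b_dec

def clamp_feature(x, W_enc, b_enc, W_dec, b_dec, idx, value):
    # Encode, force one feature to a chosen value, decode again.
    f = sae_encode(x, W_enc, b_enc)
    # Whatever the autoencoder fails to reconstruct is kept and added back
    # (an assumption of this sketch), so only the clamped feature changes x.
    error = x - sae_decode(f, W_dec, b_dec)
    f_clamped = f.copy()
    f_clamped[..., idx] = value
    return sae_decode(f_clamped, W_dec, b_dec) + error

# Toy usage with random, untrained weights (illustrative sizes only).
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)  # stand-in for one middle-layer activation vector
steered = clamp_feature(x, W_enc, b_enc, W_dec, b_dec, idx=3, value=10.0)
```

In this setup, clamping moves the activation along a single dictionary direction, `W_dec[idx]`. Whether that direction was "always there" or is an artifact of how the dictionary was fit is exactly the question above; the code doesn't answer it.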

replies(3): >>40436897 #>>40437384 #>>40438098 #

refulgentis ◴[22 May 24 02:56 UTC] No.40436897[source]▶

>>40436757 #

I'm allergic to "latent space" because I've yet to find any meaning in it beyond poetics, and I develop an acute allergy when it's explicitly tied to visual, dimensional ideas like clustering.

I'll make a probably bad analogy: does your mindmap place things near each other like my mindmap?

To which I'd say: probably not. Mindmaps are very personal, and the more complexity we put on ours, the more personal and arbitrary they become, and the less import the visuals have.

E.g., if we each have 3 million things on our mindmaps, it's peering too closely to wonder why you put McDonald's closer to kids' food than to restaurants, with restaurants in the top left, whereas I put it closer to kids' food, in the top mid-left.

replies(4): >>40438743 #>>40438944 #>>40441108 #>>40449058 #

1. canjobear ◴[22 May 24 14:00 UTC] No.40441108[source]▶

>>40436897 #

Neural network representation spaces seem to converge, regardless of architecture: https://arxiv.org/abs/2405.07987

It would make sense for human mental latent spaces to converge as well. The reason is that a latent space exists to model the environment, which is largely shared among humans.
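To make "converge" a bit less poetic: where something sits on the map (top left vs. top mid-left) is arbitrary, but the relative structure can still be compared. Below is a minimal sketch using linear CKA, one common similarity index for representations. I picked it purely for illustration and am not claiming it is the metric the linked paper uses; the function name and the toy data are made up.

```python
import numpy as np

def linear_cka(X, Y):
    # X, Y: (n_examples, n_dims) representations of the same n examples.
    # Centre each dimension so the score ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                   # toy "model A" representations
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # a random orthogonal matrix
rotated = 3.0 * (X @ Q)                          # same structure, rotated and rescaled
unrelated = rng.normal(size=(200, 16))           # independent random representations

print(linear_cka(X, rotated))    # should be very close to 1
print(linear_cka(X, unrelated))  # should be much lower
```

The rotated copy scores as essentially identical even though every individual coordinate has moved, which is the sense in which two "mindmaps" can agree without putting anything in the same corner.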
