
168 points by 1wheel | 1 comment | source
e63f67dd-065b ◴[] No.40436757[source]
I find Anthropic's work on mech interp fascinating in general. Their initial Towards Monosemanticity paper was highly surprising, and so is this one, with its ability to scale to a real production-scale LLM.

My observation is, and this may be more philosophical than technical: this process of "decomposing" middle-layer activations with a sparse autoencoder -- is it accurately capturing underlying features in the latent space of the network, or are we drawing order from chaos, imposing monosemanticity where there is none? Or to put it another way, were the features always there, learnt during training, or are we doing post-hoc rationalisation -- do the features exist only because that's how we defined the autoencoder's dictionary, so that we learn only what we wanted to learn? Are the alien minds of LLMs truly operating on a semantic space similar to ours, or are we reading tea leaves and seeing what we want to see?
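
To make the setup concrete, here's a minimal sketch of the dictionary-learning idea in PyTorch. The layer sizes, the plain ReLU encoder and the L1 penalty are illustrative assumptions on my part, not Anthropic's actual architecture or training recipe:

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        # Overcomplete dictionary: many more features than activation dimensions.
        def __init__(self, d_model=512, d_dict=4096):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_dict)  # activations -> feature coefficients
            self.decoder = nn.Linear(d_dict, d_model)  # feature coefficients -> reconstruction

        def forward(self, acts):
            features = torch.relu(self.encoder(acts))  # non-negative, hopefully sparse codes
            return self.decoder(features), features

    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

    # Stand-in for a batch of middle-layer (residual stream) activations.
    acts = torch.randn(64, 512)

    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()  # reconstruction + L1 sparsity
    loss.backward()
    opt.step()

The question then is whether the columns of that learned decoder correspond to anything the network actually uses, or merely to whatever the sparsity penalty happens to carve out.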

Maybe this distinction doesn't even make sense to begin with; concepts are made by man, and if clamping one of these features modifies outputs in a way that is understandable to humans, it doesn't matter whether it's capturing some kind of underlying cluster in the latent space of the model. But I do think it's an interesting idea to ponder.
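
For what "clamping" means mechanically, here is a sketch continuing the toy SAE above; the feature index and the clamp value are arbitrary placeholders rather than real discovered features:

    with torch.no_grad():
        _, features = sae(acts)
        clamped = features.clone()
        clamped[:, 123] = 10.0               # force one dictionary feature strongly "on"
        steered_acts = sae.decoder(clamped)  # decode back into activation space
    # In a real intervention, steered_acts would replace the original middle-layer
    # activations during the model's forward pass, and you'd look at how the
    # outputs change.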

replies(3): >>40436897 #>>40437384 #>>40438098 #
baq ◴[] No.40438098[source]
> concepts are made by man

I find this statement... controversial?

The canonical example would be mathematics - is it discovered or invented? Does the idea of '3', or an empty set, or a straight line exist without any humans thinking about it? Is it even necessary to have any kind of universe at all for these concepts to be valid? I think the answers here are 'yes' and 'no'.

Of course, there are still concepts which require grounding in the universe or humanity, but if you can think these up first (...somehow), you should need neither.

replies(2): >>40441134 #>>40445950 #
1. ben_w ◴[] No.40445950[source]
It may be literally controversial[0], but I don't think it's wrong.

Yes, whether maths is discovered or invented is an interesting (and open) question. But also, the rules of maths are the result of some set of axioms — it's not clear to me[1] that the axioms we have are necessarily the ones we must have, even though ours are clearly a really useful set.

We put labels onto the world to make it easier to deal with, but every time I look closer at any concept which has a physical reality associated with it, I find that it's unclear where the boundary should be.

What's a "word"? Does hyphenation or concatenation modify the boundary? What if it was concatenated in a different language and the meaning of the concatenation was loaned separately to the parts, e.g. "schadenfreude"? Was "Brexit" still a word before it was coined — and if yes then what else is, and if no then when did it become a word?

What's a "fish"? Dolphins are mammals, jellyfish have no CNS, molluscs glue themselves to a rock and digest their own brain.

What's a "species"? Not all mules are sterile.

Where's the cut-off between a fertilised human egg and a person? And on the other end, when does death happen?

What counts as "one" anglerfish, given the reproductive cycle has males attaching to and dissolving into the females?

There's only a smooth gradient with no sudden cut-offs going from dust to asteroids to minor planets to rocky planets to gas giants to brown dwarfs to stars.

There aren't really seven colours in the rainbow, and we have a lot more than five senses — there's not really a good reason to group "pain" and "gentle pressure" as both "touch", except to make it five.

[0] giving rise or likely to give rise to public disagreement

[1] however this is quite possibly due to me being wildly oblivious; the example I'd use is that Euclid's parallel postulate turned out to be independent of the other axioms rather than necessary, but so far as I am aware all the others are considered unavoidable?