    168 points by 1wheel | 15 comments
    1. e63f67dd-065b ◴[] No.40436757[source]
    I find Anthropic's work on mech interp fascinating in general. Their initial Towards Monosemanticity paper was highly surprising, and so is this one, with the ability to scale to a real production-scale LLM.

    My observation is, and this may be more philosophical than technical: this process of "decomposing" middle-layer activations with a sparse autoencoder -- is it accurately capturing underlying features in the latent space of the network, or are we drawing order from chaos, imposing monosemanticity where there is none? Or to put it another way, were the features always there, learnt by training, or are we doing post-hoc rationalisation -- where the features exist because that's how we defined the autoencoders' dictionaries, and we learn only what we wanted to learn? Are the alien minds of LLMs truly operating on a semantic space similar to ours, or are we reading tea leaves and seeing what we want to see?

    Maybe this distinction doesn't even make sense to begin with; concepts are made by man, and if clamping one of these features modifies outputs in a way that is understandable to humans, it doesn't matter whether it's capturing some kind of underlying cluster in the latent space of the model. But I do think it's an interesting idea to ponder.
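    For concreteness, the kind of sparse autoencoder being discussed can be sketched in a few lines. This is only a toy PyTorch version -- the layer widths, the L1 coefficient, and the random stand-in activations are illustrative, not Anthropic's actual setup:

    ```python
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Toy sparse autoencoder over middle-layer activations.

        Each activation vector is decomposed into a much wider set of
        non-negative feature activations; an L1 penalty pushes most
        features to zero for any given input, which is where the
        hoped-for monosemanticity comes from.
        """
        def __init__(self, d_model=512, d_features=4096):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_features)
            self.decoder = nn.Linear(d_features, d_model)

        def forward(self, acts):
            features = torch.relu(self.encoder(acts))  # sparse feature activations
            recon = self.decoder(features)             # reconstruction of the input
            return recon, features

    def sae_loss(acts, recon, features, l1_coeff=1e-3):
        # Reconstruction error plus a sparsity penalty on the features.
        mse = ((recon - acts) ** 2).mean()
        l1 = features.abs().mean()
        return mse + l1_coeff * l1

    # Stand-in batch; in practice these would be activations collected from the LLM.
    acts = torch.randn(64, 512)
    sae = SparseAutoencoder()
    recon, features = sae(acts)
    loss = sae_loss(acts, recon, features)
    loss.backward()
    ```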

    replies(3): >>40436897 #>>40437384 #>>40438098 #
    2. refulgentis ◴[] No.40436897[source]
    I'm allergic to "latent space" because I've yet to find any meaning to it beyond poetics, and I develop an acute allergy when it's explicitly related to visually dimensional ideas like clustering.

    I'll make a probably bad analogy: does your mindmap place things near each other like my mindmap?

    To which I'd say, probably not: mindmaps are very personal, and the more complexity we put on ours, the more personal and arbitrary they would be, and the less import the visuals would have.

    ex. if we have 3 million things on both our mindmaps, it's peering too closely to wonder why you put McDonald's closer to kids' food than to restaurants, with restaurants in the top left, whereas I put it closer to kids' food, in the top mid left.

    replies(4): >>40438743 #>>40438944 #>>40441108 #>>40449058 #
    3. kromem ◴[] No.40437384[source]
    Their manipulation of the vectors and the effects produced suggest that the SAE isn't just finding phantom representations that aren't really there.
    4. baq ◴[] No.40438098[source]
    > concepts are made by man

    I find this statement... controversial?

    The canonical example would be mathematics -- is it discovered or invented? Does the idea of '3', or an empty set, or a straight line exist without any humans thinking about it, and is it even necessary to have any kind of universe at all for these concepts to be valid? I think the answers here are 'yes' and 'no'.

    Of course, there are still concepts which require grounding in the universe or humanity, but if you can think these up first (...somehow), you should need neither.

    replies(2): >>40441134 #>>40445950 #
    5. anentropic ◴[] No.40438743[source]
    what if you averaged over millions of peoples' mindmaps?
    6. TeMPOraL ◴[] No.40438944[source]
    Why would that matter? The absolute orientation of the mind map doesn't matter - maybe my map is actually very close to yours, subject to some rotation and mirroring?

    More than that, I'd think a better 2D analogy for the latent space is a force-directed graph that you keep shaking as you add things to it. It doesn't seem unlikely for two such graphs, constructed in different order, to still end up identical in the end.

    Thirdly:

    > if we have 3 million things on both our mindmaps, it's peering too closely to wonder why you put McDonald's closer to kids' food than to restaurants, with restaurants in the top left, whereas I put it closer to kids' food, in the top mid left.

    In the 2D analogy, maybe, but that's because of limited space. In a 20,000-D analogy, there's no reason for our mind maps to meaningfully differ here; there are enough dimensions that terms can be close to other terms for any relationship you could think of.

    replies(2): >>40439115 #>>40443415 #
    7. marmadukester39 ◴[] No.40439115{3}[source]
    This sounds a bit similar to how marketers have long thought about brands and how they cluster in people's minds.
    8. canjobear ◴[] No.40441108[source]
    Neural network representation spaces seem to converge, regardless of architecture: https://arxiv.org/abs/2405.07987

    It would make sense for the human mental latent spaces to also converge. The reason is that the latent space exists to model the environment, which is largely shared among humans.
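    One way to make "converge" concrete is a similarity measure that ignores rotation and scale, such as linear CKA, which much of that literature uses. A rough sketch, with random matrices standing in for two models' activations on the same inputs:

    ```python
    import numpy as np

    def linear_cka(X, Y):
        """Linear centered kernel alignment between two activation matrices.

        X, Y: (n_samples, d1) and (n_samples, d2) activations of two models
        on the same inputs. Invariant to orthogonal transforms and scaling.
        """
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
        return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 256))          # stand-in activations of model A
    Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))
    Y = X @ Q                                 # model B: same representation, rotated
    Z = rng.normal(size=(1000, 512))          # unrelated activations

    print(linear_cka(X, Y))   # ~1.0: identical up to rotation
    print(linear_cka(X, Z))   # noticeably lower for unrelated random activations
    ```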

    9. skybrian ◴[] No.40441134[source]
    People often modify their environment to make their concepts work. This is true even of counting:

    https://metarationality.com/pebbles

    10. refulgentis ◴[] No.40443415{3}[source]
    > there's no reason for our mind maps to meaningfully differ here

    Yes there is.

    If you think all training runs converge to the same bits given the same output size, I would again stress that the visual dimensions analogy is poetics and extremely tortured.

    If you're making the weaker claim that concepts generally sort themselves into a space, and that they're generally sorted the same way if we have the same training data, or that rotational symmetry means any differences don't matter, or that location doesn't matter at all... we're in poetics.

    Something that really sold me when I was in a similar mindset was learning that word2vec's "king - man + woman = queen" wasn't actually real, or in the model; it was just a way of explaining it simply.

    Another thought from my physics days: try visualizing 4D. Some people do claim to be able to, after much effort, but in my experience they're unserious; i.e. I didn't see PhDs or master's students in my program claiming this. No one even tries claiming they can see in 5D.

    replies(2): >>40446829 #>>40447940 #
    11. ben_w ◴[] No.40445950[source]
    It may be literally controversial[0], but I don't think it's wrong.

    Yes, maths is an interesting (and open) question. But also, the rules of maths are the result of some set of axioms — it's not clear to me[1] that the axioms we have are necessarily the ones we must have, even though ours are clearly a really useful set.

    We put labels onto the world to make it easier to deal with, but every time I look closer at any concept which has a physical reality associated with it, I find that it's unclear where the boundary should be.

    What's a "word"? Does hyphenation or concatenation modify the boundary? What if it was concatenated in a different language and the meaning of the concatenation was loaned separately to the parts, e.g. "schadenfreude"? Was "Brexit" still a word before it was coined — and if yes then what else is, and if no then when did it become a word?

    What's a "fish"? Dolphins are mammals, jellyfish have no CNS, molluscs glue themselves to a rock and digest their own brain.

    What's a "species"? Not all mules are sterile.

    Where's the cut-off between a fertilised human egg and a person? And on the other end, when does death happen?

    What counts as "one" anglerfish, given the reproductive cycle has males attaching to and dissolving into the females?

    There's only a smooth gradient with no sudden cut-offs going from dust to asteroids to minor planets to rocky planets to gas giants to brown dwarf stars.

    There aren't really seven colours in the rainbow, and we have a lot more than five senses — there's not really a good reason to group "pain" and "gentle pressure" as both "touch", except to make it five.

    [0] giving rise or likely to give rise to public disagreement

    [1] however this is quite possibly due to me being wildly oblivious; the example I'd use is that one of Euclid's axioms turned out to be unnecessary, but so far as I am aware all the others are considered unavoidable?

    12. TeMPOraL ◴[] No.40446829{4}[source]
    Yes, I'm making the weaker claim that concepts would generally sort themselves into roughly equivalent structures, that could be mapped to each other through some easy affine transformations (rotation, symmetry, translation, etc.) applied to various parts of the structures.

    Or, in other words, I think absolute coordinates of any concept in the latent space are irrelevant and it makes no sense to compare them between two models; what matters is the relative position of concepts with respect to other concepts, and I expect the structures to be similar here for large enough datasets of real text, even if those data sets are disjoint.

    (More specific prediction: take a typical LLM dataset, say Books3 or Common Crawl, randomly select half of it as dataset A, the remainder is dataset B. I expect that two models of the same architecture, one trained on dataset A, other on dataset B, should end up with structurally similar latent spaces.)

    > Something that really sold me when I was in a similar mindset was word2vec's king - man + woman = queen wasn't actually real or in the model. Just a way of explaining it simply.

    Huh, it seems I took the opposite understanding from word2vec: I expect that "king - man + woman = queen" should hold in most models. What I mean by structural similarity could be described as such equations mostly holding across models for a significant number of concepts.
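    This is also easy to check directly on published embeddings. A quick sketch with gensim (the pretrained GloVe vectors are one arbitrary choice and are downloaded on first use; the usual caveat is that most_similar excludes the query words themselves, which is part of the dispute over how "real" the analogy is):

    ```python
    import gensim.downloader as api

    # Any pretrained word-embedding set works here; this one is small.
    vectors = api.load("glove-wiki-gigaword-100")

    # "king - man + woman": note that most_similar excludes the query words.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # On these vectors, "queen" typically comes out on top.
    ```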

    replies(1): >>40464771 #
    13. sdwr ◴[] No.40447940{4}[source]
    I think you are hung up on the visual representation.

    Last week, the post about jailbreaking ChatGPT(?) talked about turning off a direction in possibility-space to disable the "I'm sorry, but I can't..." message.

    In a regular program, it would be a boolean variable, or a single ASM instruction.

    And you could ask the same thing. "How does my program have an off switch if there aren't enough values to store all possible meanings of "off"? Does my off switch variable map to your off switch variable?"

    And the answer would be yes, or no, or it doesn't matter. It's a tool/construct.
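    The "direction in possibility-space" version of that off switch is just linear algebra: project the component along one learned direction out of the activations. A toy numpy illustration -- the "refusal direction" here is random, not taken from any real model:

    ```python
    import numpy as np

    def ablate_direction(acts, direction):
        """Remove each activation's component along `direction`.

        acts: (n_tokens, d_model); direction: (d_model,).
        The linear-algebra analogue of flipping one boolean off.
        """
        d = direction / np.linalg.norm(direction)
        return acts - np.outer(acts @ d, d)

    rng = np.random.default_rng(0)
    refusal_dir = rng.normal(size=64)     # stand-in for a learned "refusal" direction
    acts = rng.normal(size=(10, 64))      # stand-in residual-stream activations

    steered = ablate_direction(acts, refusal_dir)
    # The steered activations now have zero component along that direction.
    print(np.allclose(steered @ refusal_dir, 0.0))   # True
    ```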

    14. rcxdude ◴[] No.40449058[source]
    I mean, it's mostly about how close concepts are to each other, and to some extent how different concepts are placed on a given axis. Of course the concept space is very high-dimensional, so it's not easy to visualise without reducing the dimensions, but because we mostly care about distance, that reduction is not particularly lossy. It does mean that top-left vs bottom-right doesn't mean much; it's more that McDonald's is usually closer to food than it is to, say, gearboxes (and a representation that doesn't do that probably doesn't understand the concepts very well).
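    The "not particularly lossy" part can be checked cheaply: random projections down to a few hundred dimensions preserve pairwise distances quite well (the Johnson-Lindenstrauss effect), even if a 2D picture is much rougher. A toy check with made-up high-dimensional vectors:

    ```python
    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection
    from scipy.spatial.distance import pdist
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    # Stand-in "concept vectors": 4096-dim, but with low intrinsic dimension,
    # which is the regime where distance structure is meaningful at all.
    X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 4096))

    proj = GaussianRandomProjection(n_components=256, random_state=0)
    X_low = proj.fit_transform(X)

    # Pairwise distances before and after the 4096 -> 256 reduction.
    r, _ = pearsonr(pdist(X), pdist(X_low))
    print(f"correlation of pairwise distances: {r:.3f}")   # close to 1
    ```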
    15. wumbo ◴[] No.40464771{5}[source]
    What would be an appropriate test?

    - Given 2 word embedding sets,

    - For each pair (A,B) of embeddings in one set,

    - There exists an equivalent pair (A’,B’) in the other set,

    - Such that dist(A,B) ≈ dist(A’, B’),

    Something like that, to start. But would need to look at longer chains of relations.
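    A minimal version of that test, assuming the two embedding sets share a vocabulary so "equivalence" can just mean "same word": compare all pairwise distances and correlate them, which ignores absolute coordinates, rotation and scale. The random vectors below are only stand-ins for two trained models' embeddings:

    ```python
    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    def pairwise_distance_agreement(emb_a, emb_b, words):
        """Compare two embedding sets over a shared vocabulary.

        emb_a, emb_b: dicts mapping word -> vector (dimensions may differ).
        Returns the rank correlation between all pairwise cosine distances,
        which is insensitive to absolute coordinates, rotation, and scale.
        """
        A = np.stack([emb_a[w] for w in words])
        B = np.stack([emb_b[w] for w in words])
        rho, _ = spearmanr(pdist(A, "cosine"), pdist(B, "cosine"))
        return rho

    # Toy usage with random "embeddings"; real inputs would be two trained models.
    rng = np.random.default_rng(0)
    words = [f"w{i}" for i in range(200)]
    emb_a = {w: rng.normal(size=100) for w in words}
    emb_b = {w: rng.normal(size=300) for w in words}
    print(pairwise_distance_agreement(emb_a, emb_b, words))
    # ~0 for unrelated random sets; models trained on similar text should score much higher.
    ```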