
168 points by 1wheel | 9 comments
1. wwarner ◴[] No.40429827[source]
Huge. The activation scan, which looks for the nodes that change the most when the model is prompted with the words "Golden Gate Bridge" (and later with an image of the same bridge), is eerily reminiscent of a brain scan under similar prompts...
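
A rough sketch of the general idea (not Anthropic's actual method): compare hidden-state activations for a prompt that mentions the concept against a neutral prompt, and rank units by how much they change. GPT-2 via HuggingFace transformers is used here purely as a stand-in model, and the prompts are made up.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
    model.eval()

    def mean_activations(text):
        inputs = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # average over tokens -> one activation vector per layer
        return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

    a = mean_activations("The Golden Gate Bridge spans the bay.")
    b = mean_activations("The weather was mild that afternoon.")

    for layer, (ha, hb) in enumerate(zip(a, b)):
        top = torch.topk((ha - hb).abs(), k=5)
        print(f"layer {layer}: most-changed units {top.indices.tolist()}")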
replies(2): >>40429890 #>>40429981 #
2. verdverm ◴[] No.40429890[source]
I find this outcome expected and not all that surprising; it reads as more confirmation of previous results. Consider vision transformers and the papers that showed what each layer focused on.
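
For reference, a hedged sketch of the kind of per-layer inspection those ViT papers do, using a public HuggingFace checkpoint as a stand-in (the checkpoint name and image file are only illustrative):

    import torch
    from PIL import Image
    from transformers import ViTImageProcessor, ViTModel

    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
    model = ViTModel.from_pretrained("google/vit-base-patch16-224", output_attentions=True)
    model.eval()

    image = Image.open("bridge.jpg").convert("RGB")   # illustrative file name
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    # out.attentions: one (batch, heads, tokens, tokens) tensor per layer
    for layer, attn in enumerate(out.attentions):
        # attention the [CLS] token pays to image patches, averaged over heads
        cls_to_patches = attn[0, :, 0, 1:].mean(dim=0)
        top = torch.topk(cls_to_patches, k=5)
        print(f"layer {layer}: most-attended patches {top.indices.tolist()}")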
replies(1): >>40430291 #
3. ◴[] No.40429981[source]
4. wwarner ◴[] No.40430291[source]
Well, that's exactly the point -- no such result is available for language models.
replies(1): >>40430365 #
5. verdverm ◴[] No.40430365{3}[source]
There are multiple papers and efforts that have inspected the internal state of LLMs. One could even read the word2vec analyses along these lines, as evidence that the model is specializing neurons.

One such example: The Internal State of an LLM Knows When It's Lying (https://arxiv.org/abs/2304.13734)

Searching phrases like "llm interpretability" and "llm activation analysis" uncovers more:

https://github.com/JShollaj/awesome-llm-interpretability
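
As a concrete (if dated) illustration of the word2vec point: directions in the embedding space carry interpretable meaning, an early hint that models organize internal state around concepts. A minimal sketch using gensim's pretrained GloVe vectors as a stand-in for the word2vec-style analysis (the model name is just one small public option):

    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")   # small pretrained vectors

    # classic analogy: king - man + woman ~ queen
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

    # neighbours of a concept word cluster around related concepts
    print(vectors.most_similar("bridge", topn=5))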

replies(1): >>40430816 #
6. wwarner ◴[] No.40430816{4}[source]
Yes, lots of activity in the space. I thought you were saying it was a dumb problem, but I was wrong.

I think this is a great paper.

replies(1): >>40431476 #
7. verdverm ◴[] No.40431476{5}[source]
Yup. If you look at dropout -- what it does and why -- you can see additional interesting results along these lines.

(Dropout was found to increase resilience in models because they had to encode information in the weights differently, i.e. they could not rely on any single neuron (at the limit).)
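
A minimal sketch of what dropout does mechanically (PyTorch, with p=0.5 matching the 50% figure mentioned below):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    drop = nn.Dropout(p=0.5)
    x = torch.ones(1, 8)

    drop.train()       # training mode: ~half the units are zeroed, survivors scaled by 1/(1-p)
    print(drop(x))     # e.g. tensor([[2., 0., 2., 0., 2., 2., 0., 0.]])

    drop.eval()        # eval mode: dropout is a no-op
    print(drop(x))     # tensor([[1., 1., 1., 1., 1., 1., 1., 1.]])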

replies(1): >>40432655 #
8. wwarner ◴[] No.40432655{6}[source]
I suppose, except that for a model with 7B parameters, the number of dropout combinations you'd be analyzing is on the order of 2^7B. More importantly, dropout has loss minimization to guide it during training, whereas understanding how a model changes when you edit a few weights is a very broad question.
replies(1): >>40432848 #
9. verdverm ◴[] No.40432848{7}[source]
The analysis is more akin to comparing the model with and without dropout, where a common setting is to drop a random 50% of connections during each training pass, thus forcing the model not to rely on specific nodes or connections.

When you look at a specific input, you can see what gets activated and what doesn't. These are orthogonal but related ideas for inspecting the activations to see the effects.
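
A hedged sketch of that kind of inspection: forward hooks record which units fire for a specific input, comparing a pass with dropout active against one with it disabled. The tiny MLP is just a stand-in, not any particular LLM.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    # tiny stand-in network: linear -> relu -> dropout -> linear
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 4))

    activations = {}
    def record(name):
        def hook(module, inputs, output):
            activations[name] = output.detach()
        return hook

    model[2].register_forward_hook(record("after_dropout"))  # output of the dropout layer

    x = torch.randn(1, 16)      # one specific input

    model.train()               # dropout active
    model(x)
    with_dropout = (activations["after_dropout"] != 0).sum().item()

    model.eval()                # dropout disabled
    model(x)
    without_dropout = (activations["after_dropout"] != 0).sum().item()

    print(f"nonzero units for this input: {with_dropout} with dropout, {without_dropout} without")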