
168 points 1wheel | 3 comments
optimalsolver ◴[] No.40429473[source]
>what the model is "thinking" before writing its response

An actual "thinking machine" would be constantly running computations on its accumulated experience in order to improve its future output and/or further compress its sensory history.

An LLM is doing exactly nothing while waiting for the next prompt.

replies(5): >>40429486 #>>40429493 #>>40429606 #>>40429761 #>>40429847 #
byteknight ◴[] No.40429493[source]
I disagree with this. That suggests that thinking requires persistent, malleable and non-static memory. That is not the case. You can reason perfectly well without increasing your knowledge if you have a base set of logic.

I think the thing you were looking for was more along the lines of a persistent autonomous agent.

replies(2): >>40429744 #>>40430109 #
HarHarVeryFunny ◴[] No.40430109[source]
Sure, you can reason over a fixed "base set of logic", although there's another name for that - an expert system with a fixed set of rules, which IMO is really the right way to view an LLM.

Still, what current LLMs are doing with their fixed rules is only a very limited form of reasoning, since they just use a fixed N steps of rule application to generate each word. People are looking to techniques such as "group of experts" prompting to improve reasoning - at each step, generate multiple candidate responses, evaluate them, then proceed to the next step.
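
In toy form, that step-wise generate-then-evaluate scheme could look something like the sketch below (sample_step and score are made-up stand-ins for an LLM call and an evaluator, not any particular API):

    import random

    def sample_step(prefix: str) -> str:
        # stand-in for "LLM, extend this reasoning by one step"
        return prefix + f" step-{random.randint(0, 9)};"

    def score(candidate: str) -> float:
        # stand-in for an evaluator ranking partial reasoning chains
        return random.random()

    def reason(prompt: str, n_steps: int = 3, n_candidates: int = 4) -> str:
        state = prompt
        for _ in range(n_steps):
            candidates = [sample_step(state) for _ in range(n_candidates)]
            state = max(candidates, key=score)  # keep the best-scoring continuation
        return state

    print(reason("Q: why is the sky blue? Reasoning:"))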

replies(1): >>40430171 #
whimsicalism ◴[] No.40430171[source]
If you zoom in enough, all thinking is an expert system with a fixed set of rules.
replies(2): >>40430181 #>>40431190 #
HarHarVeryFunny ◴[] No.40431190[source]
That's the basis of it, but in our brain the "inference engine" using those rules is a lot more than a fixed N steps - there is thalamo-cortical looping, working memory of various durations, and maybe a bunch of other mechanisms such as analogical recall, resonance-based winner-takes-all processing, etc.

Current LLMs have none of that - they are just the fixed set of rules, further limited by also having a fixed number of steps of rule application.

replies(1): >>40431884 #
whimsicalism ◴[] No.40431884[source]
Yes, LLMs don't have recurrence, and that is a significant limitation - although they do have something close: by decoding one token at a time they get a kind of thought loop. They just can't loop without outputting.
replies(1): >>40432360 #
HarHarVeryFunny ◴[] No.40432360[source]
Well, not exactly a loop. They get to "extend the thought", but there is zero continuity from one word to the next (the LLM starts from scratch for each token generated).

The effect is as if multiple people were playing a game where they take turns extending a sentence by one word: there is zero continuity from one word to the next because each person starts from scratch on their turn.
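
A toy decode loop makes the analogy concrete - next_word here is a made-up stand-in for a full forward pass, and the only thing carried from turn to turn is the text itself:

    def next_word(text: str) -> str:
        # stand-in for running the whole model over `text` from scratch
        table = {"the man": "ran", "the man ran": "for", "the man ran for": "mayor"}
        return table.get(text, "<eos>")

    text = "the man"
    while True:
        word = next_word(text)   # a fresh computation over the full prefix
        if word == "<eos>":
            break
        text = f"{text} {word}"  # the only state handed to the next "person"
    print(text)                  # -> "the man ran for mayor"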

replies(1): >>40433168 #
whimsicalism ◴[] No.40433168[source]
> LLM starts from scratch for each token generated

What do you mean? They get to access their previous hidden states in the next greedy decode via attention; it is not simply starting from scratch. They can access exactly what they were thinking when they put out the previous word, not just reason from the word itself.
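
As a rough single-head numpy sketch of what I mean (random weights, nothing trained - just the shape of the computation):

    import numpy as np

    d = 8
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

    def attend(hidden):
        # hidden: (seq_len, d) - the states computed for every token so far
        q = hidden[-1] @ W_q              # query for the newest position only
        K = hidden @ W_k                  # keys over all positions so far
        V = hidden @ W_v                  # values over all positions so far
        w = np.exp(q @ K.T / np.sqrt(d))
        w /= w.sum()
        return w @ V                      # a mixture of the earlier "thoughts"

    states = rng.standard_normal((5, d))  # pretend hidden states for 5 tokens
    print(attend(states).shape)           # (8,)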

replies(1): >>40434011 #
1. HarHarVeryFunny ◴[] No.40434011[source]
There's the KV cache kept from one word to the next, but isn't that just an optimization?
replies(1): >>40434448 #
2. whimsicalism ◴[] No.40434448[source]
Yes, the 'KV cache' (IMO a manufactured term - everyone was doing this before someone gave it a name to make it sound clever) is an optimization so that you don't have to recompute, every time you decode a new word, what the model was thinking when it generated all the prior words.

But that's exactly my point - the model has access to what it was thinking when it generated the previous words; it does not start from scratch. If you don't have the KV cache, you still have to regenerate what it was thinking about the previous words, so that when generating the next word it can look back at what it was thinking for those words. Does that make sense? I'm not great at putting this stuff into words.
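
A toy way to see the "just an optimization" point: with a fixed prefix, the keys/values are a deterministic function of the tokens, so caching them and recomputing them give identical rows (made-up single-layer numpy, not any real model):

    import numpy as np

    d, vocab = 8, 100
    rng = np.random.default_rng(0)
    embed = rng.standard_normal((vocab, d))
    W_k, W_v = rng.standard_normal((d, d)), rng.standard_normal((d, d))

    def kv(tokens):
        # recompute keys/values for every token from scratch
        h = embed[tokens]
        return h @ W_k, h @ W_v

    prefix = [3, 14, 15]
    K_cache, V_cache = kv(prefix)    # the "KV cache" built while decoding the prefix

    extended = prefix + [92]         # one more token has been decoded
    K_full, V_full = kv(extended)    # no cache: recompute everything

    # the cached rows and the recomputed rows agree exactly on the shared prefix
    print(np.allclose(K_full[:3], K_cache), np.allclose(V_full[:3], V_cache))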

replies(1): >>40435222 #
3. HarHarVeryFunny ◴[] No.40435222[source]
I don't think you can really say it "regenerates" what it was thinking from the last prompt, since the new prompt is different from the previous one (it has the new word appended to the end, which may change the potential meanings of the sentence).

There will be some overlap in what the model is now "thinking" (and has calculated from scratch) since the new prompt is one possible continuation of the previous one, but other things it was previously "thinking" will no longer be there.

e.g. Say the prompt was "the man", and output probabilities include "in" and "ran", reflecting the model thinking of potential continuations such as "the man in the corner" and "the man ran for mayor". Suppose the word sampled was "ran", so now the new prompt is "the man ran". Possible continuations can no longer include refining who the subject is, since the new word "ran" implies the continuation must now be an action.

There is some work saved, via the KV cache, in processing the new prompt, but only for the parts (self-attention among the common prefix of the two prompts) that would not change if recalculated. What the model is thinking has changed, and will continue to change depending on the next sampled continuation ("the man ran for mayor", "the man ran for cover", "the man ran his bath", etc.).
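
In toy numpy terms (random weights and made-up token ids, just to show the shape of the argument): the keys for the shared prefix come out identical either way, which is all the cache buys you, but the attention output at the newest position - the part that actually drives the next word - is different once "ran" is appended:

    import numpy as np

    d = 8
    rng = np.random.default_rng(1)
    tok = {"the": 0, "man": 1, "ran": 2}
    embed = rng.standard_normal((len(tok), d))
    W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

    def newest_thought(words):
        h = embed[[tok[w] for w in words]]
        K, V = h @ W_k, h @ W_v
        q = h[-1] @ W_q                 # query at the newest position
        w = np.exp(q @ K.T / np.sqrt(d))
        w /= w.sum()
        return K, w @ V                 # keys so far, and the new "thought"

    K_a, out_a = newest_thought(["the", "man"])
    K_b, out_b = newest_thought(["the", "man", "ran"])
    print(np.allclose(K_a, K_b[:2]))    # True: prefix keys unchanged (the cacheable part)
    print(np.allclose(out_a, out_b))    # False: what the model is "thinking" has moved on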