whimsicalism ◴[] No.40429692[source]
I continue to be impressed by Anthropic’s work and their dual commitment to scaling and safety.

HN commentary on these developments often takes a very negative tone, but I really do feel that Anthropic is trying to run a "race to the top" on alignment, though it doesn't seem like the other major companies are doing enough to race with them.

Particularly frustrating on HN is the common syllogism:

1. I believe anything that "thinks" must do X.
2. The LLM doesn't do X.
3. Therefore the LLM doesn't think.

X is usually poorly justified as constitutive of thinking (it may be constitutive of human thinking, but not of thinking writ large), and it's rarely explained why it matters whether the label "thinking" applies to an LLM or not if the capabilities remain the same.

replies(2): >>40429967 #>>40430005 #
phyalow ◴[] No.40430005[source]
A lot of this really isn't new; Andrej Karpathy covered the principles 8 years ago for CS231n at Stanford: https://youtu.be/yCC09vCHzF8&t=1640
replies(2): >>40430130 #>>40430252 #
whimsicalism ◴[] No.40430130[source]
neural probing has been around for a while, true - and this result is definitely building on past work. it's basically just a scaled-up version of their paper from a little while ago, anywho
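to make "scaled-up" concrete: here's a minimal sketch (my own toy illustration, not Anthropic's code - the shapes, sizes and hyperparameters are all made up) of the sparse-autoencoder / dictionary-learning idea, i.e. learn an overcomplete set of feature directions such that each activation decomposes into a sparse, non-negative combination of them:

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model, n_features):
            super().__init__()
            self.encoder = nn.Linear(d_model, n_features)  # activation -> feature coefficients
            self.decoder = nn.Linear(n_features, d_model)  # coefficients -> reconstruction

        def forward(self, acts):
            feats = torch.relu(self.encoder(acts))         # non-negative, (hopefully) sparse codes
            return self.decoder(feats), feats

    d_model, n_features = 512, 4096                        # illustrative sizes only
    sae = SparseAutoencoder(d_model, n_features)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    acts = torch.randn(4096, d_model)                      # stand-in for real model activations

    for step in range(200):
        recon, feats = sae(acts)
        loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()

the point is that interpretable features come out of a learned dictionary rather than from eyeballing individual neurons, which is what lets the approach scale.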

but Karpathy was looking at very simple LSTMs of 1-3 layers and inspecting individual nodes/cells, and those results have thus far been difficult to replicate in large-scale transformers. Karpathy also doesn't give a recipe for doing this in his paper, which makes me think he was just guessing and checking various cells, and the representations he discovered are very simple.
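for contrast, the cell-by-cell inspection Karpathy was doing looks roughly like this (again my own sketch - an untrained toy LSTM standing in for his trained char-RNN, and the cell index and sizes are arbitrary):

    import torch
    import torch.nn as nn

    text = 'print("hello world")  # a comment'
    vocab = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(vocab)}

    embed = nn.Embedding(len(vocab), 16)
    lstm = nn.LSTM(input_size=16, hidden_size=64, num_layers=2, batch_first=True)
    ids = torch.tensor([[stoi[ch] for ch in text]])

    with torch.no_grad():
        hidden, _ = lstm(embed(ids))      # (1, seq_len, hidden_size); a real char-RNN would be trained

    cell = 7                              # pick a cell by hand and eyeball its trace over the text
    for ch, value in zip(text, hidden[0, :, cell]):
        print(f"{ch!r} {value.item():+.3f}")  # does it track quotes? brackets? position in line?

guess-and-check like that is workable with 64 cells, but it doesn't obviously transfer to a transformer with millions of neurons whose features are spread across many of them.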