
168 points | 1wheel
whimsicalism No.40429692
I continue to be impressed by Anthropic’s work and their dual commitment to scaling and safety.

HN often takes a very negative tone toward any of these developments, but I really do feel that Anthropic is trying to drive a “race to the top” on alignment, even if the other major companies don’t seem to be doing enough to race with them.

Particularly frustrating on HN is the common syllogism:

1. Anything that “thinks” must do X.

2. LLMs don’t do X.

3. Therefore, LLMs don’t think.

X is usually poorly justified as constitutive of thinking (it is often constitutive of human thinking, but not of thinking writ large), and it is rarely explained why it matters whether the label “thinking” applies to an LLM at all if the capabilities remain the same.
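
For concreteness: the argument has the shape of modus tollens, so the inference itself is fine and all the weight rests on premise 1. A toy Lean formalization (mine, with made-up predicate names):

    section
    variable {Agent : Type} (Thinks DoesX : Agent → Prop) (llm : Agent)

    -- Valid modus tollens: given h1 and h2, the conclusion follows.
    -- The contested step is h1, the claim that thinking requires doing X.
    example (h1 : ∀ a, Thinks a → DoesX a) (h2 : ¬ DoesX llm) : ¬ Thinks llm :=
      fun hThinks => h2 (h1 llm hThinks)
    end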

replies(2): >>40429967 >>40430005
phyalow No.40430005
A lot of this really isn't new; Andrej Karpathy covered the principles 8 years ago in CS231n at Stanford: https://youtu.be/yCC09vCHzF8?t=1640
replies(2): >>40430130 >>40430252
kalkin No.40430252
This is an illustrative comment for meta reasons, I think. Karpathy's lecture almost certainly doesn't cover the superposition hypothesis (which hadn't been invented for ANNs 8 years ago), or sparse dictionary learning (whose application to ANNs is motivated by the superposition hypothesis). It certainly doesn't talk about actual specific features found in post-ChatGPT language models. What's happening here seems like a thing LLMs are often accused of dismissively - you're pattern-matching to certain associated words without really reasoning about what is or isn't new in this paper.
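
(For anyone who hasn't followed this line of work: the superposition hypothesis says a layer can represent many more sparsely-active features than it has neurons, and sparse dictionary learning tries to pull those features apart by training a wide, sparsity-penalized autoencoder on the layer's activations. A toy sketch of that setup, in the spirit of the paper rather than its actual training code; the dimensions and the L1 coefficient here are invented:

    import torch
    import torch.nn.functional as F

    d_model, d_dict = 512, 4096  # dictionary far wider than the activation space
    W_enc = (torch.randn(d_model, d_dict) * 0.02).requires_grad_()
    W_dec = (torch.randn(d_dict, d_model) * 0.02).requires_grad_()
    b_enc = torch.zeros(d_dict, requires_grad=True)
    opt = torch.optim.Adam([W_enc, W_dec, b_enc], lr=1e-4)

    def train_step(acts, l1_coeff=1e-3):
        # acts: (batch, d_model) activations captured from the language model
        f = F.relu(acts @ W_enc + b_enc)   # sparse feature activations
        recon = f @ W_dec                  # reconstruct activations from features
        loss = F.mse_loss(recon, acts) + l1_coeff * f.abs().sum(-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    for _ in range(10):                    # toy loop on random "activations"
        train_step(torch.randn(64, d_model))

The reconstruction term pushes the dictionary to explain the activations; the L1 term pushes each input to use few features, which is what makes the recovered features interpretable.)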

I worry this is going to come across as insulting, but that's not my intention. I do this too sometimes; I think everyone does. The point is we shouldn't define true reasoning so narrowly that we think no system capable of it would ever be caught doing what most of us are in fact doing most of the time.

replies(2): >>40431101 >>40434732
ben_w No.40434732
> I worry this is going to come across as insulting, but that's not my intention. I do this too sometimes; I think everyone does. The point is we shouldn't define true reasoning so narrowly that we think no system capable of it would ever be caught doing what most of us are in fact doing most of the time.

Indeed; to me, LLMs pattern-match (yes, I did spot the irony) to System 1 thinking, and they do a better job of it than we humans do.

Fortunately for all of us, they're no good at System 2 thinking themselves, and only mediocre at translating problems into a form usable by a formal logic system that does excel at System 2 thinking.
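
To make that translation step concrete, here's a toy of the division of labor (my example; it assumes the z3-solver package, and that an LLM has already turned a word problem into the constraints below):

    from z3 import Ints, Solver, sat

    # Constraints an LLM might emit for: "Alice is twice Bob's age;
    # their ages sum to 36." The solver, not the LLM, does the System 2 work.
    alice, bob = Ints("alice bob")
    s = Solver()
    s.add(alice == 2 * bob, alice + bob == 36, bob > 0)

    if s.check() == sat:
        m = s.model()
        print(f"alice={m[alice]}, bob={m[bob]}")  # alice=24, bob=12

The mediocre part is producing those constraints correctly in the first place; once they exist, the solver is reliable.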