168 points by 1wheel | 13 comments
1. kromem No.40437406
Great work as usual.

I was pretty upset seeing the superalignment team dissolve at OpenAI, but as is typical for the AI space, the news of one day was quickly eclipsed by the next day's.

Anthropic are really killing it right now, and it's very refreshing seeing their commitment to publishing novel findings.

I hope this finally serves as the nail in the coffin for the "it's just fancy autocomplete" and "it doesn't understand what it's saying, bro" rhetoric.

replies(3): >>40437912 #>>40441380 #>>40445446 #
2. jwilber No.40437912
Love Anthropic's research. Great visuals from Olah, Carter, and Pearce, as well.

I don't think this paper does much to address your final point, the "it doesn't understand what it's saying" rhetoric, though our understanding of the models certainly has improved.

replies(1): >>40438375 #
3. kromem No.40438375
They were able to demonstrate conceptual vectors that are consistent across different languages and different media (text vs. images), and that, when manipulated, cause the abstract concept to show up in the output regardless of the prompt.

What kind of evidentiary threshold would you want if that's not sufficient?
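
(For anyone wondering what "manipulating" one of those conceptual vectors looks like mechanically, here is a rough sketch of the generic activation-steering idea in PyTorch. The tiny linear layer stands in for one transformer block and the random unit vector for a learned concept direction, e.g. a sparse-autoencoder feature; it only shows the shape of the intervention, not Anthropic's actual setup.)

    import torch
    import torch.nn as nn

    # Stand-ins for the sketch: one linear layer in place of a transformer
    # block, and a random unit vector in place of a learned concept direction.
    d_model = 64
    block = nn.Linear(d_model, d_model)
    concept = torch.randn(d_model)
    concept = concept / concept.norm()

    def steer(module, inputs, output, scale=8.0):
        # Returning a value from a forward hook replaces the block's output,
        # nudging the activations toward the concept direction.
        return output + scale * concept

    handle = block.register_forward_hook(steer)
    steered = block(torch.randn(1, d_model))  # output now carries the concept
    handle.remove()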

replies(1): >>40444944 #
4. Workaccount2 No.40441380
> on the "it's just fancy autocomplete" and "it doesn't understand what it's saying, bro" rhetoric.

No matter what, there will always be a group of people saying that. The power and drive of the brain to convince itself that it is woven of magical energy on a divine substrate shouldn't be underestimated. Especially when media plays so hard into that idea (the robots that lose the war because they cannot overcome love, etc.), because brains really love being told they are right.

I am almost certain that the first conscious silicon (or whatever material) will be subjected to immense suffering until a new generation that can accept the human brain's banality can move things forward.

replies(1): >>40446059 #
5. jwilber No.40444944
My point is that you claimed this as a rebuttal to those saying models don't understand themselves. Your interpretation seems to assign intelligence to the algorithms.

While this research allows us to interpret larger models in an amazing way, it doesn’t mean the models themselves ‘understand’ anything.

You can use this on much smaller scale models as well, as they showed 8 months ago. Does that research tell us about how models understand themselves? Or does it help us understand how the models work?

replies(1): >>40446963 #
6. astrange No.40445446
I think the research is good, but it's disappointing that they hype it by claiming it's going to help their basically entirely fictional "AI safety" project, as if the bits in their model are going to come alive and eat them.
replies(1): >>40446085 #
7. ben_w No.40446059
It tickles me somewhat to note that people using the phrase "stochastic parrot" are demonstrating the exact behaviour they dismiss in the LLMs.

> I am almost certain that the first conscious silicon (or whatever material) will be subjected to immense suffering until a new generation that can accept the human brain's banality can move things forward.

Indeed, though as we don't know what we're doing (and have 40 definitions of "consciousness" and no way to test for qualia), I would add that the first AI we make with these properties will likely suffer from every permutation of severe and mild mental health disorder that is logically possible, including many we have no word for because they would be incompatible with life if found in an organic brain.

8. ben_w No.40446085
We just had a pandemic made from a non-living virus that was basically trying to eat us. To riff off the quote:

The virus does not hate you, nor does it love you, but you are made of atoms which it can use for something else.

replies(1): >>40446435 #
9. astrange No.40446435
Non-living isn't a great way to describe a virus, because it certainly becomes part of a living system once it gets into your cells.

Models don't do that, though, unless you run them in a loop with tools they can call, and mostly they aren't run that way.

replies(1): >>40447305 #
10. kromem No.40446963
"Understand themselves" is a very different thing than "understand what they are saying."

Which exactly are we talking about here?

Because no, the research doesn't say much about the former, but yes, it says a lot about the latter, especially on top of the many, many earlier papers working in smaller toy models demonstrating world modeling.

11. ben_w No.40447305
> Models don't do that, though, unless you run them in a loop with tools they can call, and mostly they aren't run that way.

That's also a description of DNA and RNA. They're chemicals, not magic.

And there are loads of people all too eager to put any and every AI they find into such an environment[0], then connect it to a robot body[1], or connect it to the internet[2], just to see what happens. Or have an AI or algorithm design T-shirts[3] for them or trade stocks[4][5][6] for them, because they don't stop and think about how this might go wrong.

[0] https://community.openai.com/t/chaosgpt-an-ai-that-seeks-to-...

[1] https://www.microsoft.com/en-us/research/group/autonomous-sy...

[2] https://platform.openai.com/docs/api-reference

[3] https://www.theguardian.com/technology/2013/mar/02/amazon-wi...

[4] https://intellectia.ai/blog/chatgpt-for-stock-trading

[5] https://en.wikipedia.org/wiki/Algorithmic_trading

[6] https://en.wikipedia.org/wiki/2007–2008_financial_crisis

replies(1): >>40447420 #
12. astrange No.40447420
Those can certainly cause real problems. I just feel that to find the solutions to those problems, we have to start with real concrete issues and find the abstractions from there.

I don't think "AI safety" is the right abstraction because it came from the idea that AI would start off as an imaginary agent living in a computer that we'd teach stuff to. Whereas what we actually have is a giant pretrained blob that (unreliably) emits text when you run other text through it.

Constrained decoding (like forcing the answer to conform to JSON grammar) is an example of a real solution, and past that it's mostly the same as other software security.
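
(To make the constrained-decoding point concrete, here is a toy sketch: a made-up seven-token vocabulary and a random "model" stand in for real logits and a real JSON grammar, so everything named here is an assumption. Production systems use proper grammar-constrained samplers, but the masking step is the same in spirit.)

    import random

    # Toy vocabulary and a stand-in "model" that scores next tokens at random;
    # in a real system the scores come from the LLM and the constraint from a
    # JSON grammar.
    VOCAB = ['{', '"answer"', ':', '"yes"', '"no"', '}', 'banana']
    VALID_OUTPUTS = ['{"answer":"yes"}', '{"answer":"no"}']  # the "grammar"

    def model_scores(prefix):
        return {tok: random.random() for tok in VOCAB}

    def allowed(prefix, tok):
        # A token is legal only if the extended text can still become valid output.
        return any(v.startswith(prefix + tok) for v in VALID_OUTPUTS)

    def constrained_decode():
        out = ''
        while out not in VALID_OUTPUTS:
            scores = model_scores(out)
            legal = {t: s for t, s in scores.items() if allowed(out, t)}
            out += max(legal, key=legal.get)  # best legal token wins
        return out

    print(constrained_decode())  # always one of the two valid JSON strings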

replies(1): >>40447678 #
13. ben_w No.40447678
> I don't think "AI safety" is the right abstraction because it came from the idea that AI would start off as an imaginary agent living in a computer that we'd teach stuff to. Whereas what we actually have is a giant pretrained blob that (unreliably) emits text when you run other text through it.

I disagree; that's simply the behaviour of one of the best consumer-facing AIs, which gets all the air-time at the moment. (Weirdly, loads of people even here talk about AI like it's LLMs, even though diffusion-based image generators are also making significant progress and being targeted with lawsuits).

AI is automation — the point is to do stuff we don't want to do for whatever reason (including expense), but it does it a bit wrong. People have already died from automation that was carefully engineered but which still had mistakes; machine learning is all about letting a system engineer itself, even if you end up making a checkpoint where it's "good enough", shipping that, and telling people they don't need to train it any more… though they often will keep training it, because that's not actually hard.

We've also got plenty of agentic AI (though as that's a buzzword, bleh, lots of scammers there too), quite apart from the fact that it's very easy to use even an LLM (which is absolutely not designed or intended for this) as a general agent, just by putting it into a loop and telling it what it's supposed to be agentic about.
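
(A bare-bones sketch of that loop; call_llm and the two fake tools stand in for a real chat API and real capabilities, so every name here is hypothetical:)

    def call_llm(transcript: str) -> str:
        # A real version would send the transcript to a chat model and return
        # its next action; stubbed out so the sketch runs on its own.
        return 'FINISH: done'

    TOOLS = {
        'search': lambda arg: f'search results for {arg!r}',
        'write_file': lambda arg: f'wrote {arg!r}',
    }

    def run_agent(goal: str, max_steps: int = 10) -> str:
        transcript = f'Goal: {goal}\n'
        for _ in range(max_steps):
            action = call_llm(transcript)           # the model picks the next step
            if action.startswith('FINISH:'):
                return action[len('FINISH:'):].strip()
            tool, _, arg = action.partition(' ')    # e.g. "search tulip prices"
            observation = TOOLS.get(tool, lambda a: 'unknown tool')(arg)
            transcript += f'{action}\n{observation}\n'  # feed the result back in
        return 'gave up'

    print(run_agent('design a T-shirt'))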

Even with constrained decoding, so far as I can tell the promises are merely advertising, while the reality is that these things are only "pretty good": https://community.openai.com/t/how-to-get-100-valid-json-ans...

(But of course, this is a fast-moving area, so I may just be out of date even though that was only from a few months ago).

However, the "it's only pretty good" becomes "this isn't even possible" in certain domains; this is why, for example, ChatGPT has a disclaimer on the front about not trusting it — there's no way to know, in general, if it's just plain wrong. Which is fine when writing a newspaper column because the Gell-Mann amnesia effect says it was already like that… but not when it's being tasked with anything critical.

Hopefully nobody will use ChatGPT to plan an economy, but the point of automation is to do things for us, so some future AI will almost certainly get used that way. Just as a toy model (because it's late here and I'm tired), imagine if that future AI decides to drop everything and invest only in rice and tulips 0.001% of the time. After all, if it's just as smart as a human, and humans made that mistake…

But on the "what about humans" perspective, you can also look at the environment. I'd say there are no evil moustache-twirling villains who like polluting the world; of course there genuinely are people who do that "to own the libs", but they are not the main source of pollution in the world. Mostly it's people making decisions that seem sensible to them and yet which collectively damage the commons. There's plenty of reason to expect an AI to do something that "seems sensible" to its owner and yet damages the commons, even if the human is paying attention, which they're probably not doing, for the same reason 3M shareholders probably weren't looking very closely at what 3M was doing: "these people are maximising my dividend payments… why is my blood full of microplastics?"