> I don't think "AI safety" is the right abstraction because it came from the idea that AI would start off as an imaginary agent living in a computer that we'd teach stuff to. Whereas what we actually have is a giant pretrained blob that (unreliably) emits text when you run other text through it.
I disagree; that's simply the behaviour of one of the best consumer-facing AIs, the one getting all the air-time at the moment. (Weirdly, loads of people even here talk about AI as if it's just LLMs, even though diffusion-based image generators are also making significant progress and being targeted with lawsuits.)
AI is automation: the point is to do stuff we don't want to do for whatever reason (including expense), except it does that stuff a bit wrong. People have already died from automation that was carefully engineered and still had mistakes in it; machine learning is all about letting a system engineer itself. Even if you end up making a checkpoint where it's "good enough", shipping that, and telling people they don't need to train it any more… they often will keep training it anyway, because that's not actually hard.
We've also got plenty of agentic AI (though since that's a buzzword, bleh, there are lots of scammers there too), independently of the fact that it's very easy to use even an LLM (which is absolutely not designed or intended for this) as a general agent, just by putting it in a loop and telling it what it's supposed to be agentic about.
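Just to make the "loop" point concrete, here's a minimal sketch of that pattern; `call_llm` and the toy tools are hypothetical stand-ins, not any particular vendor's API:

```python
# Sketch of "LLM as agent" by brute force: a loop, a goal, and some tools.
# `call_llm` is a placeholder for whatever chat-completion API you're using.
import json

def call_llm(messages):
    """Placeholder: send the conversation to some chat model, return its reply text."""
    raise NotImplementedError

TOOLS = {
    "search": lambda query: f"(pretend search results for {query!r})",
}

def run_agent(goal, max_steps=10):
    messages = [
        {"role": "system", "content":
            'You are an agent. Reply only with JSON like '
            '{"tool": "search", "argument": "..."} or {"tool": "done", "argument": "<answer>"}.'},
        {"role": "user", "content": goal},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            action = json.loads(reply)
            tool, arg = action["tool"], action["argument"]
        except (json.JSONDecodeError, KeyError, TypeError):
            messages.append({"role": "user", "content": "That wasn't valid JSON; try again."})
            continue
        if tool == "done":
            return arg                      # the model decided it's finished
        observation = TOOLS.get(tool, lambda a: "unknown tool")(arg)
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return None                             # gave up after max_steps
```

That's the whole mechanism: the "agency" lives in the loop and the prompt, not in anything the model was designed for.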
Even with constrained decoding, so far as I can tell the promises are mostly advertising, while the reality is that these things are only "pretty good": https://community.openai.com/t/how-to-get-100-valid-json-ans...
(But of course, this is a fast-moving area, so I may just be out of date even though that was only from a few months ago).
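For what it's worth, the idea behind constrained decoding itself is sound: at every generation step you mask out any token that would break the target grammar, so the output can't help but parse. Here's a toy illustration of the mechanism with a fake model and a "grammar" that only allows digits; real JSON-schema guidance tracks a proper grammar state machine, but the masking step is the same idea:

```python
# Toy constrained decoding: mask tokens that would violate the constraint,
# then sample from whatever probability mass is left.
import numpy as np

VOCAB = list("0123456789abcdefghij")   # pretend token vocabulary
ALLOWED = set("0123456789")            # the "grammar": digits only

def fake_model_logits(prefix):
    """Placeholder for a real model; returns arbitrary logits over VOCAB."""
    return np.random.default_rng(len(prefix)).normal(size=len(VOCAB))

def constrained_sample(prefix):
    logits = fake_model_logits(prefix)
    for i, tok in enumerate(VOCAB):
        if tok not in ALLOWED:
            logits[i] = -np.inf            # forbid grammar-breaking tokens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.default_rng().choice(VOCAB, p=probs)

out = ""
for _ in range(8):
    out += constrained_sample(out)
print(out)   # always digits, whatever the model "wanted" to say
```

The guarantee is only syntactic, though: the JSON will parse, but nothing stops the contents from being nonsense, and per that thread even the syntactic side hasn't always matched the marketing.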
However, the "it's only pretty good" becomes "this isn't even possible" in certain domains; this is why, for example, ChatGPT has a disclaimer on the front about not trusting it — there's no way to know, in general, if it's just plain wrong. Which is fine when writing a newspaper column because the Gell-Mann amnesia effect says it was already like that… but not when it's being tasked with anything critical.
Hopefully nobody will use ChatGPT to plan an economy, but the point of automation is to do things for us, so some future AI will almost certainly get used that way. Just as a toy model (because it's late here and I'm tired), imagine if that future AI decides to drop everything and invest only in rice and tulips 0.001% of the time. After all, if it's just as smart as a human, and humans made that mistake…
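Even at that failure rate the arithmetic gets ugly fast once you automate at scale; a back-of-the-envelope check (the decision volume is a number I'm making up purely for illustration):

```python
# How often does a 0.001% failure rate bite, at automation scale?
failure_rate = 0.001 / 100         # 0.001% as a probability (1e-5)
decisions_per_day = 1_000_000      # made-up volume for an economy-scale planner
print(failure_rate * decisions_per_day)   # 10.0 -> ten "rice and tulips" moments per day
```

A blunder a human planner might make once in a career becomes a routine event once the same judgement gets exercised a million times a day.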
But on the "what about humans" perspective, you can also look at the environment. I'd say there's no evil moustache twirling villains who like polluting the world, but of course there are genuinely people who do that "to own the libs"; but these are not the main source of pollution in the world, mostly it's people making decisions that seem sensible to them and yet which collectively damage the commons. Plenty of reason to expect an AI to do something that "seems sensible" to its owner, which damages the commons, even if the human is paying attention, which they're probably not doing for the same reason M3 shareholders probably weren't looking very closely to what M3 was doing — "these people are maximising my dividend payments… why is my blood full of microplastics?"