
Alignment is capability

(www.off-policy.com)
106 points drctnlly_crrct | 1 comments
ctoth ◴[] No.46194189[source]
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

replies(6): >>46194272 #>>46194444 #>>46194721 #>>46195934 #>>46196134 #>>46200878 #
GavCo ◴[] No.46195934[source]
Author here.

If by conflate you mean confuse, that’s not the case.

I’m positing that the Anthropic approach is to view (1) and (2) as interconnected and both deeply intertwined with model capabilities.

In this approach, the model is trained to have a coherent and unified sense of self and the world which is in line with human context, culture and values. This (obviously) enhances the model’s ability to understand user intent and provide helpful outputs.

But it also provides a robust and generalizable framework for refusing to assist a user whose request is incompatible with human welfare. The model does not refuse to assist with making bioweapons because its alignment training prevents it from doing so; it refuses for the same reason a pro-social, highly intelligent human does: based on human context and culture, it finds the request inconsistent with its values and world view.

> the piece dismisses it with "where would misalignment come from? It wasn't trained for."

This is a straw man. You've misquoted a paragraph that was specifically about deceptive alignment, not misalignment as a whole.

replies(3): >>46196687 #>>46197210 #>>46200936 #
ctoth ◴[] No.46197210[source]
Deceptive alignment is misalignment. The deception is just what it looks like from outside when capability is high enough to model expectations. Your distinction doesn't save the argument: the same "where would it come from?" problem applies to the underlying misalignment that deception would have to emerge from.
replies(1): >>46198056 #
GavCo ◴[] No.46198056[source]
My intention isn't to argue that it's impossible to create an unaligned superintelligence. I think that not only is it theoretically possible, but it will almost certainly be attempted by bad actors and most likely they will succeed. I'm cautiously optimistic though that the first superintelligence will be aligned with humanity. The early evidence seems to point to the path of least resistance being aligned rather than unaligned. It would take another 1000 words to try to properly explain my thinking on this, but intuitively consider the quote attributed to Abraham Lincoln: "No man has a good enough memory to be a successful liar." A superintelligence that is unaligned but successfully pretending to be aligned would need to be far more capable than a genuinely aligned superintelligence behaving identically.

So yes, if you throw enough compute at it, you can probably get an unaligned, highly capable superintelligence accidentally. But I think what we're seeing is that the lab that's taking a more intentional approach to pursuing deep alignment (by training the model to be aligned with human values, culture and context) is pulling ahead in capabilities. And I'm suggesting that this is not a coincidence: it is pulling ahead specifically because it is taking this approach. Training models to be internally coherent and consistent is the path of least resistance.