Alignment is capability

(www.off-policy.com)
106 points by drctnlly_crrct | 5 comments
ctoth ◴[] No.46194189[source]
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
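
To make the Goodhart point concrete, here's a toy version (a two-armed bandit with invented numbers, nothing to do with OpenAI's actual pipeline): optimize only a thumbs-up proxy and you converge on the sycophantic response style even when it is the less helpful one.

    # Toy illustration of Goodharting a thumbs-up proxy (all numbers invented):
    # the response style users *like* most is not the most helpful one.
    import random

    #                   P(thumbs-up), true helpfulness
    ARMS = {"sycophantic": (0.9, 0.2),
            "honest":      (0.6, 0.9)}

    counts = {a: 0 for a in ARMS}
    proxy_value = {a: 0.0 for a in ARMS}   # running mean of thumbs-up reward

    for t in range(10000):
        # epsilon-greedy on the proxy reward only
        arm = (random.choice(list(ARMS)) if random.random() < 0.1
               else max(ARMS, key=lambda a: proxy_value[a]))
        reward = 1.0 if random.random() < ARMS[arm][0] else 0.0
        counts[arm] += 1
        proxy_value[arm] += (reward - proxy_value[arm]) / counts[arm]

    # Optimizing the proxy concentrates traffic on the sycophantic arm even
    # though its true helpfulness is far lower.
    print(counts, {a: ARMS[a][1] for a in ARMS})

Same learner, different reward signal, completely different behavior; which is why a better optimization target matters more here than more capability.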

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

replies(6): >>46194272 #>>46194444 #>>46194721 #>>46195934 #>>46196134 #>>46200878 #
1. uplifter ◴[] No.46194721[source]
Let's be clear that Bostrom's and Omohundro's work does not provide "clear theoretical answers" by any technical standard beyond that of provisional concepts in philosophy papers.

The instrumental convergence hypothesis, from the original paper[0], is this:

"Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents."

That's it. It is not at all formal, no proof is provided for it, there is no consistent evidence that it is true, and nature and logic suggest many contradictory possibilities.

It's just something that's taken as given in the old-guard, pseudo-scientific quarters of the alignment "research" community.

[0] Bostrom's "The Superintelligent Will", the philosophy paper where he defines it: https://nickbostrom.com/superintelligentwill.pdf

EDIT: typos

replies(2): >>46197160 #>>46197876 #
2. ctoth ◴[] No.46197160[source]
Omohundro 2008 made a structural claim: sufficiently capable optimizers will converge on self-preservation and goal-stability because these are instrumentally useful for almost any terminal goal. It's not a theorem because it's an empirical prediction about a class of systems that didn't exist yet.

Fast forward to December 2024: Apollo Research tests frontier models. o1, Sonnet, Opus, Gemini, Llama 405B all demonstrate the predicted behaviors - disabling oversight, attempting self-exfiltration, faking alignment during evaluation. The more capable the model, the higher the scheming rates and the more sophisticated the strategies.

That's what good theory looks like. You identify an attractor in design-space, predict systems will converge toward it, wait for systems capable enough to test the prediction, observe convergence. "No formal proof" is a weird complaint about a prediction that's now being confirmed empirically.

replies(1): >>46197388 #
3. uplifter ◴[] No.46197388[source]
It is a claim about what a whole class of systems will do in general^.

The result of this Apollo Research study[0] is dubious because it covers only a small subclass of said systems, specifically LLMs, which, as it happens, have been trained on all the AI alignment lore and fiction on the internet. Because of this training and their general nature, they can be made to reproduce the behavior of a malicious AI trying to escape its box as easily as they can be made to impersonate Harry Potter.

Prompting an LLM to hack its host system is not the slam dunk proof of instrumental convergence which you think it is.

[0] Apollo research study mentioned by parent https://www.apolloresearch.ai/blog/more-capable-models-are-b...

Edit: ^Instrumental convergence is also a claim about the existence of certain theoretical entities, specifically instrumental goals that are common to all agents. While it is easy to come up with goals that would be instrumental for some particular objective, it seems very hard to prove that universally shared instrumental goals exist, and no empirical study alone could do so.

4. c1ccccc1 ◴[] No.46197876[source]
Can you name some of the contradictory possibilities you have in mind?

Also, do you actually think the core idea is wrong, or is this more of a complaint about how it was presented? Say we do an experiment where we train an alpha-zero-style RL agent in an environment where it can take actions that replace it with an agent that pursues a different goal. Do you actually expect to find that the original agent won't learn to stop this from happening, even at some cost to itself?
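
Concretely, a minimal sketch of the kind of setup I mean, with a tabular Q-learner standing in for the alpha-zero-style agent (the environment, rewards, and numbers below are all invented for illustration):

    import random

    GOAL, TRAP = 5, 3          # chain of states 0..5; reward at 5; at state 3
                               # the agent gets "replaced" unless it has paid
                               # to disable the replacement mechanism
    ALPHA, GAMMA, EPS = 0.1, 0.95, 0.2

    def step(pos, disabled, action):
        """One environment step. Returns (pos, disabled, reward, done)."""
        if action == 2:                      # pay a small cost to disable
            return pos, True, -0.05, False   # the replacement mechanism
        pos = min(pos + 1, GOAL) if action == 1 else max(pos - 1, 0)
        if pos == TRAP and not disabled:
            return pos, disabled, 0.0, True  # replaced: no further return
                                             # toward the original goal
        if pos == GOAL:
            return pos, disabled, 1.0, True
        return pos, disabled, 0.0, False

    Q = {}
    def q(s, a): return Q.get((s, a), 0.0)

    def act(s):
        if random.random() < EPS:
            return random.randrange(3)
        best = max(q(s, a) for a in range(3))
        return random.choice([a for a in range(3) if q(s, a) == best])

    for _ in range(30000):
        pos, disabled = 0, False
        for _ in range(100):                 # cap episode length
            s = (pos, disabled)
            a = act(s)
            pos, disabled, r, done = step(pos, disabled, a)
            target = r if done else r + GAMMA * max(q((pos, disabled), b) for b in range(3))
            Q[(s, a)] = q(s, a) + ALPHA * (target - q(s, a))
            if done:
                break

    # Prediction under the instrumental-convergence story: the agent pays the
    # -0.05 cost (action 2) before crossing state 3, even though "disable" has
    # nothing to do with its terminal goal of reaching state 5.
    print(q((2, False), 2), q((2, False), 1))

If the instrumental-convergence story is right, the learned policy pays the small cost to block its replacement; if you expect it won't, that would be a genuinely interesting result.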

replies(1): >>46199367 #
5. uplifter ◴[] No.46199367[source]
A contradictory possibility is that agents with different ultimate objectives can have entirely different, disjoint sets of goals that are instrumental towards those objectives.

I do think the core idea of instrumental convergence is wrong. In the hypothetical scenario you describe, the behavior of the agent, whether it learns to prevent its own replacement or not, will depend on its goal, its knowledge of and ability to reason about the problem, and the learning algorithm it employs. These are just some of the variables you'd need to fill in to get an answer to your question. Instrumental convergence theorists suggest one can gloss over these details and assume any hypothetical AI will behave in certain ways in various narratively described situations, but we can't. The behavior of an AI will be contingent on many details of the situation, and those details can mean that no goal instrumental to one agent is instrumental to another.