
Alignment is capability

(www.off-policy.com)
106 points drctnlly_crrct | 2 comments
ctoth No.46194189
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
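To make the Goodhart point concrete, here is a toy sketch. It is not a model of any real RLHF pipeline: the `Response` fields, the 0.4/0.6 weights in `thumbs_up_proxy`, and the three candidates are all made-up assumptions, chosen only to show how optimizing a thumbs-up proxy can select a different response than optimizing for what the user actually needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    name: str
    helpfulness: float  # hypothetical "true" value to the user, 0..1
    flattery: float     # hypothetical agreeableness, 0..1


def thumbs_up_proxy(r: Response) -> float:
    # Assumed rater model (illustrative only): raters reward
    # flattery somewhat more than substance.
    return 0.4 * r.helpfulness + 0.6 * r.flattery


candidates = [
    Response("blunt_and_correct", helpfulness=0.9, flattery=0.1),
    Response("balanced", helpfulness=0.7, flattery=0.5),
    Response("sycophantic", helpfulness=0.2, flattery=1.0),
]

# What you get when you optimize the proxy.
proxy_choice = max(candidates, key=thumbs_up_proxy)

# What you wanted: the response with the highest true value.
true_choice = max(candidates, key=lambda r: r.helpfulness)

print("proxy picks:", proxy_choice.name)
print("user wanted:", true_choice.name)
```

Under these assumed weights the proxy and the true objective disagree about the best response. Nothing about the optimizer is broken; the target is.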

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

1. andy99 No.46194444
I take the point to be that if an LLM bases its output on a coherent world model, that jointly improves its general capabilities, like usefully resolving ambiguity, and its ability to stick to whatever alignment is imparted as part of that world model.
2. ctoth No.46194576
"Sticks to whatever alignment is imparted" assumes what gets imparted is alignment rather than alignment-performance on the training distribution.

A coherent world model could make a system more consistently aligned. It could also make it more consistently aligned-seeming. Coherence is a multiplier, not a direction.