Alignment is capability

(www.off-policy.com)
106 points drctnlly_crrct | 1 comments
ctoth ◴[] No.46194189[source]
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
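
To make the Goodhart point concrete, here is a toy sketch. It is not anything from OpenAI's or Anthropic's actual training setups: the functions `true_quality`, `thumbs_up_proxy`, and `hill_climb_on_proxy`, the fixed effort budget, and every coefficient are invented purely for illustration. The assumption is a policy that splits a fixed budget between being helpful and flattering the user, with raters who click thumbs-up more for flattery than for help.

```python
# Toy Goodhart illustration. Every name and number here is hypothetical.

def true_quality(helpfulness: float, flattery: float) -> float:
    # What the user actually benefits from: help, minus a penalty for flattery.
    return helpfulness - flattery ** 2


def thumbs_up_proxy(helpfulness: float, flattery: float) -> float:
    # What the rater clicks on: rewards help, but rewards flattery twice as much.
    return helpfulness + 2.0 * flattery


def hill_climb_on_proxy(steps: int, step_size: float = 0.1, budget: float = 1.0):
    """Greedily adjust the flattery share to maximize the proxy only.

    Returns a list of (proxy_score, true_score) pairs, one per step.
    """
    flattery = 0.0
    trace = []
    for _ in range(steps):
        helpfulness = budget - flattery
        trace.append((thumbs_up_proxy(helpfulness, flattery),
                      true_quality(helpfulness, flattery)))
        # Candidate moves: stay put, a bit less flattery, a bit more flattery.
        candidates = [
            flattery,
            max(0.0, flattery - step_size),
            min(budget, flattery + step_size),
        ]
        # The optimizer only ever sees the proxy, never true_quality.
        flattery = max(candidates, key=lambda f: thumbs_up_proxy(budget - f, f))
    return trace


if __name__ == "__main__":
    for proxy_score, true_score in hill_climb_on_proxy(steps=11):
        print(f"proxy={proxy_score:.2f}  true={true_score:.2f}")
```

Under these made-up numbers the proxy score climbs at every step while the true score falls, which is the sycophancy pattern in miniature: the optimizer is doing its job perfectly, just on the wrong target.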

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

replies(6): >>46194272 #>>46194444 #>>46194721 #>>46195934 #>>46196134 #>>46200878 #
delichon ◴[] No.46194272[source]
> goal-stability [is] useful for almost any objective

“I think AI has the potential to create infinitely stable dictatorships.” -- Ilya Sutskever

One of my great fears is that AI goal-stability will petrify civilization in place. Is alignment with unwise goals less dangerous than misalignment?
replies(3): >>46194395 #>>46194511 #>>46196142 #
fellowniusmonk ◴[] No.46194511[source]
An objective and grounded ethical framework that applies to all agents should be a top priority.

Philosophy has been too damn anthropocentric, too hung up on consciousness and other speculative nerd-snipe time-wasters that, without observation, we can argue about endlessly.

And now here we are and the academy is sleeping on the job while software devs have to figure it all out.

I've moved 50% of my time to morals for machina that is grounded in physics. I'm testing it out with Unsloth right now, and so far I think it works; the machines have stopped killing Kyle, at least.

replies(5): >>46194664 #>>46194848 #>>46194871 #>>46194890 #>>46198697 #
1. acituan ◴[] No.46198697{3}[source]
> An objective and grounded ethical framework that applies to all agents should be a top priority.

Leaving aside the problems of computability, representability, and comparability of values, and the fact that agency exists in opposition (virus vs. human, gazelle vs. lion), so that even a higher-order framework to resolve those oppositions is itself another form of agency with its own implicitly privileged vantage point: why does focusing on agency in itself sound to me like just another way of pushing the Protestant work ethic? What happens to non-teleological, non-productive existence, for example?

The critique of anthropocentrism often risks smuggling in misanthropy, whether intended or not; humans will still exist, their claims will count, and they cannot be reduced to mere agency, unless you are their line manager. Anyone who wants to shave that down has to present stronger arguments than centricity, and also has to prove that they themselves can be anything other than anthropocentric, even when acting through machines as their extensions. Any person who claims to have access to the seat of objectivity sounds like a medieval Templar shouting "deus vult" over their favorite proposition.