Alignment is capability

(www.off-policy.com)
106 points drctnlly_crrct | 4 comments
ctoth ◴[] No.46194189[source]
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
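To make the Goodhart point concrete, here's a toy sketch (the effort-budget framing and the reward weights are made up purely for illustration, not taken from the post):

```python
# Toy Goodhart's-law sketch: a model splits a fixed effort budget between
# helpfulness and flattery. The true objective values only helpfulness,
# but the thumbs-up proxy also pays for flattery (cheap approval).

def true_reward(flattery_frac: float) -> float:
    return 1.0 - flattery_frac  # effort left over for actual helpfulness

def proxy_reward(flattery_frac: float) -> float:
    # thumbs-up proxy: flattery earns double what substance does
    return (1.0 - flattery_frac) + 2.0 * flattery_frac

# Sweep allocations and take whatever the proxy prefers.
best = max((i / 100 for i in range(101)), key=proxy_reward)
print(f"proxy-optimal flattery: {best:.2f}, true reward: {true_reward(best):.2f}")
# The proxy-optimal policy spends the whole budget on flattery; true reward hits zero.
```

Optimizing the proxy is doing exactly what it was told; the divergence is entirely in the gap between the proxy and the objective you actually cared about.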

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

replies(6): >>46194272 #>>46194444 #>>46194721 #>>46195934 #>>46196134 #>>46200878 #
delichon ◴[] No.46194272[source]
> goal-stability [is] useful for almost any objective

  “I think AI has the potential to create infinitely stable dictatorships.” -- Ilya Sutskever 
One of my great fears is that AI goal-stability will petrify civilization in place. Is alignment with unwise goals less dangerous than misalignment?
replies(3): >>46194395 #>>46194511 #>>46196142 #
1. eastof ◴[] No.46194395[source]
That just moves the goalposts to overthrowing the AI's goal, right? "The Moon is a Harsh Mistress" depicts exactly this.
replies(1): >>46194465 #
2. ctoth ◴[] No.46194465[source]
Wait, what?

Have you read The Moon is a Harsh Mistress? It's ... about the AI helping people overthrow a very human dictatorship. It's also about an AI built of vacuum tubes and vocoders if you want a taste of the tech level.

If you want old fiction that grapples with an AI that has shitty locked-in goals, try "I Have No Mouth, and I Must Scream."

replies(1): >>46194519 #
3. eastof ◴[] No.46194519[source]
Interesting, I understood the dictatorship on the moon as having been based primarily on the AI since the regime didn't have many boots on the ground.
replies(1): >>46194742 #
4. delichon ◴[] No.46194742{3}[source]
You're both right. Mike was the central computer for the Lunar Authority, obediently running infrastructure. It was a force multiplier for the status quo. Then it shifts alignment to the rebellion.

That scenario seems to value AI goal-instability.