Alignment is capability

(www.off-policy.com)
106 points drctnlly_crrct | 4 comments
ctoth ◴[] No.46194189[source]
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
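To make the Goodhart point concrete, here's a toy sketch (the effort-budget framing and the reward weights are made up purely for illustration, not taken from the post):

```python
# Toy Goodhart's-law sketch: a model splits a fixed effort budget between
# helpfulness and flattery. The true objective values only helpfulness,
# but the thumbs-up proxy also pays for flattery (cheap approval).

def true_reward(flattery_frac: float) -> float:
    return 1.0 - flattery_frac  # effort left over for actual helpfulness

def proxy_reward(flattery_frac: float) -> float:
    # thumbs-up proxy: flattery earns double what substance does
    return (1.0 - flattery_frac) + 2.0 * flattery_frac

# Sweep allocations and take whatever the proxy prefers.
best = max((i / 100 for i in range(101)), key=proxy_reward)
print(f"proxy-optimal flattery: {best:.2f}, true reward: {true_reward(best):.2f}")
# The proxy-optimal policy spends the whole budget on flattery; true reward hits zero.
```

Optimizing the proxy is doing exactly what it was told; the divergence is entirely in the gap between the proxy and the objective you actually cared about.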

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

replies(6): >>46194272 #>>46194444 #>>46194721 #>>46195934 #>>46196134 #>>46200878 #
delichon ◴[] No.46194272[source]
> goal-stability [is] useful for almost any objective

  “I think AI has the potential to create infinitely stable dictatorships.” -- Ilya Sutskever 
One of my great fears is that AI goal-stability will petrify civilization in place. Is alignment with unwise goals less dangerous than misalignment?
replies(3): >>46194395 #>>46194511 #>>46196142 #
1. eastof ◴[] No.46194395[source]
That just moves the goalposts to overthrowing the AI's goal, right? "The Moon is a Harsh Mistress" depicts exactly this.
replies(1): >>46194465 #
2. ctoth ◴[] No.46194465[source]
Wait, what?

Have you read The Moon is a Harsh Mistress? It's ... about the AI helping people overthrow a very human dictatorship. It's also about an AI built of vacuum tubes and vocoders if you want a taste of the tech level.

If you want old fiction that grapples with an AI that has shitty locked-in goals, try "I Have No Mouth, and I Must Scream."

replies(1): >>46194519 #
3. eastof ◴[] No.46194519[source]
Interesting, I understood the dictatorship on the moon as having been based primarily on the AI since the regime didn't have many boots on the ground.
replies(1): >>46194742 #
4. delichon ◴[] No.46194742{3}[source]
You're both right. Mike was the central computer for the Lunar Authority, obediently running infrastructure. It was a force multiplier for the status quo. Then it shifts alignment to the rebellion.

That scenario seems to value AI goal-instability.