Alignment is capability

(www.off-policy.com)

106 points drctnlly_crrct | 4 comments | 08 Dec 25 13:23 UTC | HN request time: 0s | source

Show context

ctoth ◴[08 Dec 25 16:23 UTC] No.46194189[source]▶

>>46191933 (OP) #

This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

replies(6): >>46194272 #>>46194444 #>>46194721 #>>46195934 #>>46196134 #>>46200878 #

delichon ◴[08 Dec 25 16:29 UTC] No.46194272[source]▶

>>46194189 #

> goal-stability [is] useful for almost any objective

  “I think AI has the potential to create infinitely stable dictatorships.” -- Ilya Sutskever

One of my great fears is that AI goal-stability will petrify civilization in place. Is alignment with unwise goals less dangerous than misalignment?

replies(3): >>46194395 #>>46194511 #>>46196142 #

fellowniusmonk ◴[08 Dec 25 16:47 UTC] No.46194511[source]▶

>>46194272 #

An objective and grounded ethical framework that applies to all agents should be a top priority.

Philosophy has been too damn anthropocentric, too hung up on consciousness and other speculative nerd snipe time wasters that without observation we can argue about endlessly.

And now here we are and the academy is sleeping on the job while software devs have to figure it all out.

I've moved 50% of my time to morals for machina that is grounded in physics, I'm testing it out with unsloth right now, so far I think it works, the machines have stopped killing kyle at least.

replies(5): >>46194664 #>>46194848 #>>46194871 #>>46194890 #>>46198697 #

1. delichon ◴[08 Dec 25 17:14 UTC] No.46194871{3}[source]▶

>>46194511 #

> morals for machina that is grounded in physics

That is fascinating. How could that work? It seems to be in conflict with the idea that values are inherently subjective. Would you start with the proposition that the laws of thermodynamics are "good" in some sense? Maybe hard code in a value judgement about order versus disorder?

That approach would seem to rule out machina morals that have preferential alignment with homo sapiens.

replies(1): >>46195323 #

2. fellowniusmonk ◴[08 Dec 25 17:47 UTC] No.46195323[source]▶

>>46194871 (TP) #

One would think. That's what I suspected when I started down the path but no, quite the opposite.

machines and man can share the same moral substrate it turns out. If either party wants to build things on top of it they can, the floor is maximally skeptical, deconstructed and empirical, it doesn't care to say anything about whatever arbitrary metaphysic you want to have on top unless there is a direct conflict in a very narrow band.

replies(1): >>46195550 #

3. delichon ◴[08 Dec 25 18:04 UTC] No.46195550[source]▶

>>46195323 #

That band is the overlap in any resource valuable to both. How can you be confident that it will be narrow? For instance why couldn't machines put a high value on paperclips relative to organic sentience?

replies(1): >>46196330 #

4. fellowniusmonk ◴[08 Dec 25 19:10 UTC] No.46196330{3}[source]▶

>>46195550 #

Yes. The answers to those questions fell out once I decomposed the problem to types of mereological nihilism and solipsistic environments.

An empirical, existential grounding that binds agents under the most hostile ontologies is required. You have to start with facts that cannot be coherently denied and on the balance I now suspect there may be only one of those.

↑