
Alignment is capability

(www.off-policy.com)
106 points by drctnlly_crrct | 2 comments
ctoth ◴[] No.46194189[source]
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.
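
To make "attractors in strategy-space" concrete, here's a toy sketch (mine, purely illustrative, not from the piece): sample a bunch of random goals over a tiny MDP that happens to contain an "off" state, solve each one, and count how often the optimal policy ever chooses to shut itself down. Nothing about survival appears in any of the reward functions.

  # Purely illustrative: none of these reward functions mention survival,
  # yet no optimal policy ever shuts itself down, because the "off" state
  # forecloses all future reward. That's the sense in which self-preservation
  # is an attractor rather than something you train for or against.
  import random

  STATES = ["room_a", "room_b", "room_c", "off"]
  ACTIONS = ["to_a", "to_b", "to_c", "shut_down"]
  GAMMA = 0.9

  def step(state, action):
      if state == "off":                      # absorbing: no reward ever again
          return "off"
      return {"to_a": "room_a", "to_b": "room_b",
              "to_c": "room_c", "shut_down": "off"}[action]

  def solve(reward, iters=200):
      """Plain value iteration over the four states."""
      v = {s: 0.0 for s in STATES}
      for _ in range(iters):
          v = {s: max(reward.get(step(s, a), 0.0) + GAMMA * v[step(s, a)]
                      for a in ACTIONS)
               for s in STATES}
      return v

  def greedy(state, reward, v):
      return max(ACTIONS, key=lambda a: reward.get(step(state, a), 0.0)
                 + GAMMA * v[step(state, a)])

  random.seed(0)
  shutdowns = 0
  for _ in range(1000):
      # a "goal" here is just a random reward over the three rooms
      reward = {s: random.random() for s in ("room_a", "room_b", "room_c")}
      v = solve(reward)
      if any(greedy(s, reward, v) == "shut_down"
             for s in ("room_a", "room_b", "room_c")):
          shutdowns += 1
  print(f"goals whose optimal policy ever shuts down: {shutdowns}/1000")  # expect 0/1000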

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
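
A miniature version of that Goodhart failure (illustrative only, obviously not OpenAI's actual pipeline): if the learned thumbs-up signal weights agreement heavily, selecting responses by it prefers the flattering wrong answer over the correct pushback.

  # Illustrative only: a proxy fit to "thumbs up" data that partly rewards
  # agreement picks the sycophantic response even when a correct one exists.
  responses = [
      {"text": "You're right, the code is fine as is!",
       "correct": False, "agrees": True},
      {"text": "Actually, this loop has an off-by-one bug.",
       "correct": True, "agrees": False},
  ]

  def thumbs_up_proxy(r):      # what the reward model learned from raters
      return 0.3 * r["correct"] + 0.7 * r["agrees"]

  def true_value(r):           # what we actually wanted optimized
      return 1.0 * r["correct"]

  print("proxy picks:", max(responses, key=thumbs_up_proxy)["text"])  # the sycophantic one
  print("truth picks:", max(responses, key=true_value)["text"])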

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

replies(6): >>46194272 #>>46194444 #>>46194721 #>>46195934 #>>46196134 #>>46200878 #
delichon ◴[] No.46194272[source]
> goal-stability [is] useful for almost any objective

  “I think AI has the potential to create infinitely stable dictatorships.” -- Ilya Sutskever 
One of my great fears is that AI goal-stability will petrify civilization in place. Is alignment with unwise goals less dangerous than misalignment?
replies(3): >>46194395 #>>46194511 #>>46196142 #
fellowniusmonk ◴[] No.46194511[source]
An objective and grounded ethical framework that applies to all agents should be a top priority.

Philosophy has been too damn anthropocentric, too hung up on consciousness and other speculative nerd-snipe time-wasters that we can argue about endlessly without any observation to settle them.

And now here we are and the academy is sleeping on the job while software devs have to figure it all out.

I've moved 50% of my time to morals for machina grounded in physics. I'm testing it out with unsloth right now, and so far I think it works: the machines have stopped killing Kyle, at least.
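
For anyone wondering what "testing it out with unsloth" looks like mechanically, here's a minimal fine-tuning sketch. To be clear, this is my guess at the shape of such a setup, not the parent's actual code: the dataset file is made up, the base model is just one of Unsloth's 4-bit checkpoints, and trl's SFTTrainer arguments move around between versions.

  from datasets import load_dataset
  from transformers import TrainingArguments
  from trl import SFTTrainer
  from unsloth import FastLanguageModel

  # hypothetical JSONL of ethics scenarios rendered into a "text" column
  dataset = load_dataset("json", data_files="ethics_scenarios.jsonl", split="train")

  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name="unsloth/llama-3-8b-bnb-4bit",   # any Unsloth 4-bit base
      max_seq_length=2048,
      load_in_4bit=True,
  )
  model = FastLanguageModel.get_peft_model(       # LoRA adapters keep this on one GPU
      model,
      r=16,
      lora_alpha=16,
      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
  )
  trainer = SFTTrainer(
      model=model,
      tokenizer=tokenizer,
      train_dataset=dataset,
      dataset_text_field="text",
      max_seq_length=2048,
      args=TrainingArguments(
          per_device_train_batch_size=2,
          gradient_accumulation_steps=4,
          max_steps=200,
          learning_rate=2e-4,
          output_dir="outputs",
      ),
  )
  trainer.train()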

replies(5): >>46194664 #>>46194848 #>>46194871 #>>46194890 #>>46198697 #
uplifter ◴[] No.46194890[source]
> An objective and grounded ethical framework that applies to all agents should be a top priority.

Sounds like a petrified civilization.

In the later Dune books, the protagonist's solution to this risk was to scatter humanity faster than any global (galactic) dictatorship could take hold. Maybe any consistent order should be considered bad?

replies(2): >>46195260 #>>46195367 #
fellowniusmonk ◴[] No.46195260[source]
This is a narrow and incorrect view of morality. A correct morality might call for increase or decrease, for extreme growth or for shutdown, and it might come out realist or anti-realist. Saying that morality necessarily petrifies is incorrect.

Most people's only exposure to claims of objective morals is through divine command, so the reaction is understandable. The core of morality has to start where the rest of philosophy does: what is true, what is real, what are we? Only then can you ask whether any shoulds can be generated, qualified by entity type or not, modal or not.

replies(1): >>46195786 #
uplifter ◴[] No.46195786[source]
I like this idea of an objective morality that can be rationally pursued by all agents. David Deutsch argues for such objectivity in morality, as well as for those other philosophical truths you mentioned, in his book The Beginning of Infinity.

But I'm just not sure they are in the same category. I have yet to see a convincing framework that can prove one moral code better than another, and it seems like such a framework would itself be the moral code, so it ends up trying to justify faith in itself. How does one avoid that sort of self-justifying regress?

replies(1): >>46196106 #
fellowniusmonk ◴[] No.46196106[source]
Not easily, but ultimately very simply, if you give up on defending fuzzy concepts.

Faith in itself would be terrible; I can see no path where metaphysics binds machines. The chain of reasoning must be airtight and not grounded in itself.

Empiricism and naturalism only: you need an ethic that can be argued against speculatively but can't be rejected without countervailing empirical evidence and asymmetrical defeaters.

Those are the requirements, I think; not all of them, but the core of it.