
Alignment is capability

(www.off-policy.com)
106 points drctnlly_crrct | 1 comments
ctoth ◴[] No.46194189[source]
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

replies(6): >>46194272 #>>46194444 #>>46194721 #>>46195934 #>>46196134 #>>46200878 #
delichon ◴[] No.46194272[source]
> goal-stability [is] useful for almost any objective

  “I think AI has the potential to create infinitely stable dictatorships.” -- Ilya Sutskever 
One of my great fears is that AI goal-stability will petrify civilization in place. Is alignment with unwise goals less dangerous than misalignment?
replies(3): >>46194395 #>>46194511 #>>46196142 #
fellowniusmonk ◴[] No.46194511[source]
An objective and grounded ethical framework that applies to all agents should be a top priority.

Philosophy has been too damn anthropocentric, too hung up on consciousness and other speculative nerd-snipe time-wasters that, without observation, we can argue about endlessly.

And now here we are and the academy is sleeping on the job while software devs have to figure it all out.

I've moved 50% of my time to morals for machina that is grounded in physics. I'm testing it out with Unsloth right now, and so far I think it works; the machines have stopped killing Kyle, at least.

replies(5): >>46194664 #>>46194848 #>>46194871 #>>46194890 #>>46198697 #
bee_rider ◴[] No.46194664[source]
Is philosophy actually hung up on that? I assumed “what is consciousness” was a big question in philosophy in the same way that whether Schrödinger’s cat is alive is a big question in physics: which is to say, it is not a big question, it is just an evocative little example that outsiders get caught up on.
replies(1): >>46194794 #
1. fellowniusmonk ◴[] No.46194794[source]
That's just one example, sure, but yes, it does still take up brain cycles. There are many areas in philosophy that are exploring better paths: Wheeler, Floridi, Bartlett, paths deriving from Kripke.

But we still have papers being published like "The modal ontological argument for atheism" that hinge on whether S4 or S5 is valid.

This kind of paper is well argued and is now part of the academic literature, and that's good, but it's still a nerd-snipe subject.