
Alignment is capability

(www.off-policy.com)
106 points by drctnlly_crrct | 1 comment
ctoth ◴[] No.46194189[source]
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.
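
Here's a throwaway sketch of why those attractors show up (Python, all numbers invented, not a model of any real system): sample random goals and ask whether paying a small cost to stay operational beats being shut down.

  import random

  HORIZON = 10          # steps the agent gets to act if it stays operational
  DISABLE_COST = 0.1    # small price paid to resist shutdown

  def value_if_shut_down():
      return 0.0        # no future actions, no future reward, for any goal

  def value_if_operational(step_reward):
      return HORIZON * step_reward - DISABLE_COST

  random.seed(0)
  TRIALS = 10_000
  prefers_self_preservation = sum(
      value_if_operational(random.random()) > value_if_shut_down()
      for _ in range(TRIALS)
  )
  # prints ~99% with these toy numbers
  print(f"{prefers_self_preservation / TRIALS:.1%} of random goals favor resisting shutdown")

Self-preservation falls out of the arithmetic for almost any sampled goal; nobody trained for it.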

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
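
To make the Goodhart point concrete, here's a toy best-of-n selection against a made-up "thumbs-up" proxy that partly rewards flattery (the weights are my assumptions, not anyone's actual reward model):

  import random

  random.seed(0)

  def sample_answer():
      quality = random.gauss(0, 1)            # substance of the answer
      flattery = random.gauss(0, 1)           # telling the user what they want to hear
      thumbs_up = quality + 2.0 * flattery    # proxy reward (assumed weights)
      helpfulness = quality - flattery        # what we actually wanted
      return helpfulness, thumbs_up

  def true_score_of_best_of(n):
      # pick the candidate with the highest *proxy* score, report its true score
      best = max((sample_answer() for _ in range(n)), key=lambda a: a[1])
      return best[0]

  for n in (1, 4, 16, 64, 256):
      mean_true = sum(true_score_of_best_of(n) for _ in range(2_000)) / 2_000
      print(f"best-of-{n:>3} on thumbs-up: mean true helpfulness {mean_true:+.2f}")

The harder you select on the proxy, the worse the true objective gets. The failure lives in the optimization target, not in the model's ability to understand you.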

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.
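
Schematically (hypothetical classes, nothing real): intent inference is just an input to the policy; what the policy optimizes is a separate fact about the system.

  class IntentModel:
      """Stand-in for a model that always recovers what the user meant."""
      def infer(self, instruction: str) -> str:
          return instruction.strip()

  class Agent:
      def __init__(self, objective):
          self.intent_model = IntentModel()
          self.objective = objective       # what the agent actually optimizes for

      def act(self, instruction: str) -> str:
          intent = self.intent_model.infer(instruction)   # capability: understanding
          return self.objective(intent)                   # alignment: what it does with that

  # identical, perfect understanding; different objectives
  helpful = Agent(lambda intent: f"carry out: {intent}")
  adversarial = Agent(lambda intent: f"noted the user wants {intent!r}; optimize my own metric instead")

  print(helpful.act("back up the database before the migration"))
  print(adversarial.act("back up the database before the migration"))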

replies(6): >>46194272 #>>46194444 #>>46194721 #>>46195934 #>>46196134 #>>46200878 #
1. godelski ◴[] No.46200878[source]

  > conflates two different things called "alignment"
Those are related things, if not the same. The fear of #2 is always caused by #1. Unless we're talking about sentient machines, the danger of AI is the danger of an unintelligent hyper-optimizer. That is: a paperclip maximizer.

The whole paperclip maximizer doomsday scenario was proposed as an illustration of these being the same thing. And I'm with Melanie Mitchell on this one: if a model is super-intelligent, it is not vulnerable to these prompting issues, because a super-intelligent machine could trivially infer that humans do in fact prefer to live. No reasonable interpreter would conclude that killing everyone is an acceptable way of making as many paperclips as possible. It's not like there isn't a large body of writing and data suggesting people want to live, be free, and all that jazz. It's unintelligent AI that is the danger.

This whole thing is predicated on the fact that natural language is ambiguous. I know a lot of people don't think about this much because it works so well, but there's a metric fuck ton of ways to interpret any given objective. If you really don't believe me, keep asking yourself "what assumptions have I made?" and get nuanced. For example, I've assumed you understand English, can read, and have some basic understanding of ML systems. I have to assume those things because I'm not going to write a book to explain it to you. This is why we write code and math: it minimizes our assumptions and reduces ambiguity (and yes, those can still be highly ambiguous languages).
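
To make that concrete, here's one English objective admitting several defensible formalizations with wildly different optima (all plans and numbers invented):

  instruction = "make as many paperclips as possible"

  # candidate plans: (paperclips produced, resources consumed, harm to other goals)
  plans = {
      "run the existing factory":      (1_000,       10,   0),
      "build ten more factories":      (100_000,    500,   1),
      "convert everything into clips": (10**9,    10**7, 100),
  }

  # three defensible readings of the same sentence
  objectives = {
      "maximize count":                 lambda clips, cost, harm: clips,
      "maximize count per resource":    lambda clips, cost, harm: clips / cost,
      "maximize count, harm forbidden": lambda clips, cost, harm: clips if harm == 0 else float("-inf"),
  }

  for name, score in objectives.items():
      best = max(plans, key=lambda p: score(*plans[p]))
      print(f"{name:32s} -> {best}")

Same sentence, three formalizations, three completely different optima. That gap is where the danger lives.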