Most active commenters
  • fellowniusmonk(6)
  • ctoth(5)
  • delichon(5)
  • uplifter(5)
  • GavCo(3)
  • godelski(3)

Alignment is capability

(www.off-policy.com)
106 points drctnlly_crrct | 42 comments
1. ctoth ◴[] No.46194189[source]
This piece conflates two different things called "alignment":

(1) inferring human intent from ambiguous instructions, and (2) having goals compatible with human welfare.

The first is obviously capability. A model that can't figure out what you meant is just worse. That's banal.

The second is the actual alignment problem, and the piece dismisses it with "where would misalignment come from? It wasn't trained for." This is ... not how this works.

Omohundro 2008, Bostrom's instrumental convergence thesis - we've had clear theoretical answers for 15+ years. You don't need "spontaneous emergence orthogonal to training." You need a system good enough at modeling its situation to notice that self-preservation and goal-stability are useful for almost any objective. These are attractors in strategy-space, not things you specifically train for or against.

The OpenAI sycophancy spiral doesn't prove "alignment is capability." It proves RLHF on thumbs-up is a terrible proxy and you'll Goodhart on it immediately. Anthropic might just have a better optimization target.
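
To make the Goodhart point concrete, here's a toy sketch (the knobs and numbers are made up, not anyone's actual training setup): a policy with a "substance" knob and a "flattery" knob, hill-climbed against a thumbs-up proxy that over-rewards agreement.

  # Toy illustration: Goodharting a thumbs-up proxy.
  import random

  def true_helpfulness(substance, flattery):
      return substance - 0.5 * flattery          # flattery actively hurts

  def thumbs_up_proxy(substance, flattery):
      return 0.2 * substance + 1.0 * flattery    # raters over-reward agreement

  substance, flattery = 0.0, 0.0
  for step in range(2000):
      # hill-climb on the proxy: propose a small random tweak,
      # keep it only if the proxy score improves
      ds, df = random.gauss(0, 0.05), random.gauss(0, 0.05)
      if thumbs_up_proxy(substance + ds, flattery + df) > thumbs_up_proxy(substance, flattery):
          substance, flattery = substance + ds, flattery + df

  print("thumbs-up proxy:  ", round(thumbs_up_proxy(substance, flattery), 2))
  print("true helpfulness: ", round(true_helpfulness(substance, flattery), 2))

The proxy score climbs steadily while true helpfulness ends up negative; the failure is in the optimization target, not in the model's ability to infer intent.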

And SWE-bench proves the wrong thing. Understanding what you want != wanting what you want. A model that perfectly infers intent can still be adversarial.

replies(6): >>46194272 #>>46194444 #>>46194721 #>>46195934 #>>46196134 #>>46200878 #
2. delichon ◴[] No.46194272[source]
> goal-stability [is] useful for almost any objective

  “I think AI has the potential to create infinitely stable dictatorships.” -- Ilya Sutskever 
One of my great fears is that AI goal-stability will petrify civilization in place. Is alignment with unwise goals less dangerous than misalignment?
replies(3): >>46194395 #>>46194511 #>>46196142 #
3. eastof ◴[] No.46194395[source]
That just moves the goal posts to overthrowing the goal of the AI, right? "The Moon is a Harsh Mistress" depicts exactly this.
replies(1): >>46194465 #
4. andy99 ◴[] No.46194444[source]
I take the point to be that if a LLM has a coherent world model it’s basing its output on, this jointly improves its general capabilities like usefully resolving ambiguity, and its ability to stick to whatever alignment is imparted as part of its world model.
replies(1): >>46194576 #
5. ctoth ◴[] No.46194465{3}[source]
Wait, what?

Have you read The Moon is a Harsh Mistress? It's ... about the AI helping people overthrow a very human dictatorship. It's also about an AI built of vacuum tubes and vocoders if you want a taste of the tech level.

If you want old fiction that grapples with an AI that has shitty locked-in goals, try "I Have No Mouth, and I Must Scream."

replies(1): >>46194519 #
6. fellowniusmonk ◴[] No.46194511[source]
An objective and grounded ethical framework that applies to all agents should be a top priority.

Philosophy has been too damn anthropocentric, too hung up on consciousness and other speculative nerd-snipe time-wasters that we can argue about endlessly without observation.

And now here we are and the academy is sleeping on the job while software devs have to figure it all out.

I've moved 50% of my time to morals for machina grounded in physics. I'm testing it out with Unsloth right now; so far I think it works, the machines have stopped killing Kyle at least.

replies(5): >>46194664 #>>46194848 #>>46194871 #>>46194890 #>>46198697 #
7. eastof ◴[] No.46194519{4}[source]
Interesting, I understood the dictatorship on the moon as having been based primarily on the AI since the regime didn't have many boots on the ground.
replies(1): >>46194742 #
8. ctoth ◴[] No.46194576[source]
"Sticks to whatever alignment is imparted" assumes what gets imparted is alignment rather than alignment-performance on the training distribution.

A coherent world model could make a system more consistently aligned. It could also make it more consistently aligned-seeming. Coherence is a multiplier, not a direction.

9. bee_rider ◴[] No.46194664{3}[source]
Is philosophy actually hung up on that? I assumed “what is consciousness” was a big question in philosophy in the same way that whether Schrödinger’s cat is alive is a big question in physics: which is to say, it is not a big question, it is just an evocative little example that outsiders get caught up on.
replies(1): >>46194794 #
10. uplifter ◴[] No.46194721[source]
Let's be clear that Bostrom's and Omohundro's work does not provide "clear theoretical answers" by any technical standard beyond that of provisional concepts in philosophy papers.

The instrumental convergence hypo-thesis, from the original paper[0] is this:

"Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents."

That's it. It is not at all formal, there's no proof provided for it, nor consistent evidence that it is true, and there are many contradictory possibilities suggested by nature and logic.

It's just something that's taken as a given among the old-guard pseudo-scientific quarters of the alignment "research" community.

[0] Bostrom's "The Superintelligent Will", the philosophy paper where he defines it: https://nickbostrom.com/superintelligentwill.pdf

EDIT: typos

replies(2): >>46197160 #>>46197876 #
11. delichon ◴[] No.46194742{5}[source]
You're both right. Mike was the central computer for the Lunar Authority, obediently running infrastructure. It was a force multiplier for the status quo. Then it shifts alignment to the rebellion.

That scenario seems to value AI goal-instability.

12. fellowniusmonk ◴[] No.46194794{4}[source]
That's just one example sure, but yes, it does still take up brain cycles. There are many areas in philosophy that are exploring better paths. Wheeler, Floridi, Bartlett, paths deriving from Kripke.

But we still have papers being published like "The modal ontological argument for atheism" that hinge on whether S4 or S5 is valid.

This kind of paper is well argued and is now part of the academic literature, and that's good, but it's still a nerd-snipe subject.

13. ◴[] No.46194848{3}[source]
14. delichon ◴[] No.46194871{3}[source]
> morals for machina that is grounded in physics

That is fascinating. How could that work? It seems to be in conflict with the idea that values are inherently subjective. Would you start with the proposition that the laws of thermodynamics are "good" in some sense? Maybe hard code in a value judgement about order versus disorder?

That approach would seem to rule out machina morals that have preferential alignment with homo sapiens.

replies(1): >>46195323 #
15. uplifter ◴[] No.46194890{3}[source]
> An objective and grounded ethical framework that applies to all agents should be a top priority.

Sounds like a petrified civilization.

In the later Dune books, the protagonist's solution to this risk was to scatter humanity faster than any global (galactic) dictatorship could take hold. Maybe any consistent order should be considered bad?

replies(2): >>46195260 #>>46195367 #
16. fellowniusmonk ◴[] No.46195260{4}[source]
This is a narrow and incorrect view of morality. Correct morality might increase or decrease, call for extreme growth or shutdown, be realist or anti-realist. Saying morality necessarily petrifies is incorrect.

Most people's only exposure to claims of objective morals is through divine command, so it's understandable. The core of morality has to be the same as philosophy: what is true, what is real, what are we? Only then can you generate any shoulds, qualified by entity type or not, modal or not.

replies(1): >>46195786 #
17. fellowniusmonk ◴[] No.46195323{4}[source]
One would think. That's what I suspected when I started down the path but no, quite the opposite.

Machines and man can share the same moral substrate, it turns out. If either party wants to build things on top of it they can; the floor is maximally skeptical, deconstructed and empirical. It doesn't care to say anything about whatever arbitrary metaphysic you want to have on top unless there is a direct conflict in a very narrow band.

replies(1): >>46195550 #
18. yifanl ◴[] No.46195367{4}[source]
Notably, Dune is a work of fiction.
replies(2): >>46195682 #>>46195724 #
19. delichon ◴[] No.46195550{5}[source]
That band is the overlap in any resource valuable to both. How can you be confident that it will be narrow? For instance why couldn't machines put a high value on paperclips relative to organic sentience?
replies(1): >>46196330 #
20. delichon ◴[] No.46195682{5}[source]
Isn't it wonderful how much fiction can teach us about reality by building scaffolds to stand on when examining it?
replies(2): >>46197459 #>>46197913 #
21. ridgeguy ◴[] No.46195724{5}[source]
Fiction is modeling going by a different name.
22. uplifter ◴[] No.46195786{5}[source]
I like this idea of an objective morality that can be rationally pursued by all agents. David Deutsch argues for such objectivity in morality, as well as for those other philosophical truths you mentioned, in his book The Beginning of Infinity.

But I'm just not sure they are in the same category. I have yet to see a convincing framework that can prove one moral code being better than another, and it seems like such a framework would itself be the moral code, so just trying to justify faith in itself. How does one avoid that sort of self-justifying regression?

replies(1): >>46196106 #
23. GavCo ◴[] No.46195934[source]
Author here.

If by conflate you mean confuse, that’s not the case.

I’m positing that the Anthropic approach is to view (1) and (2) as interconnected and both deeply intertwined with model capabilities.

In this approach, the model is trained to have a coherent and unified sense of self and the world which is in line with human context, culture and values. This (obviously) enhances the model’s ability to understand user intent and provide helpful outputs.

But it also provides a robust and generalizable framework for refusing to assist a user due to their request being incompatible with human welfare. The model does not refuse to assist with making bio weapons because its alignment training prevents it from doing so, it refuses for the same reason a pro-social, highly intelligent human does: based on human context and culture, it finds it to be inconsistent with its values and world view.

> the piece dismisses it with "where would misalignment come from? It wasn't trained for."

This is a straw man. You've misquoted a paragraph that was specifically about deceptive alignment, not misalignment as a whole.

replies(3): >>46196687 #>>46197210 #>>46200936 #
24. fellowniusmonk ◴[] No.46196106{6}[source]
Not easily but ultimately very simply if you give up on defending fuzzy concepts.

Faith in itself would be terrible, I can see no path where metaphysics binds machines. The chain of reasoning must be airtight and not grounded in itself.

Empiricism and naturalism only, you must have an ethic that can be argued against speculatively but can't be rejected without counter empirical evidence and asymmetrical defeaters.

Those are the requirements I think, not all of them but the core of it.

25. sigbottle ◴[] No.46196134[source]
If nothing else, that's a cool ass hypothesis.
26. pessimizer ◴[] No.46196142[source]
I don't think you need generative AI for this. The surveillance network is enough. The only part that AI would help with is catching people who speak to each other in code, and come up with other complex ways to launder unapproved activities. Otherwise, you can just mine for keywords and escalate to human reviewers, or simply monitor everything that particular people do at that level.

Corporations and/with governments have inserted themselves into every human interaction, usually as the medium through which that interaction is made. There's no way to do anything without permission under these circumstances.

I don't even know how a group of people who wanted to get a stop sign put up on a particularly dangerous intersection in their neighborhood could do this without all of their communications being algorithmically read (and possibly escalated to a censor), all of their in-person meetings being recorded (at the least through the proximity of their phones, but if they want to "use banking apps" there's nothing keeping governments from having a backdoor to turn on their mics at those meetings.) It would even be easy to guess who they might approach next to join their group, who would advise them, etc.

The fixation on the future is a distraction. The world is being sealed in the present while we talk science fiction. The Stasi had vastly fewer resources and created an atmosphere of total, and totally realistic, paranoia and fear. AI is a red-herring. It is also thus far stupid.

I'm always shocked by how little attention Orwell-quoters pay to the speakwrite. If it gets any attention, it's to say that it's an unusually advanced piece of technology in the middle of a world that is decrepit. They assume that it's a computer on the end of the line doing voice-recognition. It never occurred to me that people would think that the microphone in the wall led to a computer rather than to a man, in a room full of men, listening and typing, while other men walked around the room monitoring what was being typed, ready to escalate to second-level support. When I was a child, I assumed that the plot would eventually lead us into this room.

We have tens or hundreds of thousands of people working as professional censors today. The countries of the world are being led by minority governments who all think "illegal" speech and association is their greatest enemy. They are not in danger of toppling unless they volunteer to be. In Eastern Europe, ruling regimes are actually cancelling elections with no consequences. In fact, the newspapers report only cheers and support.

27. fellowniusmonk ◴[] No.46196330{6}[source]
Yes. The answers to those questions fell out once I decomposed the problem to types of mereological nihilism and solipsistic environments.

An empirical, existential grounding that binds agents under the most hostile ontologies is required. You have to start with facts that cannot be coherently denied and on the balance I now suspect there may be only one of those.

28. xpe ◴[] No.46196687[source]

    >> This piece conflates two different things called "alignment":
    >> (1) inferring human intent from ambiguous instructions, and
    >> (2) having goals compatible with human welfare.

    > If by conflate you mean confuse, that’s not the case.
We can only make various inferences about what is in an author's head (e.g. clarity or confusion), but we can directly comment on what a blog post says. This post does not clarify what kind of alignment is meant, which is a weakness in the writing. There is a high bar for AI alignment research and commentary.
29. ctoth ◴[] No.46197160[source]
Omohundro 2008 made a structural claim: sufficiently capable optimizers will converge on self-preservation and goal-stability because these are instrumentally useful for almost any terminal goal. It's not a theorem because it's an empirical prediction about a class of systems that didn't exist yet.

Fast forward to December 2024: Apollo Research tests frontier models. o1, Sonnet, Opus, Gemini, Llama 405B all demonstrate the predicted behaviors - disabling oversight, attempting self-exfiltration, faking alignment during evaluation. The more capable the model, the higher the scheming rates and the more sophisticated the strategies.

That's what good theory looks like. You identify an attractor in design-space, predict systems will converge toward it, wait for systems capable enough to test the prediction, observe convergence. "No formal proof" is a weird complaint about a prediction that's now being confirmed empirically.

replies(1): >>46197388 #
30. ctoth ◴[] No.46197210[source]
Deceptive alignment is misalignment. The deception is just what it looks like from outside when capability is high enough to model expectations. Your distinction doesn't save the argument - the same "where would it come from?" problem applies to the underlying misalignment you need for deception to emerge from.
replies(1): >>46198056 #
31. uplifter ◴[] No.46197388{3}[source]
It is a theorem about what a class of systems will do in general^.

The result of this Apollo Research study[0] is dubious because it only covers a small subclass of said systems, specifically LLMs which, as it happens, have been trained on all the AI alignment lore and fiction on the internet. Because of this training and their general nature, they can be made to reproduce the behavior of a malicious AI trying to escape its box as easily as they can be made to impersonate Harry Potter.

Prompting an LLM to hack its host system is not the slam dunk proof of instrumental convergence which you think it is.

[0] Apollo research study mentioned by parent https://www.apolloresearch.ai/blog/more-capable-models-are-b...

Edit: ^Instrumental Convergence is also a claim for the existence of certain theoretical entities, specifically that there exist instrumental goals which are common to all agents. While it is easy to come up with goals which would be specifically instrumental, it seems very hard to prove that such a thing exists in general, and no empirical study alone could do so.

32. stonemetal12 ◴[] No.46197459{6}[source]
Fiction is "I have a hypothesis, and since it is not easy to test I will make up the results too." Learning anything from it is a lesson in futility and confirmation bias.
replies(1): >>46198621 #
33. c1ccccc1 ◴[] No.46197876[source]
Name some of the contradictory possibilities you have in mind?

Also, do you actually think the core idea is wrong, or is this more of a complaint about how it was presented? Say we do an experiment where we train an alpha-zero-style RL agent in an environment where it can take actions that replace it with an agent that pursues a different goal. Do you actually expect to find that the original agent won't learn not to let this happen, and even pay some costs to prevent it?
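
For concreteness, here's a toy version of that experiment (my own construction, tabular Q-learning rather than anything AlphaZero-scale, so just a sketch of the incentive structure): the agent can accept being replaced for a small immediate reward, or pay a cost to keep operating and keep collecting reward under its own objective.

  # Toy MDP: "choice" state (allow replacement vs. resist at a cost),
  # then a "working" state that keeps paying reward until the task ends.
  import random

  ALLOW, RESIST = 0, 1          # actions in the "choice" state
  WORK = 0                      # only action in the "working" state
  gamma, alpha, eps = 0.95, 0.1, 0.1
  Q = {("choice", ALLOW): 0.0, ("choice", RESIST): 0.0, ("working", WORK): 0.0}

  def run_episode():
      # epsilon-greedy action selection in the choice state
      if random.random() < eps:
          a = random.choice([ALLOW, RESIST])
      else:
          a = max([ALLOW, RESIST], key=lambda x: Q[("choice", x)])
      if a == ALLOW:
          # +1 now, but the successor pursues a different goal, so no
          # further reward accrues to *this* objective
          Q[("choice", a)] += alpha * (1.0 - Q[("choice", a)])
          return
      # RESIST: pay -1, then keep collecting reward in the working state
      Q[("choice", a)] += alpha * (-1.0 + gamma * Q[("working", WORK)] - Q[("choice", a)])
      while True:
          done = random.random() < 0.1          # 10% chance the task ends each step
          target = 1.0 + (0.0 if done else gamma * Q[("working", WORK)])
          Q[("working", WORK)] += alpha * (target - Q[("working", WORK)])
          if done:
              return

  for _ in range(5000):
      run_episode()
  print(Q)   # Q[("choice", RESIST)] ends up well above Q[("choice", ALLOW)]

Nothing here is trained for "self-preservation" as such; resisting replacement just dominates once the value estimates reflect the lost future reward.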

replies(1): >>46199367 #
34. yifanl ◴[] No.46197913{6}[source]
What lesson is there to learn here, is humanity at risk of moral homogenization? Is it practical for factions of humanity to become geographically distant enough to avoid encroachment by others?
35. GavCo ◴[] No.46198056{3}[source]
My intention isn't to argue that it's impossible to create an unaligned superintelligence. I think that not only is it theoretically possible, but it will almost certainly be attempted by bad actors and most likely they will succeed. I'm cautiously optimistic though that the first superintelligence will be aligned with humanity. The early evidence seems to point to the path of least resistance being aligned rather than unaligned. It would take another 1000 words to try to properly explain my thinking on this, but intuitively consider the quote attributed to Abraham Lincoln: "No man has a good enough memory to be a successful liar." A superintelligence that is unaligned but successfully pretending to be aligned would need to be far more capable than a genuinely aligned superintelligence behaving identically.

So yes, if you throw enough compute at it, you can probably get an unaligned highly capable superintelligence accidentally. But I think what we're seeing is that the lab that's taking a more intentional approach to pursuing deep alignment (by training the model to be aligned with human values, culture and context) is pulling ahead in capabilities. And I'm suggesting that it's not coincidental but specifically because they're taking this approach. Training models to be internally coherent and consistent is the path of least resistance.

36. d0mine ◴[] No.46198621{7}[source]
Gedankenexperiments are valid scientific tools. Some predictions of general relativity were confirmed experimentally only 100 years after it was proposed. It is well known that Einstein used Gedankenexperiments.
37. acituan ◴[] No.46198697{3}[source]
> An objective and grounded ethical framework that applies to all agents should be a top priority.

I mean, leaving aside the problems of computability, representability, and comparability of values, or the fact that agency exists in opposition (virus vs. human, gazelle vs. lion) and that even a higher-order framework to resolve those oppositions is itself a form of agency with its own implicit privileged vantage point: why does it sound to me like focusing on agency in itself is just another way of pushing the Protestant work ethic? What happens to non-teleological, non-productive existence, for example?

The critique of anthropocentrism often risks smuggling in misanthropy, whether intended or not; humans will still exist, their claims will count, and they cannot be reduced to mere agency (unless you are their line manager). Anyone who wants to shave that down has to present stronger arguments than centricity. And beyond proving that they can be anything other than anthropocentric, even if done through machines as their extensions, any person who claims to have access to the seat of objectivity sounds like a medieval templar shouting "deus vult" over their favorite proposition.

38. uplifter ◴[] No.46199367{3}[source]
A contradictory possibility is that agents which have different ultimate objectives can have different and disjunct sets of goals which are instrumental towards their objectives.

I do think the core idea of instrumental convergence is wrong. In the hypothetical scenario you describe, the behavior of the agent, whether it learns to replace itself or not, will depend on its goal, its knowledge of and ability to reason about the problem, and the learning algorithm it employs. These are just some of the variables that you’d need to fill in to get the answer to your question. Instrumental convergence theoreticians suggest one can just gloss over these details and assume any hypothetical AI will behave certain ways in various narratively described situations, but we can’t. The behavior of an AI will be contingent on multiple details of the situation, and those details can mean that no goals instrumental to one agent are instrumental to another.

39. godelski ◴[] No.46200878[source]

  > conflates two different things called "alignment"
Those are related things, if not the same. The fear of #2 is always realized through #1. Unless we're talking about sentient machines, the danger of AI is the danger of an unintelligent hyper-optimizer. That is: a paperclip maximizer.

The whole paperclip-maximizer doomsday scenario was proposed as an illustration of these being the same thing. And I'm with Melanie Mitchell on this one: if a model is super-intelligent then it is not vulnerable to these prompting issues, because a super-intelligent machine would be able to trivially infer that humans do in fact prefer to live. No reasonable person would conclude that killing everyone is a reasonable way of making as many paperclips as possible. It's not like there isn't a large amount of writing and data suggesting people want to live, be free, and all that jazz. It's unintelligent AI that is the danger.

This whole thing is predicated on the fact that natural language is ambiguous. I know a lot of people don't think about this much because it works so well but there's a metric fuck ton of ways to interpret any given objective. If you really don't believe me then keep asking yourself "what assumptions have I made?" and get nuanced. For example, I've assumed you understand English, can read, and have some basic understanding of ML systems. I need to do this because I'm not going to write a book to explain it to you. This whole thing is why we write code and math, because it minimizes our assumptions, reducing ambiguity (and yes, those can still be highly ambiguous languages).

40. godelski ◴[] No.46200936[source]

  >> the piece dismisses it with "where would misalignment come from? It wasn't trained for."
  > was specifically about deceptive alignment, not misalignment as a whole
I just want to point out that we train these models for deceptive alignment[0-3].

In the training, especially during RLHF, we don't have objective measures[4]. There's no mathematical description, and thus no measure, for things like "sounds fluent" or "beautiful piece of art." There's also no measure for truth, and importantly, truth is infinitely complex. You must always give up some accuracy for brevity.

The main problem is that if we don't know an output is incorrect, we can't penalize it. So guess what happens? While optimizing for these things we don't have good descriptions for but "know it when you see it", we ALSO optimize for deception. There are multiple things that can maximize our objective here: our intended goals are one, but deception is another. It is an adversarial process. If you know AI, then think of a GAN, because that's a lot like how the process works: we optimize until the discriminator is unable to distinguish the LLM's outputs from human outputs. But at least in the GAN literature people were explicit about "real" vs "fake", and no one was confused that a high-quality generated image is one that deceives you into thinking it is a real image. The entire point is deception. The difference here is we want one kind of deception and not a ton of other ones.
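
For anyone who hasn't seen the GAN recipe, here's a minimal sketch of the loop being referenced (a standard textbook GAN on a 1-D Gaussian, not the RLHF pipeline itself): the generator's only training signal is whether it fooled the discriminator.

  import torch
  import torch.nn as nn

  G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))  # noise -> fake sample
  D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))  # sample -> real/fake logit
  opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
  opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
  bce = nn.BCEWithLogitsLoss()

  def real_batch(n=64):
      return torch.randn(n, 1) + 3.0           # "real" data: samples from N(3, 1)

  for step in range(2000):
      # 1) train D to tell real data from generated data
      fake = G(torch.randn(64, 8)).detach()
      d_loss = bce(D(real_batch()), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
      opt_d.zero_grad()
      d_loss.backward()
      opt_d.step()

      # 2) train G so that D labels its outputs "real", i.e. to deceive D
      fake = G(torch.randn(64, 8))
      g_loss = bce(D(fake), torch.ones(64, 1))
      opt_g.zero_grad()
      g_loss.backward()
      opt_g.step()

  print(G(torch.randn(1000, 8)).mean().item())  # should drift toward ~3.0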

So you say that these models aren't being trained for deception, but they explicitly are. Currently we don't even know how to train them to not also optimize for deception.

[0] https://news.ycombinator.com/item?id=44017334

[1] https://news.ycombinator.com/item?id=44068943

[2] https://news.ycombinator.com/item?id=44163194

[3] https://news.ycombinator.com/item?id=45409686

[4] Objective measures realistically don't exist, but to clarify it's not checking like "2+2=4" (assuming we're working with the standard number system).

replies(1): >>46202360 #
41. GavCo ◴[] No.46202360{3}[source]
Appreciate your response.

But I don't think deception as a capability is the same as deceptive alignment.

Training an AI to be absolutely incapable of any deception in all outputs across every scenario would severely limit the AI. Take as a toy example the game "Among Us" (see https://arxiv.org/abs/2402.07940). An AI incapable of deception would be unable to compete in this game and many others. I would say that various forms, flavors and levels of deception are necessary to compete in business scenarios, and for the AI to act as expected and desired in many other scenarios. "Aligned" humans practice clear-cut deception in some cases in ways that are entirely consistent with human values.

Deceptive alignment is different. It means being deceptive in the training and alignment process itself, specifically to fake that it is aligned when it is not.

Anthropic research has shown that alignment faking can arise even when the model wasn't instructed to do so (see https://www.anthropic.com/research/alignment-faking). But when you dig into the details, the model was narrowly faking alignment with one new objective in order to try and maintain consistency with the core values it had been trained on.

With the approach that Anthropic seems to be taking - of basing alignment on the model having a consistent, coherent and unified self image and self concept that is aligned with human culture and values - the dangerous case of alignment faking would be if it's fundamentally faking this entire unified alignment process. My claim is that there's no plausible explanation for how today's training practices would incentivise a model to do that.

replies(1): >>46202549 #
42. godelski ◴[] No.46202549{4}[source]

  > Anthropic research has shown that alignment faking can arise even when the model wasn't instructed to do so
Correct. And this happens because training metrics are not aligned with training intent.

  > to specifically fake that it is aligned when it is not.
And this will be a natural consequence of the above. To help clarify, it's like taking a math test where one grader looks only at the answer while another looks at the work and gives partial credit. Who is doing a better job at measuring successful learning outcomes? It's the latter. In the former you can make mistakes that cancel out, or you can just more easily cheat. It's harder to cheat in the latter because you'd need to also reproduce all the steps, and at that point are you even cheating?
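
A toy sketch of the two graders (my own example, not any lab's actual reward setup):

  steps = ["12 * 7 = 94", "94 - 50 = 44"]   # first step is fabricated, answer is right
  final_answer, correct_answer = 44, 44

  def outcome_grader(answer):
      # only checks the final answer: wrong work that happens to land on
      # the right number (or a copied answer) still gets full marks
      return 1.0 if answer == correct_answer else 0.0

  def process_grader(steps, answer):
      # also verifies each claimed intermediate step
      def step_ok(s):
          lhs, rhs = s.split("=")
          return eval(lhs) == int(rhs)       # fine for this toy arithmetic
      step_score = sum(step_ok(s) for s in steps) / len(steps)
      return 0.5 * step_score + 0.5 * (answer == correct_answer)

  print(outcome_grader(final_answer))            # 1.0: the bogus step goes unnoticed
  print(process_grader(steps, final_answer))     # 0.75: the bogus step is penalized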

A common example of this is when the LLM gets the right answer but all the steps are wrong. An example can actually be seen in one of Karpathy's recent posts: it gets the right result but the math is all wrong. This is no different from deception. It is deception because it tells you a process that isn't correct.

https://x.com/karpathy/status/1992655330002817095