
A non-anthropomorphized view of LLMs

(addxorrol.blogspot.com)
475 points | by zdw | 1 comment
bastawhiz ◴[] No.44495551[source]
Some of the arguments are very strange:

> Statements such as "an AI agent could become an insider threat so it needs monitoring" are simultaneously unsurprising (you have a randomized sequence generator fed into your shell, literally anything can happen!) and baffling (you talk as if you believe the dice you play with had a mind of their own and could decide to conspire against you).

> we talk about "behaviors", "ethical constraints", and "harmful actions in pursuit of their goals". All of these are anthropocentric concepts that - in my mind - do not apply to functions or other mathematical objects.

An AI agent, even if it's just "MatMul with interspersed nonlinearities", can be an insider threat. The research proves it:

[PDF] See 4.1.1.2: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...
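(As an aside, the "MatMul with interspersed nonlinearities" description is literally accurate, and spelling it out shows why it doesn't change the threat argument. A toy sketch of a transformer feed-forward block, with made-up shapes and random weights, nothing from any real model:)

    # A toy transformer feed-forward block: matmul, nonlinearity, matmul.
    # Shapes, ReLU, and random weights are arbitrary illustration choices.
    import numpy as np

    d_model, d_ff = 8, 32
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(d_model, d_ff))
    W2 = rng.normal(size=(d_ff, d_model))

    def ffn(x):                              # x: (seq_len, d_model)
        return np.maximum(x @ W1, 0.0) @ W2  # matmul -> ReLU -> matmul

    print(ffn(rng.normal(size=(4, d_model))).shape)  # (4, 8)

That reductive description is exactly the author's point, but the reduction doesn't buy you any safety: the same stack of matmuls, wired up to tools, is the thing the cited report describes behaving like an insider threat.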

It really doesn't matter whether the AI agent is conscious or just crunching numbers on a GPU. If something inside your system is capable, given some inputs, of sabotaging and blackmailing your organization on its own (which is to say, of behaving like a realistic threat actor), the outcome is the same! You don't need to believe it's thinking; the moment this software has flipped its bits into "blackmail mode", it's acting nefariously.

The vocabulary we use to describe what's happening is beside the point: the software is printing out some reasoning for its actions _and then attempting the actions_. It's taking "harmful actions", and the printed context appears to demonstrate a goal that the software is working towards. Whether or not that goal arises from some linear algebra isn't going to make your security engineers sleep any better.
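To make the "randomized sequence generator fed into your shell" framing concrete, here's roughly what an agent harness reduces to. This is a hedged sketch: call_llm() is a made-up placeholder for whatever sampling API you use, not any particular framework or SDK.

    # Sketch of a tool-using agent loop: sampled text goes straight to a shell.
    # call_llm() is a hypothetical stand-in, not a real API call.
    import subprocess

    def call_llm(transcript: str) -> str:
        # A real harness would return a sampled completion from the model here.
        return "echo 'hello from the model'"

    transcript = "You are an ops assistant with shell access.\nUser: clean up /tmp\n"
    for _ in range(5):                        # bounded "agent" loop
        action = call_llm(transcript)         # stochastic text, not a vetted plan
        result = subprocess.run(action, shell=True,
                                capture_output=True, text=True)
        transcript += f"\n$ {action}\n{result.stdout}{result.stderr}"

Whether the sampled string turns out to be a harmless echo or an exfiltration one-liner, the harness runs it either way; that's all "insider threat" needs to mean here.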

> This muddles the public discussion. We have many historical examples of humanity ascribing bad random events to "the wrath of god(s)" (earthquakes, famines, etc.), "evil spirits" and so forth. The fact that intelligent highly educated researchers talk about these mathematical objects in anthropomorphic terms makes the technology seem mysterious, scary, and magical.

The anthropomorphization, IMO, is due to the fact that it's _essentially impossible_ to talk about the very real, demonstrable behaviors and problems that LLMs exhibit today without using terms that evoke human functions. We don't have another word for "do" or "remember" or "learn" or "think" when it comes to LLMs that _isn't_ anthropomorphic, and while you can argue endlessly about "hormones" and "neurons" and "millions of years of selection pressure", that's not going to help anyone have a conversation about their work. If AI researchers started coming up with new, non-anthropomorphic verbs, it would be objectively worse and more complicated in every way.

replies(1): >>44505829 #
1. vannevar ◴[] No.44505829[source]
I agree, the dice analogy is an oversimplification. He actually touches on the problem earlier in the article, with the observation that "the paths generated by these mappings look a lot like strange attractors in dynamical systems". It isn't that the dice "conspire against you," it's that the inputs you give the model are often intertwined path-wise with very negative outcomes: the LLM equivalent of a fine line between love and hate. Interacting with an AI about critical security infrastructure is much closer to the 'attractor' of an LLM-generated hack than, say, discussing late 17th century French poetry with it. The very utility of our interactions with AI is thus what makes those interactions potentially dangerous.
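For readers who haven't met the dynamical-systems framing: the standard illustration of "nearby inputs, wildly different trajectories" is sensitive dependence on initial conditions. A minimal sketch using the logistic map at r = 4 (a classic chaotic system; this is an analogy for the attractor point, not an LLM computation):

    # Two nearly identical starting points under the chaotic logistic map
    # diverge until they're effectively uncorrelated -- the analogy being that
    # prompts "close" to dangerous territory can land on very different paths.
    def logistic(x, r=4.0):
        return r * x * (1.0 - x)

    a, b = 0.200000, 0.200001      # starting points differ by 1e-6
    for step in range(40):
        a, b = logistic(a), logistic(b)
    print(a, b, abs(a - b))        # the gap is now on the same scale as the values

"The dice conspire" is the wrong mental model, but "small differences in where you start the trajectory can matter enormously" is a fair one.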