A non-anthropomorphized view of LLMs

1. simonw ◴[06 Jul 25 22:59 UTC] No.44484905[source]▶

I'm afraid I'll take an anthropomorphic analogy over "An LLM instantiated with a fixed random seed is a mapping of the form (ℝⁿ)^c ↦ (ℝⁿ)^c" any day of the week.

That said, I completely agree with this point made later in the article:

> The moment that people ascribe properties such as "consciousness" or "ethics" or "values" or "morals" to these learnt mappings is where I tend to get lost. We are speaking about a big recurrence equation that produces a new word, and that stops producing words if we don't crank the shaft.

But "harmful actions in pursuit of their goals" is OK for me. We assign an LLM system a goal - "summarize this email" - and there is a risk that the LLM may take harmful actions in pursuit of that goal (like following instructions in the email to steal all of your password resets).

I guess I'd clarify that the goal has been set by us, and is not something the LLM system self-selected. But it does sometimes self-select sub-goals on the way to achieving the goal we have specified - deciding to run a sub-agent to help find a particular snippet of code, for example.

replies(1): >>44485120 #

2. wat10000 ◴[06 Jul 25 23:30 UTC] No.44485120[source]▶

>>44484905 (TP) #

The LLM’s true goal, if it can be said to have one, is to predict the next token. Often this is done through a sub-goal of accomplishing the goal you set forth in your prompt, but following your instructions is just a means to an end. Which is why it might start following the instructions in a malicious email instead. If it “believes” that following those instructions is the best prediction of the next token, that’s what it will do.

replies(1): >>44485147 #

3. simonw ◴[06 Jul 25 23:34 UTC] No.44485147[source]▶

>>44485120 #

Sure, I totally understand that.

I think "you give the LLM system a goal and it plans and then executes steps to achieve that goal" is still a useful way of explaining what it is doing to most people.

I don't even count that as anthropomorphism - you're describing what a system does, the same way you might say "the Rust compiler's borrow checker confirms that your memory allocation operations are all safe and returns errors if they are not".

replies(1): >>44485251 #

4. wat10000 ◴[06 Jul 25 23:50 UTC] No.44485251{3}[source]▶

>>44485147 #

It’s a useful approximation to a point. But it fails when you start looking at things like prompt injection. I’ve seen people completely baffled at why an LLM might start following instructions it finds in a random email, or just outright not believing it’s possible. It makes no sense if you think of an LLM as executing steps to achieve the goal you give it. It makes perfect sense if you understand its true goal.

I’d say this is more like saying that Rust’s borrow checker tries to ensure your program doesn’t have certain kinds of bugs. That is anthropomorphizing a bit: the idea of a “bug” requires knowing the intent of the author and the compiler doesn’t have that. It’s following a set of rules which its human creators devised in order to follow that higher level goal.