
168 points 1wheel | 2 comments
byteknight ◴[] No.40429457[source]
This reminds me of how people often communicate to avoid offending others. We tend to soften our opinions or suggestions with phrases like "What if you looked at it this way?" or "You know what I'd do in those situations." By doing this, we subtly dilute the exact emotion or truth we're trying to convey. If we modify our words enough, we might end up with a statement that's completely untruthful. This is similar to how AI models might behave when manipulated to emphasize certain features, leading to responses that are not entirely genuine.
replies(2): >>40429580 #>>40429898 #
HarHarVeryFunny ◴[] No.40429898[source]
A true AGI would learn to manipulate its environment to achieve its goals, but obviously we are not there yet.

An LLM has no goals - it's just a machine optimized to minimize training error, although I suppose you could view this as an innate hard-coded goal of minimizing next-word error (relative to the training set), in the same way we might say a machine-like insect has some "goals".
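
For concreteness, here is roughly what that hard-coded "goal" looks like in code: a minimal sketch of the standard next-token cross-entropy objective (PyTorch-style; the model here is a placeholder assumed to map token ids to vocabulary logits). Nothing in it represents a goal beyond matching the training distribution.

    import torch
    import torch.nn.functional as F

    def next_token_loss(model, token_ids):
        """Standard language-modeling objective: predict token t+1 from tokens <= t.

        `model` is assumed to map a batch of token ids to logits of shape
        (batch, seq_len, vocab_size). The only "goal" encoded here is
        reproducing the training distribution.
        """
        inputs = token_ids[:, :-1]      # everything except the last token
        targets = token_ids[:, 1:]      # everything except the first token
        logits = model(inputs)          # (batch, seq_len - 1, vocab_size)
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )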

Of course RLHF provides an error signal over a longer time span (the entire response rather than the next word) to minimize, but I doubt the training volume is enough for the model to internally model a goal of manipulating the listener, as opposed to just favoring certain surface forms of response.
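By way of contrast, a toy REINFORCE-style sketch of the RLHF case, where the feedback is a single score over the whole sampled response rather than a per-token target. The policy's sampling helper and the reward model are assumed for illustration; they are not any particular library's API.

    import torch

    def rlhf_style_update(policy, reward_model, prompt_ids):
        """Toy policy-gradient step: one scalar reward for the whole response.

        Unlike next-token training, the feedback only arrives after the full
        response is generated, so whatever gets reinforced is whatever surface
        form the reward model happens to score highly.
        """
        # Sample a complete response and keep per-token log-probs
        # (sample_with_log_probs is a hypothetical helper).
        response_ids, log_probs = policy.sample_with_log_probs(prompt_ids)

        # A single scalar judgment of the entire response (assumed model).
        reward = reward_model(prompt_ids, response_ids)

        # REINFORCE: scale the trajectory's log-likelihood by the reward.
        loss = -(reward.detach() * log_probs.sum())
        loss.backward()
        return loss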

replies(2): >>40431850 #>>40440967 #
1. rpozarickij ◴[] No.40440967[source]
The next big breakthrough in the LLM space will be having a way to represent an LLM's goals/intentions and then execute them in whatever way is most appropriate/logical/efficient (I'm pretty sure some really smart people have been thinking about this for a while).

Perhaps at some point LLMs will start to evolve from the prompt->response model into something more asynchronous, with some activity happening in the background as well. A rough sketch of what that could look like follows.
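
Assuming a hypothetical async `llm` callable (prompt string -> response string) and a deliberately naive goal representation, an outer loop could keep a goal queue and call the model in the background instead of doing a single prompt->response exchange:

    import asyncio

    async def background_agent(llm, goals):
        """Hypothetical asynchronous loop around an LLM.

        Goals are plain strings here; a real system would need a richer
        representation of intent and a better way to check progress.
        """
        while goals:
            goal = goals.pop(0)
            plan = await llm(f"Propose the next concrete step toward: {goal}")
            result = await llm(f"Carry out this step and report the outcome: {plan}")
            # Ask the model itself whether the goal still needs work.
            verdict = await llm(f"Goal: {goal}\nOutcome: {result}\nDone? yes/no")
            if "yes" not in verdict.lower():
                goals.append(goal)      # re-queue unfinished goals
            await asyncio.sleep(0)      # yield so other tasks can run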

replies(1): >>40464967 #
2. wumbo ◴[] No.40464967[source]
That’s not really an LLM at that point, but an agent built around an LLM.