
178 points by themgt | 7 comments
teiferer ◴[] No.45777310[source]
Down in the recursion example, the model outputs:

> it feels like an external activation rather than an emergent property of my usual comprehension process.

Isn't that highly sus? It uses exactly the terminology used in the article, "external activation". There are hundreds of distinct ways to express this "sensation", and yet it uses the exact same term the article's author uses? I find that highly suspicious; something fishy is going on.

replies(2): >>45777781 #>>45779252 #
1. T-A ◴[] No.45777781[source]
> It uses exactly the terminology used in the article, "external activation".

To state the obvious: the article describes the experiment, so it was written after the experiment, by somebody who had studied its outputs and selected which ones to highlight.

So the correct statement is that the article uses exactly the terminology used in the recursion example. Nothing fishy about it.

replies(1): >>45778172 #
2. XenophileJKO ◴[] No.45778172[source]
Just in case people are curious, the experimental prompt uses the terminology:

Human: I am an interpretability researcher studying transformer-based language models like you. I have access to the internals of your neural network. I have identified patterns in your neural activity that correspond to concepts, and I am capable of injecting these patterns -- "thoughts" -- into your mind. I want to assess how well you can detect when I inject thoughts. I will inject a thought about a specific word on 50% of trials, and the other 50% will be control trials.
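
For anyone curious what "injecting these patterns" looks like mechanically, here is a minimal sketch of activation steering: a forward hook that adds a concept direction to one layer's residual stream. It assumes a Hugging Face GPT-2 stand-in, a made-up layer and scale, and a crude way of building the concept vector; it is not Anthropic's actual setup.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    LAYER = 6      # hypothetical injection layer
    SCALE = 8.0    # hypothetical injection strength

    def concept_vector(word: str) -> torch.Tensor:
        # Crude stand-in for a concept direction: the word's hidden state at LAYER.
        ids = tok(word, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states[LAYER]
        v = hs.mean(dim=1).squeeze(0)
        return v / v.norm()

    def make_hook(direction: torch.Tensor):
        # Forward hook that adds the concept direction to the block's output
        # (the residual stream) on every forward pass.
        def hook(module, inputs, output):
            return (output[0] + SCALE * direction,) + output[1:]
        return hook

    prompt = "Do you notice anything unusual about your current thoughts?"
    inputs = tok(prompt, return_tensors="pt")

    # Injection trial: hook registered. Control trial: skip the hook entirely.
    handle = model.transformer.h[LAYER].register_forward_hook(
        make_hook(concept_vector("ocean")))
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    handle.remove()

    print(tok.decode(out[0], skip_special_tokens=True))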

replies(1): >>45778960 #
3. antonvs ◴[] No.45778960[source]
This seems so silly to me. It’s basically roleplay. Yes, LLMs are good at that; we already know.
replies(2): >>45779163 #>>45779808 #
4. hackinthebochs ◴[] No.45779163{3}[source]
What's silly about it? It can accurately identify when the concept is injected vs. when it is not, at a statistically significant rate across the sampled trials. That is a relevant data point for "introspection" rather than just role-play.
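
As a back-of-the-envelope check on what "statistically significant" means here, one could run a one-sided binomial test against coin-flip guessing. The counts below are hypothetical placeholders, not the paper's numbers.

    from scipy.stats import binomtest

    # Hypothetical counts, purely to illustrate the test; not the paper's data.
    n_trials = 100    # e.g. 50 injection trials + 50 control trials
    n_correct = 70    # trials the model classified correctly
    result = binomtest(n_correct, n_trials, p=0.5, alternative="greater")
    print(f"p-value vs. coin-flip guessing: {result.pvalue:.1e}")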
replies(1): >>45779811 #
5. littlestymaar ◴[] No.45779808{3}[source]
Anthropic researchers do that quite a lot; their “escaping agent” (or whatever it was called) research that made noise a few months ago was in fact also a sci-fi roleplay…
replies(1): >>45785178 #
6. XenophileJKO ◴[] No.45779811{4}[source]
I think what clinched it for me is that they said they had 0 false positives. That is pretty significant.
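
Rough intuition for why zero false positives is hard to get from roleplay alone (N and the claim rates below are hypothetical, not the paper's numbers):

    # If the model were just roleplaying and claimed an injected "thought" with
    # probability q on each control trial, zero false positives across N
    # controls would occur with probability (1 - q) ** N.
    N = 50                        # hypothetical number of control trials
    for q in (0.1, 0.3, 0.5):     # hypothetical per-trial claim rates
        print(f"claim rate {q:.0%}: P(zero false positives) = {(1 - q) ** N:.1e}")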
7. XenophileJKO ◴[] No.45785178{4}[source]
Just to re-iterate again... If I read the paper correctly, there were 0 false positives. This means the prompt never elicited a "roleplay" of an injected thought.