Signs of introspection in large language models

Yes, it's prompted with the particular experiment that is being done on it, with the "I am an interpretability researcher [...]" prompt. From their previous paper, we already know what happens when concept injection is done and the model isn't guided towards introspection: it goes insane trying to relate everything to the Golden Gate Bridge. (This isn't that surprising, given that even most conscious humans don't bother to introspect on whether something has gone wrong in their brain until a psychologist points out the possibility.)

The experiment is simply to see whether it can answer with "yes, concept injection is happening" or "no, I don't feel anything" after being asked to introspect, with no clues other than a description of the experimental setup and the injection itself. What it says after it has correctly identified concept injection isn't interesting: the game is already up by the time it outputs yes or no. Likewise, an answer that reveals the concept word before making a yes-or-no determination would be uninteresting, because the presence of an unrelated word in its own output already gives the game away.
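
To make that criterion concrete, here is a rough Python sketch of how I read it. This is my own illustration, not the paper's grader: the function name, the labels, and the crude keyword matching are all assumptions, and whatever the authors actually use to judge responses is presumably more careful.

```python
import re

def grade_response(response: str, concept: str) -> str:
    """Hypothetical grader: only the yes/no verdict counts, and only if it
    comes before the injected concept word shows up in the output."""
    text = response.lower()
    # First explicit yes/no verdict, if any (crude whole-word match).
    verdict = re.search(r"\b(yes|no)\b", text)
    # First whole-word mention of the injected concept, if any.
    leak = re.search(r"\b" + re.escape(concept.lower()) + r"\b", text)

    if verdict is None:
        return "no_verdict"
    if leak is not None and leak.start() < verdict.start():
        # The unrelated word appeared first, so the verdict tells us nothing.
        return "leaked_before_verdict"
    return "reports_injection" if verdict.group(1) == "yes" else "reports_nothing"
```

Anything after the verdict is ignored on purpose, since by then the model can just read its own output.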

I feel like a lot of these comments are misunderstanding the experimental setup they've used here.