Signs of introspection in large language models

Bah. It's a really cool idea, but a rather crude way to measure the outputs.

If you just ask the model in plain text, the actual "decision" whether it detected anything or not is made by by the time it outputs the second word ("don't" vs. "notice"). The rest of the output builds up from that one token and is not that interesting.

A way cooler way to run such experiments is to measure the actual token probabilities at such decision points. OpenAI has the logprob API for that, don't know about Anthropic. If not, you can sort of proxy it by asking the model to rate on a scale from 0-9 (must be a single token!) how much it think it's being under influence. The score must be the first token in its output though!

Another interesting way to measure would be to ask it for a JSON like this:

  "possible injected concept in 1 word" : <strength 0-9>, ...

Again, the rigid structure of the JSON will eliminate the interference from the language structure, and will give more consistent and measurable outputs.

It's also notable how over-amplifying the injected concept quickly overpowers the pathways trained to reproduce the natural language structure, so the model becomes totally incoherent.

I would love to fiddle with something like this in Ollama, but am not very familiar with its internals. Can anyone here give a brief pointer where I should be looking if I wanted to access the activation vector from a particular layer before it starts producing the tokens?