(www.anthropic.com)

178 points themgt | 2 comments | 30 Oct 25 16:45 UTC

simgt ◴[31 Oct 25 12:11 UTC] No.45771152[source]▶

>>45762064 (OP) #

> First, we find a pattern of neural activity (a vector) representing the concept of “all caps.” We do this by recording the model’s neural activations in response to a prompt containing all-caps text, and comparing these to its responses on a control prompt.

What does "comparing" refer to here? The drawing shows them subtracting the activations for the two prompts. Is it really this easy?

replies(1): >>45771643 #

1. embedding-shape ◴[31 Oct 25 13:14 UTC] No.45771643[source]▶

>>45771152 #

Run with normal prompt > record neural activations

Run with ALL CAPS PROMPT > record neural activations

Then compare/diff them.
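
The steps above can be sketched roughly as follows. This is only a toy illustration, not the article's actual code: `record_activations` is a hypothetical stand-in for hooking a real model layer, and the activations are simulated with NumPy so that an "all caps" direction exists to be recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16  # hypothetical hidden size, chosen only for the demo

# Assumption: all-caps prompts add one fixed direction to the activations.
caps_direction = rng.normal(size=HIDDEN)

def record_activations(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for running a model and reading one layer."""
    acts = rng.normal(scale=0.1, size=HIDDEN)  # prompt-specific noise
    if prompt.isupper():
        acts = acts + caps_direction
    return acts

# Run with normal prompt > record neural activations
control = record_activations("hi, how are you?")
# Run with ALL CAPS PROMPT > record neural activations
all_caps = record_activations("HI, HOW ARE YOU?")

# Compare/diff: subtract the control activations from the all-caps ones.
concept_vector = all_caps - control

# In this simulation the difference should point along the planted direction.
cosine = concept_vector @ caps_direction / (
    np.linalg.norm(concept_vector) * np.linalg.norm(caps_direction)
)
print(f"cosine similarity with planted direction: {cosine:.3f}")
```

With a real model the planted direction is unknown, so there is nothing to compare against; the subtraction itself is the whole trick.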

It does sound almost too simple to me too, but then lots of ML things sound like "but yeah of course, duh" once they've been "discovered". I guess that's the power of hindsight.

replies(1): >>45776973 #

2. griffzhowl ◴[31 Oct 25 21:36 UTC] No.45776973[source]▶

>>45771643 (TP) #

That's also reminiscent of neuroscience studies with fMRI where the methodology is basically

MRI during task - MRI during control = brain areas involved with the task

In fact, it's effectively the same idea. I suppose in both cases the processes in the network are too complicated to usefully analyze directly, and yet the basic principles are simple enough that this comparative procedure gives useful information.
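
As a toy version of that subtraction, here is a sketch on synthetic data. Everything in it is an assumption for illustration: the voxel counts, the set of "active" voxels, and the threshold are made up, and real fMRI studies add statistical testing that is not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)
N_SCANS, N_VOXELS = 20, 100
active = np.arange(10)  # hypothetical: voxels 0-9 respond to the task

# Simulated signal per scan and voxel: pure noise during control,
# noise plus a boost in the active voxels during the task.
control_scans = rng.normal(size=(N_SCANS, N_VOXELS))
task_scans = rng.normal(size=(N_SCANS, N_VOXELS))
task_scans[:, active] += 5.0

# "MRI during task - MRI during control", averaged over scans.
contrast = task_scans.mean(axis=0) - control_scans.mean(axis=0)

# Voxels whose contrast clears an arbitrary threshold.
detected = np.flatnonzero(contrast > 2.5)
print(detected)
```

Averaging over scans before subtracting is what lets the task-related signal stand out from the per-scan noise.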

Signs of introspection in large language models