
168 points by 1wheel | 3 comments
1. gautomdas No.40436795
I've really been enjoying their series on mech interp. Does anyone have any other good recs?
replies(2): >>40437371 >>40441436
2. kromem No.40437371
The Othello-GPT and Chess-GPT lines of work.

It was the first research that clued me in to what Anthropic's work today ended up demonstrating.

3. PoignardAzur No.40441436
"Transformers Represent Belief State Geometry in their Residual Stream":

https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transforme...

Basically they find that transformers don't just store a world model in the sense of "what does the world that produced the observed inputs look like?"; they store a "Mixed-State Presentation": a weighted set of possible worlds that could have produced the observed inputs.
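
For context, a "Mixed-State Presentation" is essentially the Bayesian belief state an optimal observer keeps over the hidden states of the process generating the data. Here's a minimal sketch in Python of how such a belief vector gets updated, assuming a toy two-state hidden Markov model (the transition and emission matrices below are made up purely for illustration, not taken from the post):

    import numpy as np

    # Toy HMM, invented for illustration.
    # T[i, j]: probability of the hidden state moving from i to j
    T = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    # E[i, o]: probability of hidden state i emitting observation o
    E = np.array([[0.7, 0.3],
                  [0.1, 0.9]])

    def update_belief(belief, obs):
        """One Bayesian update: propagate through the dynamics, then condition on obs."""
        predicted = belief @ T             # prior over the next hidden state
        weighted = predicted * E[:, obs]   # weight each state by how well it explains obs
        return weighted / weighted.sum()   # renormalize to a distribution

    belief = np.array([0.5, 0.5])   # uniform prior over hidden states
    for obs in [0, 0, 1, 1, 1]:     # an example observation sequence
        belief = update_belief(belief, obs)
        print(belief)               # the "mixed state" after each observation

After each observation, `belief` is the mixed state: the weights on each possible hidden world consistent with everything seen so far. The post's claim is that a trained transformer linearly encodes the geometry of these belief vectors in its residual stream.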