(transformer-circuits.pub)

168 points 1wheel | 1 comments | 21 May 24 15:15 UTC

pagekicker ◴[21 May 24 21:54 UTC] No.40434535[source]▶

The article doesn't explain how users can exploit these features through a UI or a prompt. Does anyone have any insight into how to do so?

replies(1): >>40435715 #

1. CephalopodMD ◴[21 May 24 23:52 UTC] No.40435715[source]▶

>>40434535 #

They explicitly aren't releasing any tools to do this with their models for safety reasons. But you could probably do it from scratch with one of the open models by following their methodology.
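
A rough sketch of what that could look like, assuming the methodology here means a sparse autoencoder (SAE) over a model's residual-stream activations, with steering done by clamping one feature and decoding back. Everything below is hypothetical: the weights are random toy values rather than a trained SAE, no real model is involved, and the names `encode`, `decode`, and `steer` are illustrative, not from any released tool.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def encode(x, W_enc, b_enc):
    # Feature activations: f = ReLU(W_enc @ x + b_enc)
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]


def decode(f, W_dec, b_dec):
    # Reconstruction: x_hat = sum_i f[i] * W_dec[i] + b_dec
    # W_dec holds one direction (length d_model) per feature.
    out = list(b_dec)
    for fi, direction in zip(f, W_dec):
        for j, dj in enumerate(direction):
            out[j] += fi * dj
    return out


def steer(x, W_enc, b_enc, W_dec, b_dec, feature, value):
    # Split x into SAE reconstruction + error, clamp one feature,
    # then add the original error back to the new reconstruction.
    f = encode(x, W_enc, b_enc)
    x_hat = decode(f, W_dec, b_dec)
    error = [xi - hi for xi, hi in zip(x, x_hat)]
    f[feature] = value
    steered_hat = decode(f, W_dec, b_dec)
    return [si + ei for si, ei in zip(steered_hat, error)]


# Toy setup (assumption): random weights stand in for a trained SAE,
# and x stands in for one residual-stream activation vector.
random.seed(0)
d_model, n_features = 4, 6
W_enc = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
W_dec = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_features)]
b_dec = [0.0] * d_model
x = [random.uniform(-1, 1) for _ in range(d_model)]

baseline = encode(x, W_enc, b_enc)
steered = steer(x, W_enc, b_enc, W_dec, b_dec, feature=2, value=10.0)
```

With an open model you would grab `x` from a forward hook at some layer, train the encoder/decoder on lots of those activations with a sparsity penalty, and write the steered vector back into the forward pass. Clamping a feature only shifts the activation along that feature's decoder direction, which is why the error term is carried over unchanged.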

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet