1. Vera_Wilde No.45951340
The directional-ablation approach in Heretic is clever: by identifying residual “refusal directions” and ablating them, they shift the model’s trade-off frontier. In rare-event screening terms, they’re effectively changing the geometry of the detection threshold rather than just trying to get better data. It resonates with how improving a test’s accuracy in low-prevalence settings often fails unless you address both the threshold and the base rate.
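To make the ablation step concrete, here is a minimal sketch in plain Python. It assumes the refusal direction is estimated as the normalized difference between mean activations on harmful and harmless prompts, which is a common heuristic and not necessarily what Heretic does. The function names (`refusal_direction`, `ablate`), the `weight` parameter, and the toy activation vectors are all hypothetical, chosen only for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_vector(rows):
    """Component-wise mean of a list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Assumed estimator: unit vector along the difference of class means."""
    mean_harmful = mean_vector(harmful_acts)
    mean_harmless = mean_vector(harmless_acts)
    return normalize([a - b for a, b in zip(mean_harmful, mean_harmless)])


def ablate(activation, direction, weight=1.0):
    """Subtract the component of `activation` along the unit `direction`.

    weight=1.0 removes the component entirely; smaller values remove part of it.
    """
    coeff = weight * dot(activation, direction)
    return [x - coeff * d for x, d in zip(activation, direction)]


# Toy 3-dimensional "activations" (made up for illustration).
harmful = [[2.0, 1.0, 0.0], [2.2, 0.8, 0.1]]
harmless = [[0.0, 1.0, 0.0], [0.2, 0.8, 0.1]]

r = refusal_direction(harmful, harmless)
cleaned = ablate(harmful[0], r)
```

After full ablation, `cleaned` has no component left along `r`, while the components orthogonal to `r` are unchanged. That is the sense in which the intervention changes the geometry instead of the data.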
replies(1): >>45953692 #
2. xmcqdpt2 No.45953692
The paper is great. It really shows how alignment is entirely surface-level and not actually deeply ingrained in the models. Really interesting work.