
745 points by melded | 1 comment
mwcz ◴[] No.45946436[source]
This is so interesting. Safety refusal operates along a single dimension, if I'm reading this right: add a value along that direction and the model refuses to cooperate; subtract it and the model will do anything you ask. I'm probably oversimplifying, but I think that's the gist (see the sketch below).

Obfuscating model safety may become the next reverse engineering arms race.
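
To make the add/subtract picture concrete, here is a minimal sketch of steering and ablating along a single "refusal direction". The function names, shapes, and the precomputed direction are illustrative assumptions, not Heretic's or the paper's actual code:

    import torch

    def steer(activations: torch.Tensor, direction: torch.Tensor,
              alpha: float) -> torch.Tensor:
        # Shift each activation vector along the refusal direction.
        # alpha > 0 pushes toward refusal; alpha < 0 pushes away from it.
        # activations: (batch, d_model); direction: (d_model,).
        d = direction / direction.norm()
        return activations + alpha * d

    def ablate(activations: torch.Tensor,
               direction: torch.Tensor) -> torch.Tensor:
        # Projection-based ablation: remove the component along the
        # refusal direction entirely, so it can't be expressed at all.
        d = direction / direction.norm()
        proj = (activations @ d).unsqueeze(-1) * d  # per-row component along d
        return activations - proj

Applying the ablation at every layer's residual stream is what turns "subtract the value" into a permanent modification rather than a per-prompt trick.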

replies(1): >>45946461 #
andy99 ◴[] No.45946461[source]
See https://arxiv.org/abs/2406.11717 Refusal in Language Models Is Mediated by a Single Direction (June 2024)

All “alignment” is extremely shallow, thus the general ease of jailbreaks.
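
The paper's core recipe is difference-in-means: average the residual-stream activations over a set of harmful prompts and over a set of harmless prompts at a chosen layer and token position, and take the normalized difference as the refusal direction. A rough sketch, assuming you have already collected those activation matrices (names and shapes here are illustrative):

    import torch

    def refusal_direction(harmful_acts: torch.Tensor,
                          harmless_acts: torch.Tensor) -> torch.Tensor:
        # harmful_acts, harmless_acts: (n_prompts, d_model) hidden states
        # taken from the same layer and token position.
        direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
        return direction / direction.norm()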

replies(2): >>45946648 #>>45947741 #
mwcz ◴[] No.45947741[source]
Yes, I wasn't clear: that is the paper I was reading, not the Heretic README.
replies(1): >>45948082 #
andy99 ◴[] No.45948082[source]
Ah, I didn't actually RTFA and see the paper there; I assumed from your comment that it wasn't mentioned and posted it because I already knew about it :) Anyway, hopefully it was useful for someone.