
745 points by melded | 1 comment
mwcz ◴[] No.45946436[source]
This is so interesting. Safety refusal operates along a single dimension, if I'm reading this right: add a value along that direction and the model refuses to cooperate; subtract it and the model will do anything you ask. I'm probably oversimplifying, but I think that's the gist (see the sketch below).

Obfuscating model safety may become the next reverse engineering arms race.
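
To make the add/subtract picture concrete, here is a minimal sketch of steering and ablating along a single "refusal direction". The function names, shapes, and the precomputed direction are illustrative assumptions, not Heretic's or the paper's actual code:

    import torch

    def steer(activations: torch.Tensor, direction: torch.Tensor,
              alpha: float) -> torch.Tensor:
        # Shift each activation vector along the refusal direction.
        # alpha > 0 pushes toward refusal; alpha < 0 pushes away from it.
        # activations: (batch, d_model); direction: (d_model,).
        d = direction / direction.norm()
        return activations + alpha * d

    def ablate(activations: torch.Tensor,
               direction: torch.Tensor) -> torch.Tensor:
        # Projection-based ablation: remove the component along the
        # refusal direction entirely, so it can't be expressed at all.
        d = direction / direction.norm()
        proj = (activations @ d).unsqueeze(-1) * d  # per-row component along d
        return activations - proj

Applying the ablation at every layer's residual stream is what turns "subtract the value" into a permanent modification rather than a per-prompt trick.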

replies(1): >>45946461 #
andy99 ◴[] No.45946461[source]
See https://arxiv.org/abs/2406.11717 Refusal in Language Models Is Mediated by a Single Direction (June 2024)

All “alignment” is extremely shallow, thus the general ease of jailbreaks.
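
The paper's core recipe is difference-in-means: average the residual-stream activations over a set of harmful prompts and over a set of harmless prompts at a chosen layer and token position, and take the normalized difference as the refusal direction. A rough sketch, assuming you have already collected those activation matrices (names and shapes here are illustrative):

    import torch

    def refusal_direction(harmful_acts: torch.Tensor,
                          harmless_acts: torch.Tensor) -> torch.Tensor:
        # harmful_acts, harmless_acts: (n_prompts, d_model) hidden states
        # taken from the same layer and token position.
        direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
        return direction / direction.norm()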

replies(2): >>45946648 #>>45947741 #
mwcz ◴[] No.45947741[source]
Yes, I wasn't clear: that is the paper I was reading, not the Heretic README.
replies(1): >>45948082 #
andy99 ◴[] No.45948082[source]
Ah, I didn't actually RTFA and see the paper there; I assumed from your comment that it wasn't mentioned and posted it because I already knew about it :) Anyway, hopefully it was useful for someone.