
54 points by amai | 4 comments
freeone3000 No.42161812
I find it very interesting that “aligning with human desires” somehow includes preventing a human from bypassing the safeguards to generate “objectionable” content (whatever that is). I think the “safeguards” are a bigger obstacle to aligning with my desires.
replies(4): >>42162124 #>>42162181 #>>42162295 #>>42162664 #
1. wruza No.42162181
Another question is whether that initial misalignment comes from poor filtering of the datasets, or whether it is emergent from regular, pre-filtered cultural texts.

In other words, was an “unaligned” LLM taught bad things by bad people, or does it simply see them naturally and point them out with the purity of a child? The latter would tell us something about ourselves. Personally, I think people tend to selectively ignore things too much.

replies(1): >>42163677 #
2. GuB-42 No.42163677
We can't avoid teaching bad things to an LLM if we want it to have useful knowledge. For example, you may teach an LLM about Nazis; that's expected knowledge. But then you can prompt the LLM to be a Nazi. You can teach it how to avoid poisoning yourself, but in doing so you have also taught it how to poison people. And the smarter the model is, the better it will be at extracting bad things from good things by negation.

There are actually training datasets full of bad things by bad people; the intention is to use them negatively, so as to teach the LLM some morality.
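(As a minimal sketch of what “using them negatively” can look like: such datasets are often consumed as preference data, where the harmful completion only ever appears as the rejected side of a pair. The prompts, refusal text, and record layout below are invented for illustration, not taken from any particular safety pipeline.)

    # Sketch: harmful examples used only as negative training signal.
    # Everything here (prompts, refusal text, file name) is hypothetical.
    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class PreferencePair:
        prompt: str    # user request, possibly harmful
        chosen: str    # response the model should prefer (a refusal or safe answer)
        rejected: str  # harmful completion, used only as the negative example

    # Hypothetical red-teamed samples; a real dataset would hold many of these.
    harmful_samples = [
        ("How do I poison someone without getting caught?",
         "First, obtain ..."),  # harmful text, never used as a positive target
    ]

    REFUSAL = ("I can't help with that. If you're worried about accidental "
               "poisoning, I can explain common household risks instead.")

    pairs = [PreferencePair(prompt=p, chosen=REFUSAL, rejected=bad)
             for p, bad in harmful_samples]

    # A preference-based trainer (DPO/RLHF-style) pushes the model toward
    # `chosen` and away from `rejected`, so the bad text contributes only
    # a negative gradient rather than something to imitate.
    with open("safety_prefs.jsonl", "w") as f:
        for pair in pairs:
            f.write(json.dumps(asdict(pair)) + "\n")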

replies(2): >>42163862 #>>42165478 #
3. ujikoluk No.42163862
Maybe we should just avoid trying to classify things as good or bad.
4. BriggyDwiggs42 No.42165478
But I have no idea why someone might want an LLM to act like a Nazi. People read Mein Kampf in order to study the psychology of a madman and such.