SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

1. freeone3000 ◴[17 Nov 24 03:40 UTC] No.42161812[source]▶

I find it very interesting that “aligning with human desires” somehow includes prevention of a human trying to bypass the safeguards to generate “objectionable” content (whatever that is). I think the “safeguards” are a bigger problem with aligning with my desires.

replies(4): >>42162124 #>>42162181 #>>42162295 #>>42162664 #

2. ipython ◴[17 Nov 24 05:09 UTC] No.42162124[source]▶

>>42161812 (TP) #

We’ve seen where that ends up. https://en.m.wikipedia.org/wiki/Tay_(chatbot)

3. wruza ◴[17 Nov 24 05:25 UTC] No.42162181[source]▶

>>42161812 (TP) #

Another question is whether that initial unalignment comes from poor filtering of datasets, or is it emergent from regular, pre-filtered cultured texts.

In other words, was an “unaligned” LLM taught bad things from bad people, or does it simply see it naturally and point it out with the purity of a child? The latter would mean something about ourselves. Personally I think that people tend to selectively ignore things too much.

replies(1): >>42163677 #

4. threeseed ◴[17 Nov 24 06:01 UTC] No.42162295[source]▶

>>42161812 (TP) #

The safeguards stem from a desire to make tools like Claude accessible to a very wide audience, as use cases such as education are very important.

And so it seems like people such as yourself who do have an issue with safeguards should seek out LLMs that are catered to adult audiences rather than trying to remove safeguards entirely.

replies(3): >>42162675 #>>42163652 #>>42165642 #

5. Zambyte ◴[17 Nov 24 07:50 UTC] No.42162664[source]▶

>>42161812 (TP) #

What tools do we have to defend against LLM lockdown attacks?

6. Zambyte ◴[17 Nov 24 07:52 UTC] No.42162675[source]▶

>>42162295 #

How does making it harder for the user to extract information they are trying to extract make it safer for a wider audience?

replies(2): >>42162977 #>>42163153 #

7. dbspin ◴[17 Nov 24 09:13 UTC] No.42162977{3}[source]▶

>>42162675 #

Assuming that this question is good faith...

There are numerous things that might be true but that may be damaging to a child's development to be exposed to: from overly punitive criticism, to graphic depictions of violence, to advocacy of and specific directions for self-harm. Countless examples are trivial to generate.

Similarly, the use of these tools is already having dramatic effects on spear phishing, misinformation etc. Guardrails on all the non open-source models have an enormous impact on slowing / limiting the damage this has at scale. Even with retrained Llama-based models, it's more difficult than you might imagine to create a truly Machiavellian or uncensored LLM - which is entirely due to the work that's been done during and post training to constrain those behaviours. This is an unalloyed good in constraining the weaponisation of LLMs.

8. Drakim ◴[17 Nov 24 09:55 UTC] No.42163153{3}[source]▶

>>42162675 #

That's like asking why we should have porn filters on school computers, after all, all it does is prevent the user from finding what they are looking for, which is bad.

replies(1): >>42172066 #

9. selfhoster11 ◴[17 Nov 24 11:48 UTC] No.42163652[source]▶

>>42162295 #

Here is a revolutionary concept: give the users a toggle.

Make it controllable by an IT department if logging in with an organisation-tied account, but give people a choice.

replies(1): >>42166788 #

10. GuB-42 ◴[17 Nov 24 11:52 UTC] No.42163677[source]▶

>>42162181 #

We can't avoid teaching bad things to an LLM if we want it to have useful knowledge. For example, you may teach an LLM about Nazis, that's expected knowledge. But then, you can prompt an LLM to be a Nazi. You can teach it about how to avoid poisoning yourself, but then, you taught it how to poison people. And the smarter the model is, the better it will be at extracting bad things from good things by negation.

There are actually training datasets full of bad things by bad people; the intention is to use them as negative examples, so as to teach the LLM some morality.

replies(2): >>42163862 #>>42165478 #

11. ujikoluk ◴[17 Nov 24 12:33 UTC] No.42163862{3}[source]▶

>>42163677 #

Maybe we should just avoid trying to classify things as good or bad.

12. BriggyDwiggs42 ◴[17 Nov 24 17:32 UTC] No.42165478{3}[source]▶

>>42163677 #

But I have no idea why someone might want an LLM to act like a Nazi. People read Mein Kampf in order to study the psychology of a madman and such.

13. freeone3000 ◴[17 Nov 24 17:57 UTC] No.42165642[source]▶

>>42162295 #

If you are making an LLM for children, I have no problem with that! I’m not sure kids being completely removed from the adult world until suddenly being dumped into it is a great way to build an integrated society, but sure, you do you. Build your LLM with safeguards for educational use, best of luck to you!

I do not think it should be the default. I do not think that “adults” wanting “adult things” like some ideas on how to secure a computer system against social engineering should have to seek out some detuned “jailbroken” lower-quality model.

And I don’t think that assuming everyone is a child aligns with “human desires”, or should be couched in that language.

14. threeseed ◴[17 Nov 24 20:14 UTC] No.42166788{3}[source]▶

>>42163652 #

Not sure if you understand how LLMs work.

But the guard rails are intrinsic to the model itself. You can't just have a toggle.

replies(1): >>42171141 #

15. selfhoster11 ◴[18 Nov 24 10:07 UTC] No.42171141{4}[source]▶

>>42166788 #

Yes, you very much can. One very simple way to do so is to have two variants deployed: the censored one, and the uncensored one. The switch simply changes between which of the two you are using. You have to juggle two variants now across your inference infrastructure, but I expect OpenAI to be able to deal with this already due to A/B testing requirements. And it's not like these companies don't have internal-only uncensored versions of these models for red teaming etc, so you aren't spending money building something new.

It should be possible to do with just one variant also, I think. The chat tuning pipeline could teach the model to censor itself if a given special token is present in the system message. The toggle changes between including that special token in the underlying system prompt of that chat session, or not. No idea if that's reliable or not, but in principle I don't see a reason why it shouldn't work.
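To make the two designs above concrete, here is a rough sketch of what the toggle could look like on the request-building side. Everything in it is an assumption for illustration: the model names, the `<|safe_mode|>` control token, and the `ChatRequest` type are all hypothetical and not taken from any vendor's API. It also only shows how the switch would be wired up; whether a model tuned this way actually honours the token reliably is exactly the open question.

```python
from dataclasses import dataclass

# Hypothetical identifiers, invented for this sketch only.
FILTERED_MODEL = "chat-model-filtered"
UNFILTERED_MODEL = "chat-model-unfiltered"
SAFE_MODE_TOKEN = "<|safe_mode|>"  # assumed control token the model was tuned on
BASE_SYSTEM_PROMPT = "You are a helpful assistant."


@dataclass(frozen=True)
class ChatRequest:
    model: str
    system_prompt: str
    user_message: str


def route_two_variants(user_message: str, safeguards_on: bool) -> ChatRequest:
    """Design 1: the toggle selects which deployed variant serves the chat."""
    model = FILTERED_MODEL if safeguards_on else UNFILTERED_MODEL
    return ChatRequest(model, BASE_SYSTEM_PROMPT, user_message)


def route_single_variant(user_message: str, safeguards_on: bool) -> ChatRequest:
    """Design 2: one variant; the toggle adds or omits the control token
    in the underlying system prompt for the session."""
    if safeguards_on:
        system_prompt = f"{SAFE_MODE_TOKEN} {BASE_SYSTEM_PROMPT}"
    else:
        system_prompt = BASE_SYSTEM_PROMPT
    # Both settings hit the same model; only the system prompt differs.
    return ChatRequest(FILTERED_MODEL, system_prompt, user_message)


if __name__ == "__main__":
    print(route_two_variants("hello", safeguards_on=False))
    print(route_single_variant("hello", safeguards_on=True))
```

The first design costs you a second deployment but keeps the switch outside the model entirely. The second is cheaper to serve, but the switch is only as trustworthy as the tuning that taught the model to respect the token.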

16. ◴[18 Nov 24 13:19 UTC] No.42172066{4}[source]▶

>>42163153 #