
54 points by amai | 3 comments
freeone3000 No.42161812
I find it very interesting that “aligning with human desires” somehow includes prevention of a human trying to bypass the safeguards to generate “objectionable” content (whatever that is). I think the “safeguards” are a bigger problem with aligning with my desires.
replies(4): >>42162124 >>42162181 >>42162295 >>42162664
threeseed No.42162295
The safeguards stem from a desire to make tools like Claude accessible to a very wide audience, since use cases such as education are very important.

And so people such as yourself who have an issue with the safeguards should seek out LLMs catered to adult audiences rather than trying to remove safeguards entirely.

replies(3): >>42162675 >>42163652 >>42165642
1. selfhoster11 No.42163652
Here is a revolutionary concept: give the users a toggle.

Make it controllable by an IT department if logging in with an organisation-tied account, but give people a choice.

replies(1): >>42166788
2. threeseed No.42166788
Not sure if you understand how LLMs work.

But the guard rails are intrinsic to the model itself. You can't just have a toggle.

replies(1): >>42171141
3. selfhoster11 No.42171141
Yes, you very much can. One very simple way is to have two variants deployed: a censored one and an uncensored one. The switch simply changes which of the two you are using. You now have to juggle two variants across your inference infrastructure, but I expect OpenAI can already deal with this due to A/B testing requirements. And it's not like these companies don't have internal-only uncensored versions of these models for red teaming etc., so you aren't spending money building something new.
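
A minimal sketch of that two-variant routing, assuming an OpenAI-style chat client; the model names and the `moderated` flag are made up for illustration:

    from openai import OpenAI

    client = OpenAI()

    MODERATED_MODEL = "model-moderated"      # hypothetical: the safety-tuned deployment
    UNMODERATED_MODEL = "model-unmoderated"  # hypothetical: the internal/red-team deployment

    def chat(messages, moderated=True):
        # The toggle just routes the request to one of the two deployments.
        model = MODERATED_MODEL if moderated else UNMODERATED_MODEL
        return client.chat.completions.create(model=model, messages=messages)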

It should also be possible with just one variant, I think. The chat-tuning pipeline could teach the model to censor itself whenever a given special token is present in the system message. The toggle then controls whether that special token is included in the underlying system prompt of the chat session. No idea how reliable that would be, but in principle I don't see a reason why it shouldn't work.
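
As a sketch, assuming the tuning pipeline really has taught the model to honour such a token (the token string and helper are hypothetical):

    SAFE_TOKEN = "<|moderated|>"  # hypothetical control token learned during chat tuning

    def build_system_prompt(base_prompt, moderated):
        # Include the control token when the toggle is on so the model
        # self-censors; omit it entirely when the toggle is off.
        if moderated:
            return SAFE_TOKEN + "\n" + base_prompt
        return base_prompt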