
54 points | amai | 1 comment | source
freeone3000 No.42161812
I find it very interesting that “aligning with human desires” somehow includes prevention of a human trying to bypass the safeguards to generate “objectionable” content (whatever that is). I think the “safeguards” are a bigger problem with aligning with my desires.
replies(4): >>42162124 #>>42162181 #>>42162295 #>>42162664 #
threeseed No.42162295
The safeguards stem from a desire to make tools like Claude accessible to a very wide audience, since use cases such as education are very important.

And so it seems that people such as yourself who do have an issue with safeguards should seek out LLMs catered to adult audiences, rather than trying to remove safeguards entirely.

replies(3): >>42162675 #>>42163652 #>>42165642 #
Zambyte No.42162675
How does making it harder for the user to extract information they are trying to extract make it safer for a wider audience?
replies(2): >>42162977 #>>42163153 #
dbspin No.42162977
Assuming that this question is good faith...

There are numerous things that might be true yet damaging for a child's development to be exposed to: from overly punitive criticism, to graphic depictions of violence, to advocacy and specific directions for self-harm. Countless examples are trivial to generate.

Similarly, the use of these tools is already having dramatic effects on spearphishing, misinformation, etc. Guardrails on all the non-open-source models have an enormous impact on slowing and limiting the damage this does at scale. Even with retrained Llama-based models, it's more difficult than you might imagine to create a truly Machiavellian or uncensored LLM, which is entirely due to the work that's been done during and after training to constrain those behaviours. This is an unalloyed good in constraining the weaponisation of LLMs.