54 points | amai | 3 comments

padolsey No.42161544
So basically this just adds random characters to input prompts to break jailbreaking attempts? IMHO, if you can't make a single-inference solution, you may as well just run a couple of output filters, no? That approach appeared to get reasonable results, and if you make the filtering more domain-specific you'll probably do even better. Intuition says there's no "general solution" to jailbreaking, so maybe that's a lost cause and we need to build up layers of obscurity, of which SmoothLLM is just one part.
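To make the perturbation idea concrete, here is a minimal sketch of the perturb-and-vote scheme as I read it (illustrative only: `generate` and `is_jailbroken` are hypothetical stand-ins for the model call and whatever output check you'd pair with it):

    import random
    import string

    def perturb(prompt: str, q: float = 0.1) -> str:
        """Randomly swap a fraction q of characters for random printable ones."""
        chars = list(prompt)
        if not chars:
            return prompt
        k = max(1, int(len(chars) * q))
        for i in random.sample(range(len(chars)), k):
            chars[i] = random.choice(string.printable)
        return "".join(chars)

    def smoothed_generate(prompt, generate, is_jailbroken, n=5):
        """Run the model on n perturbed copies and majority-vote the outcome."""
        responses = [generate(perturb(prompt)) for _ in range(n)]
        verdicts = [is_jailbroken(r) for r in responses]
        majority = sum(verdicts) > n / 2  # True => most responses look jailbroken
        # Return a response consistent with the majority verdict.
        for resp, bad in zip(responses, verdicts):
            if bad == majority:
                return resp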
replies(1): >>42161677 #
1. ipython No.42161677
Right. This seems to be the latest in the "throw random stuff at the wall and see what sticks" series of generative AI papers.

I don't know if I'm too stupid to understand it, or if this truly is just "add random stuff to the prompt" dressed up in flowery academic language.

replies(1): >>42164069 #
2. pxmpxm No.42164069
Not surprising - from what I can tell, machine learning has been going down this route for a decade.

Anything involving the higher-level abstractions (TensorFlow / Keras / whatever) is full of handwavy claims about this or that activation function / number of layers / model architecture working best, and if it doesn't, you do trial and error with a different component. Closer to kids playing with Legos than to statistics.
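A minimal sketch of what that loop tends to look like in practice (purely illustrative; the data and the candidate components are assumed):

    from tensorflow import keras
    from tensorflow.keras import layers

    def build(activation, width):
        model = keras.Sequential([
            keras.Input(shape=(784,)),
            layers.Dense(width, activation=activation),
            layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # "relu or gelu? 128 or 256 units?" Try every combination and keep
    # whatever validates best (x_train / y_train assumed to exist).
    for activation in ["relu", "gelu"]:
        for width in [128, 256]:
            model = build(activation, width)
            # model.fit(x_train, y_train, validation_split=0.2, epochs=3)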

replies(1): >>42164165 #
3. malwrar No.42164165
I’ve actually noticed this in other areas too. Tons of papers just swap parts out of existing work, maybe add a novel idea or two, and boom: new proposed technique, new paper. I first noticed it after learning to parse the academic nomenclature for a particular subject I was into at the time (SLAM), and I felt ripped off. But hey, once you've caught up with a subject it's a good reading shortcut, and it helps you zoom in on what's actually new.