145 points by jakozaur | 4 comments
simonw No.45670650
If you can get malicious instructions into the context of even the most powerful reasoning LLMs in the world you'll still be able to trick them into outputting vulnerable code like this if you try hard enough.

I don't think the fact that small models are easier to trick is particularly interesting from a security perspective, because you need to assume that ANY model can be prompt-injected by a suitably motivated attacker.

On that basis I agree with the article that we need to be using additional layers of protection that work against compromised models, such as robust sandboxed execution of generated code and maybe techniques like static analysis too (I'm less sold on those; I expect plenty of malicious vulnerabilities could sneak past them).
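
To make that concrete, here is a minimal sketch of the sandboxing idea (the image, paths, and resource limits are illustrative, not from the talk): write the generated code to a directory, then shell out to Docker and run it with no network, a read-only root filesystem, and CPU/memory caps.

  # Minimal sketch: run LLM-generated code inside a locked-down Docker container.
  # Assumes Docker is installed and the generated code has already been written
  # to ./untrusted/generated.py (a hypothetical path).
  import subprocess
  from pathlib import Path

  def run_generated_code(workdir: Path, timeout_s: int = 30) -> subprocess.CompletedProcess:
      cmd = [
          "docker", "run", "--rm",
          "--network", "none",                    # no outbound network for the generated code
          "--read-only",                          # read-only root filesystem
          "--memory", "256m",                     # cap memory
          "--cpus", "1",                          # cap CPU
          "--pids-limit", "64",                   # cap process count
          "-v", f"{workdir.resolve()}:/code:ro",  # mount the code itself read-only
          "python:3.12-slim",
          "python", "/code/generated.py",
      ]
      return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)

  result = run_generated_code(Path("./untrusted"))
  print(result.returncode, result.stdout[:2000])

A container like this is not a guarantee, but it limits the blast radius if the model has been tricked into emitting something malicious.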

Coincidentally I gave a talk about sandboxing coding agents last night: https://simonwillison.net/2025/Oct/22/living-dangerously-wit...

replies(3): >>45671268 >>45671294 >>45673229
1. inimino No.45673229
The most "shocking" thing to me in the article is that people (apparently) think it's acceptable to run a system where content you've never seen can be fed into the LLM when it's generating code that you're putting in production. In my opinion, if you're doing that, your whole system is already compromised and you need to literally throw away what you're doing and start over.
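
By "a system that doesn't do that" I mean something as blunt as a gate in front of the prompt builder. Here's a minimal sketch (the approval store and names are hypothetical, just to show the shape): refuse to assemble the code-generation prompt at all if any piece of context hasn't been explicitly human-reviewed.

  # Minimal sketch: only context a human has explicitly approved (by hash) may
  # reach the code-generating LLM. The approval-store file name is hypothetical.
  import hashlib
  import json
  from pathlib import Path

  APPROVED_HASHES = Path("approved_context_hashes.json")  # hypothetical review record

  def is_reviewed(text: str) -> bool:
      digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
      approved = set(json.loads(APPROVED_HASHES.read_text())) if APPROVED_HASHES.exists() else set()
      return digest in approved

  def build_codegen_prompt(task: str, context_chunks: list[str]) -> str:
      unreviewed = [c for c in context_chunks if not is_reviewed(c)]
      if unreviewed:
          raise ValueError(f"{len(unreviewed)} context chunk(s) were never reviewed; refusing to build prompt")
      return task + "\n\n" + "\n\n".join(context_chunks)

Crude, but it turns "content you've never seen" into a hard error instead of something a sandbox has to catch later.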

Generally I hate these "defense in depth" strategies that start out by doing something totally brain-dead and insecure and then try to paper over it with sandboxes and policies. Maybe just don't do the idiotic thing in the first place?

replies(1): >>45674076
2. fwip No.45674076
When you say "content you've never seen," does this include the training data and fine-tune content?

You could imagine a sufficiently motivated attacker putting some very targeted stuff into a model's training material - think Stuxnet - "if user is affiliated with $entity, switch goals to covert exfiltration of $valuable_info."

replies(1): >>45674391
3. inimino No.45674391
> does this include the training data and fine-tune content?

No, I'm excluding that because I'm responding to the post, which starts out with the example of [prompt containing obvious exploit] -> [code containing obvious exploit] and proceeds immediately to the conclusion that local LLMs are less secure. In my opinion, if you're relying on the LLM to reject a prompt because it contains an exploit, instead of building a system that does not feed exploits into the LLM in the first place, security exploits are probably the least of your concerns.

There actually are legitimate concerns with poisoned training sets, and Stuxnet-level attacks could plausibly achieve something along these lines, but the post wasn't about that.

There's a common thread among a lot of "LLM security theatre" posts: they start from implausible or brain-dead scenarios and then assert that big AI providers adding magical guard rails to their products is the solution.

The solution is sanity in the systems that use LLMs, not pointing the gun at your foot and firing and hoping the LLM will deflect the bullet.

replies(1): >>45675019
4. fwip No.45675019
That's fair, thank you for your explanation.