146 points jakozaur | 6 comments

simonw No.45670650
If you can get malicious instructions into the context of even the most powerful reasoning LLMs in the world, you'll still be able to trick them into outputting vulnerable code like this if you try hard enough.

I don't think the fact that small models are easier to trick is particularly interesting from a security perspective, because you need to assume that ANY model can be prompt injected by a suitably motivated attacker.

On that basis I agree with the article that we need to be using additional layers of protection that work against compromised models, such as robust sandboxed execution of generated code and maybe techniques like static analysis too (I'm less sold on those; I expect plenty of malicious vulnerabilities could sneak past them).
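
As a rough sketch of what I mean by sandboxed execution (the image name, resource limits, and file layout here are illustrative assumptions, not any particular product's implementation): run the generated code in a throwaway container with no network and a read-only filesystem, so even fully compromised output can only touch what the container can touch.

    import os
    import subprocess
    import tempfile

    # Stand-in for untrusted, model-generated code.
    generated_code = 'print("hello from the sandbox")'

    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "snippet.py")
        with open(script, "w") as f:
            f.write(generated_code)

        # Execute in a locked-down, throwaway Docker container.
        result = subprocess.run(
            [
                "docker", "run", "--rm",
                "--network", "none",          # no outbound network
                "--read-only",                # immutable root filesystem
                "--cap-drop", "ALL",          # drop all Linux capabilities
                "--memory", "256m",           # cap memory usage
                "--pids-limit", "64",         # cap process count
                "-v", f"{workdir}:/work:ro",  # mount the snippet read-only
                "python:3.12-slim",
                "python", "/work/snippet.py",
            ],
            capture_output=True, text=True, timeout=30,
        )
        print(result.stdout)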

Coincidentally I gave a talk about sandboxing coding agents last night: https://simonwillison.net/2025/Oct/22/living-dangerously-wit...

replies(3): >>45671268, >>45671294, >>45673229
knowaveragejoe No.45671268
Is there any chance your talk was recorded?
replies(1): >>45671321
1. simonw No.45671321
It wasn't, but the written version of it is actually better than what I said in the room (since I got to think a little bit harder and add relevant links).
replies(1): >>45673009
2. semi-extrinsic No.45673009
IIUC your talk "just" suggests using sandbox-exec on Mac, which (as you point out) is sadly labeled as deprecated.

Is that really the best solution the world has to offer in 2025? LLMs aside, there is a whole host of supply-chain risks that would be mitigated by deploying convenient, strong sandboxes everywhere.

replies(1): >>45673573
3. simonw No.45673573
My preferred solutions right now:

1. A sandbox on someone else's computer: Claude Code for web, Codex Cloud, Gemini Jules, GitHub Codespaces, ChatGPT/Claude Code Interpreter.

2. A Docker container. I think these are robust enough to be safe.

3. sandbox-exec-related tricks. I haven't poked hard enough at Claude Code's new sandbox-exec sandbox yet (they only released it on Monday). OpenAI Codex CLI was using sandbox-exec too last time I looked, but again, I haven't reviewed it enough to be comfortable with it. A rough sketch of what a sandbox-exec policy looks like follows after this list.

I'm hoping more credible options come along for the sandboxing problem.
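
To make option 3 concrete, here's a hypothetical sketch of the kind of deny-by-default Seatbelt (SBPL) policy that sandbox-exec evaluates, invoked from Python. The specific rules and the /private/tmp/agent-work path are assumptions for illustration; real profiles need many more allow rules (dyld caches, mach services, sysctls) before nontrivial programs will run, and Apple labels sandbox-exec as deprecated.

    import subprocess

    # Hypothetical deny-by-default Seatbelt (SBPL) policy, illustrative only.
    PROFILE = """
    (version 1)
    (deny default)
    (allow process-exec)     ; allow running the target binary
    (allow process-fork)     ; allow the target to spawn children
    (allow file-read*)       ; reads allowed anywhere
    (allow file-write* (subpath "/private/tmp/agent-work"))  ; writes confined here
    (deny network*)          ; already implied by (deny default); spelled out for clarity
    """

    # sandbox-exec accepts an inline profile string via -p.
    subprocess.run(
        ["sandbox-exec", "-p", PROFILE, "/bin/echo", "inside the sandbox"],
        check=True,
    )

That's roughly the closest macOS-native equivalent to option 2's container isolation.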

replies(2): >>45673945, >>45676047
4. knowaveragejoe No.45673945
If I understand correctly, Claude Code will (shortly, if not already) make use of Anthropic's sandbox that wraps Seatbelt on macOS, rather than sandbox-exec directly?

It's cool that they made this open source. It seems straightforward and useful enough that it could be used on its own for sandboxing purposes.

https://docs.claude.com/en/docs/claude-code/sandboxing

https://github.com/anthropic-experimental/sandbox-runtime

replies(1): >>45674495
5. simonw No.45674495
Yeah, they shipped that feature on Monday; you can access it via the /sandbox command. I haven't put it through its paces enough to get a feel for whether I trust it yet, though.
6. mentalgear No.45676047
I found Vibekit's approach (open source: https://docs.vibekit.sh/sdk) of letting you choose your own sandboxing solution for any coding CLI the most flexible. It also works with openCode and with local or cloud sandboxes! A really high-quality piece of software that more devs should know about. I'm surprised Simon hasn't tried it yet.