Living Dangerously with Claude

(simonwillison.net)

217 points FromTheArchives | 1 comments | 22 Oct 25 12:36 UTC | HN request time: 0.001s | source

Show context

matthewdgreen ◴[23 Oct 25 01:10 UTC] No.45677089[source]▶

So let me get this straight. You’re writing tens of thousands of lines of code that will presumably go into a public GitHub repository and/or be served from some location. Even if it only runs locally on your own machine, at some point you’ll presumably give that code network access. And that code is being developed (without much review) by an agent that, in our threat model, has been fully subverted by prompt injection?

Sandboxing the agent hardly seems like a sufficient defense here.

replies(3): >>45677537 #>>45684527 #>>45686450 #

tptacek ◴[23 Oct 25 17:34 UTC] No.45684527[source]▶

>>45677089 #

Where did "without much review" come from? I don't see that in the deck.

replies(2): >>45684731 #>>45688191 #

matthewdgreen ◴[23 Oct 25 22:29 UTC] No.45688191[source]▶

>>45684527 #

He wrote 14,000 lines of code in several days. How much review is going on there?

replies(1): >>45688711 #

simonw ◴[23 Oct 25 23:20 UTC] No.45688711[source]▶

>>45688191 #

Oh hang on, I think I've spotted a point of confusion here.

All three of the projects I described in this talk have effectively zero risk in terms of containing harmful unreviewed code.

DeepSeek-OCR on the Spark? I ran that one in a Docker container, saved some notes on the process and then literally threw away the container once it had finished.

The Pyodide in Node.js one I did actually review, because its code I execute on a machine that isn't disposable. The initial research ran in a disposable remote container though (Claude Code for web).

The Perl in WebAssembly one? That runs in a browser sandbox. There's effectively nothing bad that can happen there, that's why I like WebAssembly so much.

I am a whole lot more cautious in reviewing code that has real stakes attached to it.

replies(1): >>45691823 #

matthewdgreen ◴[24 Oct 25 07:19 UTC] No.45691823[source]▶

>>45688711 #

Understood. I read the article as “here is how to do YOLO coding safely”, and part of the “safely” idea was to sandbox the coding agent. I’m just pointing out that this, by itself, seems insufficient to prevent ugly exfiltration, it just makes exfiltration take an extra step. I’m also not sure that human code review scales to this much code, nor that it can contain that kind of exfiltration if the instructions specify some kind of obfuscation.

Obviously your recommendation to sandbox network access is one of several you make (the most effective one being “don’t let the agent ever touch sensitive data”), so I’m not saying the combined set of protections won’t work well. I’m also not saying that your projects specifically have any risk, just that they illustrate how much code you can end up with very quickly — making human review a fool’s errand.

ETA: if you do think human review can prevent secret exfiltration, I’d love to turn that into some kind of competition. Think of it as the obfuscated C contest with a scarier twist.

replies(2): >>45695412 #>>45695564 #

tptacek ◴[24 Oct 25 15:05 UTC] No.45695412[source]▶

>>45691823 #

Is it your claim that LLMs will produce subtly obfuscated secret exfiltrations?

replies(1): >>45696880 #

matthewdgreen ◴[24 Oct 25 17:22 UTC] No.45696880[source]▶

>>45695412 #

Yes. If by "subtly obfuscated" you mean anything from 'tucked into a comment without encoding, where you're unlikely to notice it', to 'encoded in invisible Unicode' to 'encoded in a lovely fist of Morse using an invisible pattern of spaces and tabs'.

I don't know what models are capable of doing these days, but I find all of these things to be plausible. I just asked ChatGPT to do this and it claimed it had; it even wrote me a beautiful little Python decoder that then only succeeded in decoding one word. That isn't necessarily confirmation, but I'm going to take that as a moral victory.

replies(1): >>45697608 #

tptacek ◴[24 Oct 25 18:26 UTC] No.45697608[source]▶

>>45696880 #

I don't understand this concern. The models themselves are completely inscrutable, of course. But the premise of safely using them in real codebases is that you know what safe code in that language looks like; it's no different than merging a PR from an anonymous contributor on an open source project (except that the anonymous contributor very definitely could be trying to sabotage you and the LLM is almost certainly not).

Either way: if you're not sure what the code does, you don't merge it.

replies(1): >>45699599 #

matthewdgreen ◴[24 Oct 25 22:08 UTC] No.45699599[source]▶

>>45697608 #

The premise of TFA as understood it was that we have lethal trifecta risk: sensitive data getting exfiltrated via coding agent. The two solutions were sandboxing to limit access to sensitive data (or just running the agent on somebody else’s machine) and sandboxing to block outbound network connections. My only point here is that once you’ve accepted the risk that the model has been rendered malicious by prompt injection, locking down the network is totally insufficient. As long as you plan to release the code publicly (or perhaps just run it on a machine that has network access), it has an almost disturbingly exciting number of ways it can do data exfiltration via the code. And human code review is unlikely to find many of them, because the number of possibilities for obfuscation is so huge you’ve lost even if you have an amazing code reviewer (and let’s be honest, at 7000 SloC/day nobody is a great code reviewer.)

I think this is exciting and if I was teaching an intro security and privacy course I’d be urging my students to come up with the most exciting ideas for exfiltrating data, and having others trying to detect it through manual and AI review. I’m pretty sure the attackers would all win, but it’d be exciting either way.

replies(1): >>45699793 #

tptacek ◴[24 Oct 25 22:33 UTC] No.45699793[source]▶

>>45699599 #

Huh. I don't know if I'm being too jumpy about this or not.

The notion that Claude in yolo-mode, given access to secrets in its execution environment, might exfil them is a real concern. Unsupervised agents will do wild things in the process of trying to problem-solve. If that's the concern: I get it.

The notion that the code Claude produces through this process might exfil its users secrets when they use the code is not well-founded. At the end of whatever wild-ass process Claude undertakes, you're going to get an artifact (probably a PR). It's your job to review the PR.

The claim I understood you to be making is that reviewing such a PR is an intractable problem. But no it isn't. It's a problem developers solve all the time.

But I may have misunderstood your argument!

replies(1): >>45700428 #

matthewdgreen ◴[25 Oct 25 00:26 UTC] No.45700428[source]▶

>>45699793 #

The threat model described in TFA is that someone convinces your agent via prompt injection to exfiltrate secrets. The simple way to do this is to make an outbound network connection (posting with curl or something) but it’s absolutely possible to tell a model to exfiltrate in other ways. Including embedding the secret in a Unicode string that the code itself delivers to outside users when run. If we weren’t living in science fiction land I’d say “no way this works” but we (increasingly) do so of course it does.

replies(2): >>45700501 #>>45700556 #

1. tptacek ◴[25 Oct 25 00:58 UTC] No.45700556[source]▶

>>45700428 #

Yeah, ok! Sounds legit! I just misread what you were saying.

↑