←back to thread

645 points helloplanets | 1 comments | | HN request time: 0.367s | source
Show context
ec109685 ◴[] No.45005397[source]
It’s obviously fundamentally unsafe when Google, OpenAI and Anthropic haven’t released the same feature and instead use a locked down VM with no cookies to browse the web.

LLM within a browser that can view data across tabs is the ultimate “lethal trifecta”.

Earlier discussion: https://news.ycombinator.com/item?id=44847933

It’s interesting that in Brave’s post describing this exploit, they didn’t reach the fundamental conclusion this is a bad idea: https://brave.com/blog/comet-prompt-injection/

Instead they believe model alignment, trying to understand when a user is doing a dangerous task, etc. will be enough. The only good mitigation they mention is that the agent should drop privileges, but it’s just as easy to hit an attacker controlled image url to leak data as it is to send an email.

replies(7): >>45005444 #>>45005853 #>>45006130 #>>45006210 #>>45006263 #>>45006384 #>>45006571 #
snet0 ◴[] No.45005853[source]
> Instead they believe model alignment, trying to understand when a user is doing a dangerous task, etc. will be enough.

Maybe I have a fundamental misunderstanding, but I feel like hoping that model alignment and in-model guardrails are statistical preventions, ie you'll reduce the odds to some number of zeroes preceeding the 1. These things should literally never be able to happen, though. It's a fools errand to hope that you'll get to a model where there is no value in the input space that maps to <bad thing you really don't want>. Even if you "stack" models, having a safety-check model act on the output of your larger model, you're still just multiplying odds.

replies(5): >>45006201 #>>45006251 #>>45006358 #>>45007218 #>>45007846 #
anzumitsu ◴[] No.45006358[source]
To play devils advocate, isn’t any security approach fundamentally statistical because we exist in the real world, not the abstract world of security models, programming language specifications, and abstract machines? There’s always going to be a chance of a compiler bug, a runtime error, a programmer error, a security flaw in a processor, whatever.

Now, personally I’d still rather take the approach that at least attempts to get that probability to zero through deterministic methods than leave it up to model alignment. But it’s also not completely unthinkable to me that we eventually reach a place where the probability of a misaligned model is sufficiently low to be comparable to the probability of an error occurring in your security model.

replies(4): >>45006455 #>>45007043 #>>45007799 #>>45016698 #
1. QuadmasterXLII ◴[] No.45007043[source]
It’s often probabilistic- for example I can guess your six digit verification code exactly 1 in a million times, and if I 1 in a million lucky I can do something naughty once.

The problem with llm security is that if only 1 in a million prompts break claude and make it leak email, if I get lucky and find the golden ticket I can replay it on everyone using that model.

also, no one knows the probability a priory, unlike the code, but practically its more like 1 in 100 at best