←back to thread

645 points helloplanets | 1 comments | | HN request time: 0s | source
Show context
ec109685 ◴[] No.45005397[source]
It’s obviously fundamentally unsafe when Google, OpenAI and Anthropic haven’t released the same feature and instead use a locked down VM with no cookies to browse the web.

LLM within a browser that can view data across tabs is the ultimate “lethal trifecta”.

Earlier discussion: https://news.ycombinator.com/item?id=44847933

It’s interesting that in Brave’s post describing this exploit, they didn’t reach the fundamental conclusion this is a bad idea: https://brave.com/blog/comet-prompt-injection/

Instead they believe model alignment, trying to understand when a user is doing a dangerous task, etc. will be enough. The only good mitigation they mention is that the agent should drop privileges, but it’s just as easy to hit an attacker controlled image url to leak data as it is to send an email.

replies(7): >>45005444 #>>45005853 #>>45006130 #>>45006210 #>>45006263 #>>45006384 #>>45006571 #
snet0 ◴[] No.45005853[source]
> Instead they believe model alignment, trying to understand when a user is doing a dangerous task, etc. will be enough.

Maybe I have a fundamental misunderstanding, but I feel like hoping that model alignment and in-model guardrails are statistical preventions, ie you'll reduce the odds to some number of zeroes preceeding the 1. These things should literally never be able to happen, though. It's a fools errand to hope that you'll get to a model where there is no value in the input space that maps to <bad thing you really don't want>. Even if you "stack" models, having a safety-check model act on the output of your larger model, you're still just multiplying odds.

replies(5): >>45006201 #>>45006251 #>>45006358 #>>45007218 #>>45007846 #
anzumitsu ◴[] No.45006358[source]
To play devils advocate, isn’t any security approach fundamentally statistical because we exist in the real world, not the abstract world of security models, programming language specifications, and abstract machines? There’s always going to be a chance of a compiler bug, a runtime error, a programmer error, a security flaw in a processor, whatever.

Now, personally I’d still rather take the approach that at least attempts to get that probability to zero through deterministic methods than leave it up to model alignment. But it’s also not completely unthinkable to me that we eventually reach a place where the probability of a misaligned model is sufficiently low to be comparable to the probability of an error occurring in your security model.

replies(4): >>45006455 #>>45007043 #>>45007799 #>>45016698 #
1. ec109685 ◴[] No.45006455[source]
The fact that every single system prompt has been leaked despite guidelines to the LLM that it should protect it, shows that without “physical” barriers, you are aren’t providing any security guarantees.

A user of chrome can know, barring bugs that are definitively fixable, that a comment on a reddit post can’t read information from their bank.

If an LLM with user controlled input has access to both domains, it will never be secure until alignment becomes perfect, which there is no current hope to achieve.

And if you think about a human in the driver seat instead of an LLM trying to make these decisions, it’d be easy for a sophisticated attacker to trick humans to leak data, so it’s probably impossible to align it this way.