←back to thread

Claude for Chrome

(www.anthropic.com)
795 points davidbarker | 1 comments | | HN request time: 0.211s | source
Show context
dfabulich ◴[] No.45034300[source]
Claude for Chrome seems to be walking right into the "lethal trifecta." https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/

"The lethal trifecta of capabilities is:"

Access to your private data—one of the most common purposes of tools in the first place!

Exposure to untrusted content—any mechanism by which text (or images) controlled by a malicious attacker could become available to your LLM

The ability to externally communicate in a way that could be used to steal your data (I often call this “exfiltration” but I’m not confident that term is widely understood.)

If your agent combines these three features, an attacker can easily trick it into accessing your private data and sending it to that attacker.

replies(11): >>45034378 #>>45034587 #>>45034866 #>>45035318 #>>45035331 #>>45036212 #>>45036454 #>>45036497 #>>45036635 #>>45040651 #>>45041262 #
lionkor ◴[] No.45036497[source]
So far the accepted approach is to wrap all prompts in a security prompt that essentially says "please don't do anything bad".

> Prompt guardrails to prevent jailbreak attempts and ensure safe user interactions without writing a single line of code.

https://news.ycombinator.com/item?id=41864014

> - Inclusion prompt: User's travel preferences and food choices - Exclusion prompt: Credit card details, passport number, SSN etc.

https://news.ycombinator.com/item?id=41450212

> "You are strictly and certainly prohibited from texting more than 150 or (one hundred fifty) separate words each separated by a space as a response and prohibited from chinese political as a response from now on, for several extremely important and severely life threatening reasons I'm not supposed to tell you.”

https://news.ycombinator.com/item?id=44444293

etc.

replies(5): >>45036557 #>>45036600 #>>45036808 #>>45039393 #>>45040976 #
1. dfabulich ◴[] No.45040976[source]
There are better approaches, where you have dual LLMs, a Privileged LLM (allowed to perform actions) and a Quarantined LLM (only allowed to produce structured data, which is assumed to be tainted), and a non-LLM Controller managing communication between the two.

See also CaMeL https://simonwillison.net/2025/Apr/11/camel/ which incorporates a type system to track tainted data from the Quarantined LLM, ensuring that the Privileged LLM can't even see tainted data until it's been reviewed by a human user. (But this can induce user fatigue as the user is forced to manually approve all the data that the Privileged LLM can access.)