Popular/hot comments

(www.anthropic.com)

Show context

dfabulich ◴[27 Aug 25 01:10 UTC] No.45034300[source]▶

Claude for Chrome seems to be walking right into the "lethal trifecta." https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/

"The lethal trifecta of capabilities is:"

• Access to your private data—one of the most common purposes of tools in the first place!

• Exposure to untrusted content—any mechanism by which text (or images) controlled by a malicious attacker could become available to your LLM

• The ability to externally communicate in a way that could be used to steal your data (I often call this “exfiltration” but I’m not confident that term is widely understood.)

If your agent combines these three features, an attacker can easily trick it into accessing your private data and sending it to that attacker.

replies(11): >>45034378 #>>45034587 #>>45034866 #>>45035318 #>>45035331 #>>45036212 #>>45036454 #>>45036497 #>>45036635 #>>45040651 #>>45041262 #

afarviral ◴[27 Aug 25 01:51 UTC] No.45034587[source]▶

>>45034300 #

How would you go about making it more secure but still getting to have your cake too? Off the top my head, could you: a) only ingest text that can be OCRd or somehow determine if it is human readable b) make it so text from the web session is isolated from the model with respect to triggering an action. Then it's simply a tradeoff at that point.

replies(3): >>45034626 #>>45035055 #>>45035249 #

1. kccqzy ◴[27 Aug 25 01:57 UTC] No.45034626[source]▶

>>45034587 #

I think Simon has proposed breaking the lethal trifecta by having two LLMs, where the first has access to untrusted data but cannot do any actions, and the second LLM has privileges but only abstract variables from the first LLM not the content. See https://simonwillison.net/2023/Apr/25/dual-llm-pattern/

It is rather similar to your option (b).

replies(2): >>45035209 #>>45035740 #

2. maximilianthe1 ◴[27 Aug 25 03:38 UTC] No.45035209[source]▶

>>45034626 (TP) #

Can't the attacker then jailbreak the first LLM to generate jailbreak with actions for the second one?

replies(3): >>45035244 #>>45036219 #>>45036522 #

3. arthurcolle ◴[27 Aug 25 03:45 UTC] No.45035244[source]▶

>>45035209 #

Yes they can

replies(1): >>45035434 #

4. ares623 ◴[27 Aug 25 04:22 UTC] No.45035434{3}[source]▶

>>45035244 #

Hmm so we need 3 LLMs

replies(1): >>45035849 #

5. pishpash ◴[27 Aug 25 05:26 UTC] No.45035740[source]▶

>>45034626 (TP) #

That's just an information bottleneck. It doesn't fundamentally change anything.

6. zwnow ◴[27 Aug 25 05:48 UTC] No.45035849{4}[source]▶

>>45035434 #

Doesn't help.

https://gandalf.lakera.ai/baseline

This thing models exactly these scenarios and asks you to break it, its still pretty easy. LLMs are not safe.

7. dfabulich ◴[27 Aug 25 06:47 UTC] No.45036219[source]▶

>>45035209 #

If you read the fine article, you'll see that the approach includes a non-LLM controller managing structured communication between the Privileged LLM (allowed to perform actions) and the Quarantined LLM (only allowed to produce structured data, which is assumed to be tainted).

See also CaMeL https://simonwillison.net/2025/Apr/11/camel/ which incorporates a type system to track tainted data from the Quarantined LLM, ensuring that the Privileged LLM can't even see tainted _data_ until it's been reviewed by a human user. (But this can induce user fatigue as the user is forced to manually approve all the data that the Privileged LLM can access.)

replies(1): >>45046231 #

8. j45 ◴[27 Aug 25 07:31 UTC] No.45036522[source]▶

>>45035209 #

One would have to be relatively invisible.

Non-deterministic security feels like a relatively new area.

9. yencabulator ◴[27 Aug 25 22:42 UTC] No.45046231{3}[source]▶

>>45036219 #

"Structured data" is kind of the wrong description for what Simon proposes. JSON is structured but can smuggle a string with the attack inside it. Simon's proposal is smarter than that.

↑