Hidden risk in Notion 3.0 AI agents: Web search tool abuse for data exfiltration

(www.codeintegrity.ai)

171 points abirag | 1 comments | 19 Sep 25 21:49 UTC | HN request time: 0.203s | source

Show context

tadfisher ◴[19 Sep 25 23:43 UTC] No.45308140[source]▶

Is anyone working on the instruction/data-conflation problem? We're extremely premature in hooking up LLMs to real data sources and external functions if we can't keep them from following instructions in the data. Notion in particular shows absolutely zero warnings to end users, and encourages them to connect GitHub, GMail, Jira, etc. to the model. At this point it's basically criminal to treat this as a feature of a secure product.

replies(5): >>45308229 #>>45309698 #>>45310081 #>>45310871 #>>45315110 #

simonw ◴[20 Sep 25 03:33 UTC] No.45310081[source]▶

>>45308140 #

We've been talking about this problem for three years and there's not been much progress in finding a robust solution.

Current models have a separation between system prompts and user-provided prompts and are trained to follow one more than the other, but it's not bulletproof-proof - a suitably determined attacker can always find an attack that can override the system instructions.

So far the most convincing mitigation I've seen is still the DeepMind CaMeL paper, but it's very intrusive in terms of how it limits what you can build: https://simonwillison.net/2025/Apr/11/camel/

replies(1): >>45311555 #

proto-n ◴[20 Sep 25 08:34 UTC] No.45311555[source]▶

>>45310081 #

I really don't see why it's not possible to just use basically a "highlighter" token which is added to all the authoritative instructions and not to data. Should be very fast for the model to learn it during rlhf or similar.

replies(1): >>45313141 #

1. hiatus ◴[20 Sep 25 13:20 UTC] No.45313141[source]▶

>>45311555 #

How would that work when models regularly access web content for more context, like looking up a tutorial and executing commands from it to install something?

↑