←back to thread

171 points abirag | 1 comments | | HN request time: 0.203s | source
Show context
tadfisher ◴[] No.45308140[source]
Is anyone working on the instruction/data-conflation problem? We're extremely premature in hooking up LLMs to real data sources and external functions if we can't keep them from following instructions in the data. Notion in particular shows absolutely zero warnings to end users, and encourages them to connect GitHub, GMail, Jira, etc. to the model. At this point it's basically criminal to treat this as a feature of a secure product.
replies(5): >>45308229 #>>45309698 #>>45310081 #>>45310871 #>>45315110 #
simonw ◴[] No.45310081[source]
We've been talking about this problem for three years and there's not been much progress in finding a robust solution.

Current models have a separation between system prompts and user-provided prompts and are trained to follow one more than the other, but it's not bulletproof-proof - a suitably determined attacker can always find an attack that can override the system instructions.

So far the most convincing mitigation I've seen is still the DeepMind CaMeL paper, but it's very intrusive in terms of how it limits what you can build: https://simonwillison.net/2025/Apr/11/camel/

replies(1): >>45311555 #
proto-n ◴[] No.45311555[source]
I really don't see why it's not possible to just use basically a "highlighter" token which is added to all the authoritative instructions and not to data. Should be very fast for the model to learn it during rlhf or similar.
replies(1): >>45313141 #
1. hiatus ◴[] No.45313141[source]
How would that work when models regularly access web content for more context, like looking up a tutorial and executing commands from it to install something?