
154 points abirag | 10 comments
1. tadfisher ◴[] No.45308140[source]
Is anyone working on the instruction/data-conflation problem? We're extremely premature in hooking up LLMs to real data sources and external functions if we can't keep them from following instructions in the data. Notion in particular shows absolutely zero warnings to end users, and encourages them to connect GitHub, GMail, Jira, etc. to the model. At this point it's basically criminal to treat this as a feature of a secure product.
replies(4): >>45308229 #>>45309698 #>>45310081 #>>45310871 #
2. abirag ◴[] No.45308229[source]
Hey, I’m the author of this exploit. At CodeIntegrity.ai, we’ve built a platform that visualizes the control flows and data flows of an agentic AI system connected to tools so you can accurately assess each risk. We also provide runtime guardrails that give you control over each of these flows based on your risk tolerance.

Feel free to email me at abi@codeintegrity.ai — happy to share more

3. mcapodici ◴[] No.45309698[source]
The way you worded that is good and got me thinking.

What if, instead of just lots of text fed to an LLM, we had a data structure with trusted and untrusted data?

Any response from a web search or MCP call is considered untrusted by default (tunable if you also wrote the MCP server and trust it).

Then you limit the operations on untrusted data to pure transformations, no side effects.

E.g. run an LLM to summarize, remove whitespace, convert to float, etc. All of these are done in a sandbox without network access.

For example:

"Get me all public github issues on this repo, summarise and store in this DB."

Although the command reads untrusted public information and has DB access, it will only process the untrusted information in a tight sandbox, so this can be done securely. I think!
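A rough sketch of what that could look like (all the names here are made up, and the purity of the transform would need real enforcement, e.g. a separate process with no network, not just convention):

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Trusted:
        value: Any      # e.g. text the operator typed themselves

    @dataclass
    class Untrusted:
        value: Any      # e.g. a web page, an MCP tool response, issue text

    def pure_transform(fn: Callable[[Any], Any], data: Untrusted) -> Untrusted:
        # fn must be pure: no network, no file I/O, no tool calls.
        # In a real system this runs inside a sandboxed worker.
        return Untrusted(fn(data.value))

    def store_rows(db, rows: Untrusted) -> None:
        # Side-effecting sink: allowed to write untrusted bytes as inert
        # data, never to treat them as instructions or SQL.
        for title, summary in rows.value:
            db.execute("INSERT INTO issues (title, summary) VALUES (?, ?)",
                       (title, summary))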

replies(2): >>45311866 #>>45313574 #
4. simonw ◴[] No.45310081[source]
We've been talking about this problem for three years and there's not been much progress in finding a robust solution.

Current models have a separation between system prompts and user-provided prompts and are trained to follow one more than the other, but it's not bulletproof - a suitably determined attacker can always find an attack that overrides the system instructions.

So far the most convincing mitigation I've seen is still the DeepMind CaMeL paper, but it's very intrusive in terms of how it limits what you can build: https://simonwillison.net/2025/Apr/11/camel/
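Roughly (and this is a loose paraphrase, not the paper's actual design): a privileged model only ever sees the trusted user request and produces a plan, while a quarantined model reads the untrusted content but has no tool access, and its output is only ever handled as opaque data. A minimal sketch with placeholder functions:

    def call_llm_no_tools(prompt: str) -> str:
        # Placeholder for a model call with every tool disabled.
        return "summary: " + prompt[:80]

    def privileged_plan(user_request: str) -> list[dict]:
        # In CaMeL this step is code generated by the privileged LLM and run
        # by a restricted interpreter; hard-coded here just to show the shape.
        return [
            {"op": "fetch_issues", "repo": "example/repo"},
            {"op": "summarize_each"},   # delegated to the quarantined model
            {"op": "store_db", "table": "issue_summaries"},
        ]

    def quarantined_summarize(untrusted_text: str) -> str:
        # This call can be fully prompt-injected, but it cannot reach any
        # tool: whatever it returns stays an opaque string.
        return call_llm_no_tools("Summarize:\n" + untrusted_text)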

replies(1): >>45311555 #
5. jrm4 ◴[] No.45310871[source]
Is anyone working on the "allowing non-root users to run executable code" problem?

well then

6. proto-n ◴[] No.45311555[source]
I really don't see why it's not possible to just use a "highlighter" token which is added to all the authoritative instructions and not to the data. It should be very fast for the model to learn during RLHF or similar.
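As a sketch of what that might look like at the embedding layer (PyTorch, purely illustrative; whether training actually makes the model ignore instructions in trust=0 spans is exactly the open question):

    import torch
    import torch.nn as nn

    class TrustTaggedEmbedding(nn.Module):
        def __init__(self, vocab_size: int, d_model: int):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            self.trust = nn.Embedding(2, d_model)   # 0 = data, 1 = instruction

        def forward(self, token_ids, trust_ids):
            # token_ids, trust_ids: (batch, seq_len). The trust channel
            # travels with every token, so the model can in principle learn
            # to only follow spans marked trust = 1.
            return self.tok(token_ids) + self.trust(trust_ids)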
replies(1): >>45313141 #
7. sebastiennight ◴[] No.45311866[source]
You definitely do not need or want to give database access to an LLM-with-scaffolding system to execute the example you provided.

(by database access, I'm assuming you'd be planning to ask the LLM to write SQL code which this system would run)

Instead, you would ask your LLM to create an object containing the structured data about those GitHub issues (ID, title, description, timestamp, etc.) and then you would run a separate `storeGitHubIssues()` method that uses prepared statements to avoid SQL injection.
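Something like this, as a sketch (schema and names are made up; `store_github_issues` stands in for the `storeGitHubIssues()` method above):

    import sqlite3

    def store_github_issues(db_path: str, issues: list[dict]) -> None:
        # The LLM only produced the structured `issues` objects; this write
        # path is plain code, and parameterized queries mean issue text can
        # never become SQL, no matter what instructions it contains.
        conn = sqlite3.connect(db_path)
        with conn:
            conn.executemany(
                "INSERT INTO issues (id, title, description, created_at) "
                "VALUES (?, ?, ?, ?)",
                [(i["id"], i["title"], i["description"], i["created_at"])
                 for i in issues],
            )
        conn.close()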

replies(1): >>45312560 #
8. mcapodici ◴[] No.45312560{3}[source]
Yes this. What you said is what I meant.

You could also get the LLM to "vibe code" the SQL. This is somewhat dangerous as the LLM might make mistakes, but the main thing I am talking about here is how not to be "influenced" by text in data and so be susceptible to that sort of attack.

9. hiatus ◴[] No.45313141{3}[source]
How would that work when models regularly access web content for more context, like looking up a tutorial and executing commands from it to install something?
10. simonw ◴[] No.45313574[source]
"Get me all public github issues on this repo, summarise and store in this DB."

Yes, this can be done safely.

If you think of it through the "lethal trifecta" framing, to stay safe from data-stealing attacks you need to avoid having all three of: exposure to untrusted content, exposure to private data, and an exfiltration vector.

Here you're actually avoiding two of the three: there's no private data (just public issue access) and no mechanism that can exfiltrate, so the worst a malicious instruction can do is cause incorrect data to be written to your database.

You have to be careful when designing that sandboxed database tool, but that's not too hard to get right.
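Concretely, the safe shape of that pipeline looks something like this (sketch only; fetch_public_issues and summarize_no_tools are placeholders, not real APIs):

    def fetch_public_issues(repo: str) -> list[dict]:
        ...  # e.g. a GET against the public GitHub issues API

    def summarize_no_tools(text: str) -> str:
        ...  # an LLM call with all tools disabled

    def sync_issue_summaries(repo: str, db) -> None:
        for issue in fetch_public_issues(repo):
            summary = summarize_no_tools(issue["body"])   # may be injected, but inert
            db.execute(
                "INSERT INTO issue_summaries (issue_id, summary) VALUES (?, ?)",
                (issue["id"], summary),   # worst case: wrong text gets stored
            )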