←back to thread

171 points abirag | 1 comments | | HN request time: 0.214s | source
Show context
tadfisher ◴[] No.45308140[source]
Is anyone working on the instruction/data-conflation problem? We're extremely premature in hooking up LLMs to real data sources and external functions if we can't keep them from following instructions in the data. Notion in particular shows absolutely zero warnings to end users, and encourages them to connect GitHub, GMail, Jira, etc. to the model. At this point it's basically criminal to treat this as a feature of a secure product.
replies(5): >>45308229 #>>45309698 #>>45310081 #>>45310871 #>>45315110 #
mcapodici ◴[] No.45309698[source]
The way you worded tbat is good and got me thinking.

What if instead of just lots of text fed to an LLM we have a data structure with trusted and untrusted data.

Any response on a call to a web search or MCP is considered untrusted by default (tunable if you also wrote the MCP and trust it).

The you limit tbe operations on untrusted data to pure transformations, no side effects.

E.g. run an LLM to summarize, or remove whitespace, convert to float etc. All these done in a sandbox without network access.

For example:

"Get me all public github issues on this repo, summarise and store in this DB."

Although the command reads public information untrusted and has DB access it will only process the untrusted information in a tight sandbox and so this can be done securely. I think!

replies(2): >>45311866 #>>45313574 #
1. simonw ◴[] No.45313574[source]
"Get me all public github issues on this repo, summarise and store in this DB."

Yes, this can be done safely.

If you think of it through the "lethal trifecta" framing, to stay safe from data stealing attacks you need to avoid having all three of exposure to untrusted content, exposure to private data and an exfiltration vector.

Here you're actually avoiding two out of them: - there's no private data (just public issue access) and no mechanism that can exfiltrate, so the worst a malicious instruction can do is cause incorrect data to rewritten to your database.

You have to be careful when designing that sandboxed database tool but that's not too hard too get right.