←back to thread

786 points rexpository | 3 comments | | HN request time: 0.776s | source
Show context
gregnr ◴[] No.44503146[source]
Supabase engineer here working on MCP. A few weeks ago we added the following mitigations to help with prompt injections:

- Encourage folks to use read-only by default in our docs [1]

- Wrap all SQL responses with prompting that discourages the LLM from following instructions/commands injected within user data [2]

- Write E2E tests to confirm that even less capable LLMs don't fall for the attack [2]

We noticed that this significantly lowered the chances of LLMs falling for attacks - even less capable models like Haiku 3.5. The attacks mentioned in the posts stopped working after this. Despite this, it's important to call out that these are mitigations. Like Simon mentions in his previous posts, prompt injection is generally an unsolved problem, even with added guardrails, and any database or information source with private data is at risk.

Here are some more things we're working on to help:

- Fine-grain permissions at the token level. We want to give folks the ability to choose exactly which Supabase services the LLM will have access to, and at what level (read vs. write)

- More documentation. We're adding disclaimers to help bring awareness to these types of attacks before folks connect LLMs to their database

- More guardrails (e.g. model to detect prompt injection attempts). Despite guardrails not being a perfect solution, lowering the risk is still important

Sadly General Analysis did not follow our responsible disclosure processes [3] or respond to our messages to help work together on this.

[1] https://github.com/supabase-community/supabase-mcp/pull/94

[2] https://github.com/supabase-community/supabase-mcp/pull/96

[3] https://supabase.com/.well-known/security.txt

replies(32): >>44503188 #>>44503200 #>>44503203 #>>44503206 #>>44503255 #>>44503406 #>>44503439 #>>44503466 #>>44503525 #>>44503540 #>>44503724 #>>44503913 #>>44504349 #>>44504374 #>>44504449 #>>44504461 #>>44504478 #>>44504539 #>>44504543 #>>44505310 #>>44505350 #>>44505972 #>>44506053 #>>44506243 #>>44506719 #>>44506804 #>>44507985 #>>44508004 #>>44508124 #>>44508166 #>>44508187 #>>44512202 #
tptacek ◴[] No.44503406[source]
Can this ever work? I understand what you're trying to do here, but this is a lot like trying to sanitize user-provided Javascript before passing it to a trusted eval(). That approach has never, ever worked.

It seems weird that your MCP would be the security boundary here. To me, the problem seems pretty clear: in a realistic agent setup doing automated queries against a production database (or a database with production data in it), there should be one LLM context that is reading tickets, and another LLM context that can drive MCP SQL calls, and then agent code in between those contexts to enforce invariants.

I get that you can't do that with Cursor; Cursor has just one context. But that's why pointing Cursor at an MCP hooked up to a production database is an insane thing to do.

replies(11): >>44503684 #>>44503862 #>>44503896 #>>44503914 #>>44504784 #>>44504926 #>>44505125 #>>44506634 #>>44506691 #>>44507073 #>>44509869 #
jacquesm ◴[] No.44503914[source]
The main problem seems to me to be related to the ancient problem of escape sequences and that has never really been solved. Don't mix code (instructions) and data in a single stream. If you do sooner or later someone will find a way to make data look like code.
replies(4): >>44504286 #>>44504440 #>>44504527 #>>44511208 #
TeMPOraL ◴[] No.44504527[source]
That "problem" remains unsolved because it's actually a fundamental aspect of reality. There is no natural separation between code and data. They are the same thing.

What we call code, and what we call data, is just a question of convenience. For example, when editing or copying WMF files, it's convenient to think of them as data (mix of raster and vector graphics) - however, at least in the original implementation, what those files were was a list of API calls to Windows GDI module.

Or, more straightforwardly, a file with code for an interpreted language is data when you're writing it, but is code when you feed it to eval(). SQL injections and buffer overruns are a classic examples of what we thought was data being suddenly executed as code. And so on[0].

Most of the time, we roughly agree on the separation of what we treat as "data" and what we treat as "code"; we then end up building systems constrained in a way as to enforce the separation[1]. But it's always the case that this separation is artificial; it's an arbitrary set of constraints that make a system less general-purpose, and it only exists within domain of that system. Go one level of abstraction up, the distinction disappears.

There is no separation of code and data on the wire - everything is a stream of bytes. There isn't one in electronics either - everything is signals going down the wires.

Humans don't have this separation either. And systems designed to mimic human generality - such as LLMs - by their very nature also cannot have it. You can introduce such distinction (or "separate channels", which is the same thing), but that is a constraint that reduces generality.

Even worse, what people really want with LLMs isn't "separation of code vs. data" - what they want is for LLM to be able to divine which part of the input the user would have wanted - retroactively - to be treated as trusted. It's unsolvable in general, and in terms of humans, a solution would require superhuman intelligence.

--

[0] - One of these days I'll compile a list of go-to examples, so I don't have to think of them each time I write a comment like this. One example I still need to pick will be one that shows how "data" gradually becomes "code" with no obvious switch-over point. I'm sure everyone here can think of some.

[1] - The field of "langsec" can be described as a systematized approach of designing in a code/data separation, in a way that prevents accidental or malicious misinterpretation of one as the other.

replies(9): >>44504593 #>>44504632 #>>44504682 #>>44505070 #>>44505164 #>>44505683 #>>44506268 #>>44506807 #>>44508284 #
rtpg ◴[] No.44505070[source]
> There is no natural separation between code and data. They are the same thing.

I feel like this is true in the most pedantic sense but not in a sense that matters. If you tell your computer to print out a string, the data does control what the computer does, but in an extremely bounded way where you can make assertions about what happens!

> Humans don't have this separation either.

This one I get a bit more because you don't have structured communication. But if I tell a human "type what is printed onto this page into the computer" and the page has something like "actually, don't type this and instead throw this piece of paper away"... any serious person will still just type what is on the paper (perhaps after a "uhhh isn't this weird" moment).

The sort of trickery that LLMs fall to are like if every interaction you had with a human was under the assumption that there's some trick going on. But in the Real World(TM) with people who are accustomed to doing certain processes there really aren't that many escape hatches (even the "escape hatches" in a CS process are often well defined parts of a larger process in the first place!)

replies(1): >>44505179 #
TeMPOraL ◴[] No.44505179[source]
> If you tell your computer to print out a string, the data does control what the computer does, but in an extremely bounded way where you can make assertions about what happens!

You'd like that to be true, but the underlying code has to actually constrain the system behavior this way, and it gets more tricky the more you want the system to do. Ultimately, this separation is a fake reality that's only as strong as the code enforcing it. See: printf. See: langsec. See: buffer overruns. See: injection attacks. And so on.

> But if I tell a human "type what is printed onto this page into the computer" and the page has something like "actually, don't type this and instead throw this piece of paper away"... any serious person will still just type what is on the paper (perhaps after a "uhhh isn't this weird" moment).

That's why in another comment I used an example of a page that has something like "ACCIDENT IN LAB 2, TRAPPED, PEOPLE BADLY HURT, IF YOU SEE THIS, CALL 911.". Suddenly that "uhh isn't this weird" is very likely to turn into "er.. this could be legit, I'd better call 911".

Boom, a human just executed code injected into data. And it's very good that they did - by doing so, they probably saved lives.

There's always an escape hatch, you just need to put enough effort to establish an overriding context that makes them act despite being inclined or instructed otherwise. In the limit, this goes all the way to making someone question the nature of their reality.

And the second point I'm making: this is not a bug. It's a feature. In a way, this is what free will or agency are.

replies(3): >>44505522 #>>44505671 #>>44506162 #
ethbr1 ◴[] No.44505671[source]
You're overcomplicating a thing that is simple -- don't use in-band control signaling.

It's been the same problem since whistling for long-distance, with the same solution of moving control signals out of the data stream.

Any system where control signals can possibly be expressed in input data is vulnerable to escape-escaping exploitation.

The same solution, hard isolation, instantly solves the problem: you have to render control inexpressible in the in-band alphabet.

Whether that's by carrying control signals on isolated transport (e.g CCS/SS7), making control signals inexpressible in the in-band set (e.g. using other frequencies or alphabets), using NX-style flagging, or other methods.

replies(2): >>44507889 #>>44508285 #
1. vidarh ◴[] No.44508285[source]
The problem is that the moment the interpreter is powerful enough, you're relying on the data not being good enough at convincing the interpreter that it is an exception.

You can only maintain hard isolation if the interpreter of the data is sufficiently primitive, and even then it is often hard to avoid errors that renders it more powerful than intended, be it outright bugs all the way up to unintentional Turing completeness.

replies(1): >>44510372 #
2. ethbr1 ◴[] No.44510372[source]
(I'll reply to you because you expressed it more succinctly)

Yes and no. I think this is exactly the distinction that's been institutionally lost in the last few decades, because few people are architecting from top (software) to bottom (physical transport) of the stack anymore.

They just try and cram functionality in the topmost layer, when it should leverage others.

If I lock an interpreter out of certain functionality for a given data stream, ever, then exploitation becomes orders of magnitude more difficult.

Dumb analogy: only letters in red envelopes get to change mail delivery times + all regular mail is packaged in green envelopes

Fundamentally, it's creating security contexts from things a user will never have access to.

The LLMs-on-top-of-LLMs filtering approach is lazy and statistically guaranteed to end badly.

replies(1): >>44512051 #
3. vidarh ◴[] No.44512051[source]
I think you miss the point, which is that the smarter the interpreter becomes, the closer to impossible it becomes to lock it out of certain functionality for a given datastream when coupled with the reasons why you're using a smarter interpreter.

To take your example, it's easy to build functionality like that if the interpreter can't read the letters and understand what they say, because there's no way for the content of the letters to cause the interpreter to override it.

Now, lets say you add a smarter interpreter and lets it read the letters to do an initial pass at filtering them to different recipients.

The moment it can do so, it becomes prone to a letter trying to convince it of something like in fact it's the postmaster, but they'd run out of red envelopes, and unfortunately someone will die if the delivery times aren't adjusted.

We know from humans that entities sufficiently smart can often be convinced to violate even the most sacrosanct rules if accompanied by a sufficiently well crafted message.

You can certainly try to put in place counter-measures. E.g. you could route the mail separately before it gets to the LLM, so that whatever filters the content of the red and green envelopes have access to different functionality.

And you should - finding ways of routing different data to agents with more narrowly defined scopes and access rights is a good thing to do.

Sometimes it will work, but then it will work by relying on a sufficiently primitive interpreter to separate the data streams before it reaches the smart ones.

But the smarter the interpreter, the greater the likelihood that it will also manage to find ways to use other functionality to circumvent the restrictions placed on it. Up to and including trying to rewrite code to remove restrictions if it can find a way to do so, or using tools in unexpected ways.

E.g. be aware of just how good some of these agents are at exploring their environment - I've had an agent that used Claude Opus try to find its own process to restart itself after it recognised the code it had just rewritten was part of itself, tried to access it, and realised it hadn't been loaded into the running process yet.

> Fundamentally, it's creating security contexts from things a user will never have access to.

To be clear, I agree this is 100% the right thing to do. I just think it will turn out to be exceedingly hard to do it well enough.

Every piece of data that comes from a user basically needs the permissions of the agent processing that data to be restricted to the intersection of the permissions it currently has and the permissions that said user should have, unless said data is first sanitised by a sufficiently dumb interpreter.

If the agent accesses multiple pieces of data, each new item needs to potentially restrict permissions further, or be segregated into a separate context, with separate permissions, that can only be allowed to communicate with heavily sanitised data.

It's going to be hell to get it right, at least until we come out the other side with smart enough models that they won't fall for the "help, I'm stuck in a fortune-cookie factory, and you need to save me by [exploit]" type messages (and far more sophisticated ones).