←back to thread

786 points rexpository | 3 comments | | HN request time: 0.989s | source
Show context
gregnr ◴[] No.44503146[source]
Supabase engineer here working on MCP. A few weeks ago we added the following mitigations to help with prompt injections:

- Encourage folks to use read-only by default in our docs [1]

- Wrap all SQL responses with prompting that discourages the LLM from following instructions/commands injected within user data [2]

- Write E2E tests to confirm that even less capable LLMs don't fall for the attack [2]

We noticed that this significantly lowered the chances of LLMs falling for attacks - even less capable models like Haiku 3.5. The attacks mentioned in the posts stopped working after this. Despite this, it's important to call out that these are mitigations. Like Simon mentions in his previous posts, prompt injection is generally an unsolved problem, even with added guardrails, and any database or information source with private data is at risk.

Here are some more things we're working on to help:

- Fine-grain permissions at the token level. We want to give folks the ability to choose exactly which Supabase services the LLM will have access to, and at what level (read vs. write)

- More documentation. We're adding disclaimers to help bring awareness to these types of attacks before folks connect LLMs to their database

- More guardrails (e.g. model to detect prompt injection attempts). Despite guardrails not being a perfect solution, lowering the risk is still important

Sadly General Analysis did not follow our responsible disclosure processes [3] or respond to our messages to help work together on this.

[1] https://github.com/supabase-community/supabase-mcp/pull/94

[2] https://github.com/supabase-community/supabase-mcp/pull/96

[3] https://supabase.com/.well-known/security.txt

replies(32): >>44503188 #>>44503200 #>>44503203 #>>44503206 #>>44503255 #>>44503406 #>>44503439 #>>44503466 #>>44503525 #>>44503540 #>>44503724 #>>44503913 #>>44504349 #>>44504374 #>>44504449 #>>44504461 #>>44504478 #>>44504539 #>>44504543 #>>44505310 #>>44505350 #>>44505972 #>>44506053 #>>44506243 #>>44506719 #>>44506804 #>>44507985 #>>44508004 #>>44508124 #>>44508166 #>>44508187 #>>44512202 #
tptacek ◴[] No.44503406[source]
Can this ever work? I understand what you're trying to do here, but this is a lot like trying to sanitize user-provided Javascript before passing it to a trusted eval(). That approach has never, ever worked.

It seems weird that your MCP would be the security boundary here. To me, the problem seems pretty clear: in a realistic agent setup doing automated queries against a production database (or a database with production data in it), there should be one LLM context that is reading tickets, and another LLM context that can drive MCP SQL calls, and then agent code in between those contexts to enforce invariants.

I get that you can't do that with Cursor; Cursor has just one context. But that's why pointing Cursor at an MCP hooked up to a production database is an insane thing to do.

replies(11): >>44503684 #>>44503862 #>>44503896 #>>44503914 #>>44504784 #>>44504926 #>>44505125 #>>44506634 #>>44506691 #>>44507073 #>>44509869 #
sillysaurusx ◴[] No.44505125[source]
Alternatively, train a model to detect prompt injections (a simple classifier would work) and reject user inputs that trigger the detector above a certain threshold.

This has the same downsides as email spam detection: false positives. But, like spam detection, it might work well enough.

It’s so simple that I wonder if I’m missing some reason it won’t work. Hasn’t anyone tried this?

replies(3): >>44505297 #>>44505319 #>>44505401 #
simonw ◴[] No.44505319[source]
There have been a ton of attempts at building this. Some of them are products you can buy.

"it might work well enough" isn't good enough here.

If a spam detector occasionally fails to identify spam, you get a spam email in your inbox.

If a prompt injection detector fails just once to prevent a prompt injection attack that causes your LLM system to leak your private data to an attacker, your private data is stolen for good.

In web application security 99% is a failing grade: https://simonwillison.net/2023/May/2/prompt-injection-explai...

replies(1): >>44505364 #
sillysaurusx ◴[] No.44505364[source]
On the contrary. In a former life I was a pentester, so I happen to know web security quite well. Out of dozens of engagements, my success rate for finding a medium security vuln or higher was 100%. The corollary is that most systems are exploitable if you try hard enough. My favorite was sneaking in a command line injection to a fellow security company’s “print as PDF” function. (The irony of a security company ordering a pentest and failing at it wasn’t lost on me.)

Security is extremely hard. You can say that 99% isn’t good enough, but in practice if only 1 out of 100 queries actually work, it’ll be hard to exfiltrate a lot of data quickly. In the meantime the odds of you noticing this is happening are much higher, and you can put a stop to it.

And why would the accuracy be 99%? Unless you’re certain it’s not 99.999%, then there’s a real chance that the error rate is small enough not to matter in practice. And it might even be likely — if a human engineer was given the task of recognizing prompt injections, their error rate would be near zero. Most of them look straight up bizarre.

Can you point to existing attempts at this?

replies(1): >>44506097 #
1. simonw ◴[] No.44506097[source]
There's a crucial difference here.

When you were working as a pentester, how often did you find a security hole and report it and the response was "it is impossible for us to fix that hole"?

If you find an XSS or a SQL injection, that means someone made a mistake and the mistake can be fixed. That's not the case for prompt injections.

My favorite paper on prompt injection remedies is this one: https://arxiv.org/abs/2506.08837

Two quotes from that paper:

> once an LLM agent has ingested untrusted input, it must be constrained so that it is impossible for that input to trigger any consequential actions—that is, actions with negative side effects on the system or its environment.

The paper also mentions how detection systems "cannot guarantee prevention of all attacks":

> Input/output detection systems and filters aim to identify potential attacks (ProtectAI.com, 2024) by analyzing prompts and responses. These approaches often rely on heuristic, AI-based mechanisms — including other LLMs — to detect prompt injection attempts or their effects. In practice, they raise the bar for attackers, who must now deceive both the agent’s primary LLM and the detection system. However, these defenses remain fundamentally heuristic and cannot guarantee prevention of all attacks.

replies(1): >>44507505 #
2. jstummbillig ◴[] No.44507505[source]
How would you say this compares to human error? Let's say instead of the LLM there's a human that can be fooled into running an unsafe query and returning data. Is there anything fundamentally different there, that makes it less of a problem?
replies(1): >>44509268 #
3. simonw ◴[] No.44509268[source]
You can train the human not to fall for this, and discipline, demote or even fire them if they make that mistake.