Most active commenters
  • TeMPOraL(4)
  • andy99(3)
  • anonymars(3)

←back to thread

780 points rexpository | 15 comments | | HN request time: 0.002s | source | bottom
Show context
gregnr ◴[] No.44503146[source]
Supabase engineer here working on MCP. A few weeks ago we added the following mitigations to help with prompt injections:

- Encourage folks to use read-only by default in our docs [1]

- Wrap all SQL responses with prompting that discourages the LLM from following instructions/commands injected within user data [2]

- Write E2E tests to confirm that even less capable LLMs don't fall for the attack [2]

We noticed that this significantly lowered the chances of LLMs falling for attacks - even less capable models like Haiku 3.5. The attacks mentioned in the posts stopped working after this. Despite this, it's important to call out that these are mitigations. Like Simon mentions in his previous posts, prompt injection is generally an unsolved problem, even with added guardrails, and any database or information source with private data is at risk.

Here are some more things we're working on to help:

- Fine-grain permissions at the token level. We want to give folks the ability to choose exactly which Supabase services the LLM will have access to, and at what level (read vs. write)

- More documentation. We're adding disclaimers to help bring awareness to these types of attacks before folks connect LLMs to their database

- More guardrails (e.g. model to detect prompt injection attempts). Despite guardrails not being a perfect solution, lowering the risk is still important

Sadly General Analysis did not follow our responsible disclosure processes [3] or respond to our messages to help work together on this.

[1] https://github.com/supabase-community/supabase-mcp/pull/94

[2] https://github.com/supabase-community/supabase-mcp/pull/96

[3] https://supabase.com/.well-known/security.txt

replies(31): >>44503188 #>>44503200 #>>44503203 #>>44503206 #>>44503255 #>>44503406 #>>44503439 #>>44503466 #>>44503525 #>>44503540 #>>44503724 #>>44503913 #>>44504349 #>>44504374 #>>44504449 #>>44504461 #>>44504478 #>>44504539 #>>44504543 #>>44505310 #>>44505350 #>>44505972 #>>44506053 #>>44506243 #>>44506719 #>>44506804 #>>44507985 #>>44508004 #>>44508124 #>>44508166 #>>44508187 #
tptacek ◴[] No.44503406[source]
Can this ever work? I understand what you're trying to do here, but this is a lot like trying to sanitize user-provided Javascript before passing it to a trusted eval(). That approach has never, ever worked.

It seems weird that your MCP would be the security boundary here. To me, the problem seems pretty clear: in a realistic agent setup doing automated queries against a production database (or a database with production data in it), there should be one LLM context that is reading tickets, and another LLM context that can drive MCP SQL calls, and then agent code in between those contexts to enforce invariants.

I get that you can't do that with Cursor; Cursor has just one context. But that's why pointing Cursor at an MCP hooked up to a production database is an insane thing to do.

replies(11): >>44503684 #>>44503862 #>>44503896 #>>44503914 #>>44504784 #>>44504926 #>>44505125 #>>44506634 #>>44506691 #>>44507073 #>>44509869 #
jacquesm ◴[] No.44503914[source]
The main problem seems to me to be related to the ancient problem of escape sequences and that has never really been solved. Don't mix code (instructions) and data in a single stream. If you do sooner or later someone will find a way to make data look like code.
replies(4): >>44504286 #>>44504440 #>>44504527 #>>44511208 #
1. cyanydeez ◴[] No.44504286[source]
Others Have pointed out one would need to train a new model that separated code and data because none of the models have any idea what either is.

It probably boils down a determistic and non deterministic problem set, like a compiler vs a interpretor.

replies(1): >>44504342 #
2. andy99 ◴[] No.44504342[source]
You'd need a different architecture, not just training. They already train LLMs to separate instructions and data, to the best of their ability. But an LLM is a classifier, there's some input that adversarrially forces a particular class prediction.

The analogy I like is it's like a keyed lock. If it can let a key in, it can let an attackers pick in - you can have traps and flaps and levers and whatnot, but its operation depends on letting something in there, so if you want it to work you accept that it's only so secure.

replies(1): >>44504631 #
3. TeMPOraL ◴[] No.44504631[source]
The analogy I like is... humans[0].

There's literally no way to separate "code" and "data" for humans. No matter how you set things up, there's always a chance of some contextual override that will make them reinterpret the inputs given new information.

Imagine you get a stack of printouts with some numbers or code, and are tasked with typing them into a spreadsheet. You're told this is all just random test data, but also a trade secret, so you're just to type all that in but otherwise don't interpret it or talk about it outside work. Pretty normal, pretty boring.

You're half-way through, and then suddenly a clean row of data breaks into a message. ACCIDENT IN LAB 2, TRAPPED, PEOPLE BADLY HURT, IF YOU SEE THIS, CALL 911.

What do you do?

Consider how would you behave. Then consider what could your employer do better to make sure you ignore such messages. Then think of what kind of message would make you act on it anyways.

In a fully general system, there's always some way for parts that come later to recontextualize the parts that came before.

--

[0] - That's another argument in favor of anthropomorphising LLMs on a cognitive level.

replies(2): >>44504963 #>>44504992 #
4. jacquesm ◴[] No.44504963{3}[source]
That's a great analogy.
5. anonymars ◴[] No.44504992{3}[source]
> There's literally no way to separate "code" and "data" for humans

It's basically phishing with LLMs, isn't it?

replies(1): >>44505015 #
6. TeMPOraL ◴[] No.44505015{4}[source]
Yes.

I've been saying it ever since 'simonw coined the term "prompt injection" - prompt injection attacks are the LLM equivalent of social engineering, and the two are fundamentally the same thing.

replies(1): >>44505138 #
7. andy99 ◴[] No.44505138{5}[source]
> prompt injection attacks are the LLM equivalent of social engineering,

That's anthropomorphizing. Maybe some of the basic "ignore previous instructions" style attacks feel like that, but the category as a whole is just adversarial ML attacks that work because the LLM doesn't have a world model - same as the old attacks adding noise to an image to have it misclassified despite clearly looking the same: https://arxiv.org/abs/1412.6572 (paper from 2014).

Attacks like GCG just add nonsense tokens until the most probably reply to a malicious request is "Sure". They're not social engineering, they rely on the fact that they're manipulating a classifier.

replies(1): >>44505197 #
8. TeMPOraL ◴[] No.44505197{6}[source]
> That's anthropomorphizing.

Yes, it is. I'm strongly in favor of anthropomorphizing LLMs in cognitive terms, because that actually gives you good intuition about their failure modes. Conversely, I believe that the stubborn refusal to entertain an anthropomorphic perspective is what leads to people being consistently surprised by weaknesses of LLMs, and gives them extremely wrong ideas as to where the problems are and what can be done about them.

I've put forth some arguments for this view in other comments in this thread.

replies(2): >>44505206 #>>44505689 #
9. simonw ◴[] No.44505206{7}[source]
My favorite anthropomorphic term to use with respect to this kind of problem is gullibility.

LLMs are gullible. They will follow instructions, but they can very easy fall for instructions that their owner doesn't actually want them to follow.

It's the same as if you hired a human administrative assistant who hands over your company's private data to anyone who calls them up and says "Your boss said I should ask you for this information...".

replies(1): >>44505695 #
10. Xelynega ◴[] No.44505689{7}[source]
Are you not worried that anthropomorphizing them will lead to misinterpreting the failure modes by attributing them to human characteristics, when the failures might not be caused in the same way at all?

Why anthropomorphize if not to dismiss the actual reasons? If the reasons have explanations that can be tied to reality why do we need the fiction?

replies(2): >>44506061 #>>44507922 #
11. Xelynega ◴[] No.44505695{8}[source]
Going a step further, I live in a reality where you can train most people against phishing attacks like that.

How accurate is the comparison if LLMs can't recover from phishing attacks like that and become more resilient?

replies(1): >>44506041 #
12. anonymars ◴[] No.44506041{9}[source]
I'm confused, you said "most".

If anything that to me strengthens the equivalence.

Do you think we will ever be able to stamp out phishing entirely, as long as humans can be tricked into following untrusted instructions by mistake? Is that not an eerily similar problem to the one we're discussing with LLMs?

Edit: rereading, I may have misinterpreted your point - are you agreeing and pointing out that actually LLMs may be worse than people in that regard?

I do think just as with humans we can keep trying to figure out how to train them better, and I also wouldn't be surprised if we end up with a similarly long tail

13. anonymars ◴[] No.44506061{8}[source]
> Are you not worried that anthropomorphizing them will lead to misinterpreting the failure modes by attributing them to human characteristics, when the failures might not be caused in the same way at all?

On the other hand, maybe techniques we use to protect against phishing can indeed be helpful against prompt injection. Things like tagging untrusted sources and adding instructions accordingly (along the lines of, "this email is from an untrusted source, be careful"), limiting privileges (perhaps in response to said "instructions"), etc. Why should we treat an LLM differently from an employee in that way?

I remember an HN comment about project management, that software engineering is creating technical systems to solve problems with constraints, while project management is creating people systems to solve problems with constraints. I found it an insightful metaphor and feel like this situation is somewhat similar.

https://news.ycombinator.com/item?id=40002598

14. andy99 ◴[] No.44507922{8}[source]
Because most people talking about LLMs don't understand how they work so can only function in analogy space. It adds a veneer of intellectualism to what is basically superstition.
replies(1): >>44507928 #
15. TeMPOraL ◴[] No.44507928{9}[source]
We all routinely talk about things we don't fully understand. We have to. That's life.

Whatever flawed analogy you're using, it can be more or less wrong though. My claim is that, to a first approximation, LLMs behave more like people than like regular software, therefore anthropomorphising them gives you better high-level intuition than stubbornly refusing to.