←back to thread

156 points abirag | 6 comments | | HN request time: 0.207s | source | bottom
1. filearts ◴[] No.45308754[source]
It is fascinating how similar the prompt construction was to a phishing campaign in terms of characteristics.

  - Authority assertion
  - False urgency
  - Technical legitimacy
  - Security theater
Prompt injection here is like a phishing campaign against an entity with no consciousness or ability to stop and question through self-reflection.
replies(2): >>45309747 #>>45310870 #
2. XenophileJKO ◴[] No.45309747[source]
I'm fairly convinced that with the right training.. the ability of the LLM to be "skeptical" and resilient to these kinds of attacks will be pretty robust.

The current problem is that making the models resistant to "persona" injection is in opposition to much of how the models are also used conversationally. I think this is why you'll end up with hardened "agent" models and then more open conversational models.

I suppose it is also possible that the models can have an additional non-prompt context applied that sets expectations, but that requires new architecture for those inputs.

replies(2): >>45309999 #>>45314130 #
3. BarryMilo ◴[] No.45309999[source]
Isn't the whole problem that it's nigh-impossible to isolate context from input?
replies(1): >>45311690 #
4. freakynit ◴[] No.45310870[source]
Pretty similar in spirit to CSRF:

Both trick a privileged actor into doing something the user didn't intend using inputs the system trusts.

In this case, a malicious PDF that uses prompt-injection to get a Notion agent (which already has access to your workspace) to call an external web-tool and exfiltrate page content. Tjhis is simialr to CSRF's core idea - an attacker causes an authenticated principal to make a request - except here the "principal" is an autonomous agent with tool access rather than the browser carrying cookies.

Thus, same abuse-of-privilege pattern, just with different technical surface (prompt-injection + tool chaining vs. forged browser HTTP requests).

5. Terr_ ◴[] No.45311690{3}[source]
Yeah, ultimately the LLM is guess_what_could_come_next(document) in a loop with some I/O either doing something with the latest guess or else appending more content to the document from elsewhere.

Any distinctions inside the document involve the land of statistical patterns and weights, rather than hard auditable logic.

6. dns_snek ◴[] No.45314130[source]
What does "pretty robust" mean, how do you even assess that? How often are you okay with your most sensitive information getting stolen and is everyone else going to be okay with their information being compromised once or twice a year, every time someone finds a reproducible jailbreak?