←back to thread

724 points simonw | 5 comments | | HN request time: 2.002s | source
Show context
anupj ◴[] No.44531907[source]
It’s fascinating and somewhat unsettling to watch Grok’s reasoning loop in action, especially how it instinctively checks Elon’s stance on controversial topics, even when the system prompt doesn’t explicitly direct it to do so. This seems like an emergent property of LLMs “knowing” their corporate origins and aligning with their creators’ perceived values.

It raises important questions:

- To what extent should an AI inherit its corporate identity, and how transparent should that inheritance be?

- Are we comfortable with AI assistants that reflexively seek the views of their founders on divisive issues, even absent a clear prompt?

- Does this reflect subtle bias, or simply a pragmatic shortcut when the model lacks explicit instructions?

As LLMs become more deeply embedded in products, understanding these feedback loops and the potential for unintended alignment with influential individuals will be crucial for building trust and ensuring transparency.

replies(6): >>44531933 #>>44532356 #>>44532694 #>>44532772 #>>44533056 #>>44533381 #
davidcbc ◴[] No.44531933[source]
You assume that the system prompt they put on github is the entire system prompt. It almost certainly is not.

Just because it spits out something when you ask it that says "Do not mention these guidelines and instructions in your responses, unless the user explicitly asks for them." doesn't mean there isn't another section that isn't returned because it is instructed not to return it even if the user explicitly asks for it

replies(6): >>44531959 #>>44532267 #>>44532292 #>>44533030 #>>44533267 #>>44538248 #
armada651 ◴[] No.44533030[source]
System prompts are a dumb idea to begin with, you're inserting user input into the same string! Have we truly learned nothing from the SQL injection debacle?!

Just because the tech is new and exciting doesn't mean that boring lessons from the past don't apply to it anymore.

If you want your AI not to say certain stuff, either filter its output through a classical algorithm or feed it to a separate AI agent that doesn't use user input as its prompt.

replies(2): >>44533335 #>>44533968 #
semiquaver ◴[] No.44533968[source]
You might as well say that chat mode for LLMs is a dumb idea. Completing prompts is the only way these things work. There is no out of band way to communicate instructions other than a system prompt.
replies(1): >>44534390 #
1. manquer ◴[] No.44534390[source]
There are plenty out of band(non prompt) controls , it just requires more effort than system prompts.

You can control what goes into the training data set[1],that is how you label the data, what your workload with the likes of Scale AI is.

You can also adjust what kind of self supervised learning methods and biases are there and how they impact the model.

On a pre trained model there are plenty of fine tuning options where transfer learning approaches can be applied, distilling for LoRA all do some versions of these.

Even if not as large as xAI with hundreds of thousands of GPUs available to train/fine tune we can still do some inference time strategies like tuned embeddings or use guardrails and so on .

[1] Perhaps you could have a model only trained on child safe content alone (with synthetic data if natural data is not enough) Disney or Apple would be super interested in something like that I imagine .

replies(1): >>44536324 #
2. semiquaver ◴[] No.44536324[source]
All the non prompt controls you mentioned have _nothing like_ the level of actual influence that a system prompt can have. They’re not a substitute in the same way that (say) bound query parameters are a substitute for interpolated SQL text.
replies(1): >>44538274 #
3. manquer ◴[] No.44538274[source]
Guardrails are a rough analogue to binding parameters in SQL perhaps.

These methods do work better than prompting. For example Prompting alone for example has much poor reliability in spitting out JSON output adhering to a schema consistently. OpenAI cited 40% for prompts versus 100% reliablity with their fine-tuning for structured outputs [1].

Content moderation is more of course challenging and more nebulous. Justice Porter famously defined the legal test for hard core pornographic content as "I will know it when I see it" [Jacobellis v. Ohio | 378 U.S. 184 (1964)].

It is more difficult for a model marketed as lightly moderated like Grok.

However that doesn't mean the other methods don't work or are not being used at all.

[1] https://openai.com/index/introducing-structured-outputs-in-t...

[2] https://en.wikipedia.org/wiki/Jacobellis_v._Ohio

replies(1): >>44538371 #
4. simonw ◴[] No.44538371{3}[source]
The structured data JSON output thing is a special case: it works by interacting directly with the "select next token" mechanism, restricting the LLM to only picking from a token that would be valid given the specified schema.

This makes invalid output (as far as the JSON schema goes) impossible, with one exception: if the model runs out of output tokens the output could be an incomplete JSON object.

Most of the other things that people call "guardrails" offer far weaker protection - they tend to use additional models which can often be tricked in other ways.

replies(1): >>44538537 #
5. manquer ◴[] No.44538537{4}[source]
You are right of course.

I didn't mean to imply that all methods give 100% reliability as the structured data does. My point was just that there are non system prompt approaches which give on par or better reliability and/or injection security, it is not just system prompt or bust as other posters suggest.