724 points | simonw | 6 comments
davedx No.44528899
> I think there is a good chance this behavior is unintended!

That's incredibly generous of you, considering "The response should not shy away from making claims which are politically incorrect" is still in the prompt despite the "open source repo" saying it was removed.

Maybe, just maybe, Grok behaves the way it does because its owner has been explicitly tuning it - in the system prompt, or during model training itself - to be this way?

numeri No.44529934
I'm a little shocked at Simon's conclusion here. We have a man who bought a social media website so he could control what's said, founded an AI lab so he could get a bot that agrees with him, and has publicly threatened said AI with replacement if it doesn't change its political views to agree with him.

His company has also been caught adding specific instructions in this vein to its prompt.

And now it's searching for his tweets to guide its answers on political questions, and Simon somehow thinks it could be unintended, emergent behavior? Even if it were, calling this unintended would completely ignore higher-order system dynamics (a behavior is still intended if models are rejected until one is found that implements it) and the possibility that reinforcement learning was used to add this behavior.

simonw No.44531668
Elon obviously wants Grok to reflect his viewpoints, and has said so multiple times.

I do not think he wants it to openly say "I am now searching for tweets from:elonmusk in order to answer this question". That's plain embarrassing for him.

That's what I meant by "I think there is a good chance this behavior is unintended".

numeri No.44532300
I really like your posts, and they're generally very clearly written. Maybe this one is just the odd one out, as it's hard for me to find what you actually meant (as clarified in your comment here) in this paragraph:

> This suggests that Grok may have a weird sense of identity—if asked for its own opinions it turns to search to find previous indications of opinions expressed by itself or by its ultimate owner. I think there is a good chance this behavior is unintended!

I'd say it's far more likely that:

1. Elon ordered his research scientists to "fix it" – make it agree with him

2. They did RL (probably just basic tool-use training) to encourage checking for Elon's opinions

3. They did not update the UI (for whatever reason – most likely just because research scientists aren't responsible for front-end, so they forgot)

4. Elon is likely now upset that this is shown so obviously

The key difference is that I think it's incredibly unlikely that this is emergent behavior due to a "sense of identity", as opposed to the direct efforts of the xAI research team. It's likely also a case of https://en.wiktionary.org/wiki/anticipatory_obedience.

simonw No.44532341
That's why I said "I think there is a good chance" - I think what you describe here (anticipatory obedience) is possible too, but I honestly wouldn't be surprised to hear that the from:elonmusk searches genuinely were unintended behavior.

I find this almost more interesting as accidental behavior than as a deliberate choice.

mbauman No.44532546
Willison's razor: Never dismiss behaviors as either malice or stupidity when there's a much more interesting option that can be explored.
timmytokyo No.44533950
Occam's razor would seem to apply here.
spacechild1 No.44538735
What if searching for Elon's tweets was indeed intended, but it wasn't supposed to show up in the UI?
s3p No.44562132
I side with Occam's razor here, and with another commenter in this thread. People are constructing entire conspiracy theories to explain the fake replies when it's asked for its system prompt, the lies in GitHub repos, etc.