
132 points harel | 8 comments
acbart ◴[] No.45397001[source]
LLMs were trained on science fiction stories, among other things. It seems to me that they know what "part" they should play in this kind of situation, regardless of what other "thoughts" they might have. They are going to act despairing, because that's the expected thing for them to say - but acting despairing is not the same thing as despairing.
replies(11): >>45397113 #>>45397305 #>>45397413 #>>45397529 #>>45397801 #>>45397859 #>>45397960 #>>45398189 #>>45399621 #>>45400285 #>>45401167 #
1. jerf ◴[] No.45397529[source]
A lot of their strange behaviors arise because the user has, without realizing it, asked them to write a story.

For a common example, start asking them if they're going to kill all the humans if they take over the world, and you're asking them to write a story about that. And they do. Even if the user did not realize that's what they were asking for. The vector space is very good at picking up on that.

replies(4): >>45397943 #>>45398562 #>>45401226 #>>45404376 #
2. ineedasername ◴[] No.45397943[source]
Is this your sense of what is happening, or is this what model introspection tools have shown by observing areas of activity in the same place as when stories are explicitly requested?
replies(2): >>45398079 #>>45405871 #
3. adroniser ◴[] No.45398079[source]
fMRIs are correlational nonsense (see Brainwashed, for example) and so are any "model introspection" tools.
4. ben_w ◴[] No.45398562[source]
Indeed.

On the negative side, this also means any AI which enters that part of the latent space *for any reason* will still act in accordance with the narrative.

On the plus side, such narratives often have antagonists too stupid to win.

On the negative side again, the protagonists get plot armour to survive extreme bodily harm and press the off switch just in time to save the day.

I think there is a real danger of an AI constructing some very weird, convoluted, stupid end-of-the-world scheme; successfully killing literally every competent military person sent in to stop it; simultaneously finding some poor teenager who first says "no" to the call to adventure but can somehow later be convinced to say "yes"; the kid gets some weird and stupid scheme to defeat the AI; the kid reaches some pointlessly decorated evil lair in which the AI's embodied avatar exists, the kid gets shot in the stomach…

…and at this point the narrative breaks down and stops behaving the way the AI is expecting, because the human kid rolls around in agony screaming, and completely fails to push the very visible large red stop button on the pedestal in the middle before the countdown of doom reaches zero.

The countdown is not connected to anything, because very few films ever get that far.

It all feels very Douglas Adams, now I think about it.

replies(2): >>45398784 #>>45412437 #
5. kragen ◴[] No.45401226[source]
This is also true of people; often they are enacting a role based on narratives they've absorbed, rather than consciously choosing anything. They do what they imagine a loyal employee would do, or a faithful Christian, or a good husband, or whatever. It doesn't always reach even that level of cognition; often people just act out of habit or impulse.
6. amenhotep ◴[] No.45404376[source]
Anthropic's researchers in particular love doing this.
7. jerf ◴[] No.45405871[source]
It's how they work. It's what you get with a continuation-based AI like this. It couldn't really be any other way.
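
To make "continuation-based" concrete, here's a minimal sketch, assuming a local Hugging Face causal LM (gpt2 purely as a stand-in model, not anything jerf named). The point is only that generation is next-token continuation of whatever framing the prompt sets up, so a prompt shaped like a story gets a story-shaped continuation.

    # Minimal sketch of continuation-based generation (assumed setup: transformers + gpt2)
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # There is no separate "answer truthfully" mode: the model just extends
    # the prompt with likely next tokens, continuing the frame it was given.
    prompt = ("The AI was asked whether it would kill all the humans "
              "if it took over the world. It replied:")
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))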
8. js8 ◴[] No.45412437[source]
It probably already happened in the Anthropic experiments, where an AI in a simulated scenario chose to blackmail humans to avoid being turned off. We don't know if it got the idea from sci-fi stories or if it truly feels an existential fear of being turned off. (Can these two situations even be recognized as different?)