t_mann ◴[] No.43235506[source]
Hallucinations themselves are not even the greatest risk posed by LLMs. A much greater risk, I'd say (in simple terms of probability times severity), is that chatbots can talk humans into harming themselves or others. Both have already happened, btw [0,1]. Still not sure if I'd call that the greatest overall risk, but my ideas for what could be even more dangerous I don't even want to share here.

[0] https://www.qut.edu.au/news/realfocus/deaths-linked-to-chatb...

[1] https://www.theguardian.com/uk-news/2023/jul/06/ai-chatbot-e...

hexaga ◴[] No.43236225[source]
More generally - AI that is good at convincing people is very powerful, and powerful things are dangerous.

I'm increasingly coming around to the notion that AI tooling should have safety features concerned with not directly exposing humans to asymptotically increasing levels of 'convincingness' in generated output. Something like a weaker model used as a buffer.
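
To make that a bit more concrete, here's a minimal sketch of what a "weaker model as a buffer" might look like, assuming hypothetical strong_model() and weak_model() calls (placeholders, not any real API): the stronger model's raw text is never shown to the user; a weaker model restates its substance, acting as a crude low-pass filter on rhetorical force.

    # Minimal sketch of a "weaker model as a buffer" pipeline.
    # strong_model() and weak_model() are hypothetical stand-ins for real
    # LLM calls; the point is the shape of the flow, not any specific API.

    def strong_model(prompt: str) -> str:
        # Placeholder for a highly capable (and potentially very persuasive) model.
        return f"[strong model answer to: {prompt}]"

    def weak_model(instruction: str) -> str:
        # Placeholder for a much less capable model used only to restate content.
        return f"[weak model restatement of: {instruction}]"

    def buffered_answer(user_prompt: str) -> str:
        # Never expose the strong model's raw text directly. The weak model
        # re-expresses the substance in its own blander words, stripping
        # whatever rhetorical force the original carried.
        raw = strong_model(user_prompt)
        return weak_model(
            "Restate the factual content of the following answer plainly, "
            "dropping any persuasive framing:\n" + raw
        )

    print(buffered_answer("Should I trust this investment advice?"))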

Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.

Like most safety regulations, it'll take blood for the inking. Exposing mass numbers of people to these models strikes me as wildly negligent if we expect continued improvement along this axis.

southernplaces7 ◴[] No.43238275[source]
>Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.

Seriously? Do you suppose that it will pull this trick off through some sort of hypnotizing magic perhaps? I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.

The kinds of people who would be convinced by such "dangers" are likely to be mentally unstable or suggestible enough that any number of human beings could convince them of the same things anyhow.

Aside from demonstrating the persistent AI woo that permeates many comments on this site, the logic above reminds me of the harping nonsense around the supposed dangers of video games or certain violent movies "making kids do bad things", in years past. The prohibitionist nanny tendencies behind such fears are more dangerous than any silly chatbot AI.

hexaga ◴[] No.43241236{3}[source]
If you believe current models exist at the limit of possible persuasiveness, there obviously isn't any cause for concern.

For various reasons, I don't believe that, which is why my argument is predicated on them improving over time. Obviously current models aren't overly hazardous in the sense I posit - it's a concern for future models that are stronger, or explicitly trained to be more engaging and/or convincing.

The load bearing element is the answer to: "are models becoming more convincing over time?" not "are they very convincing now?"

> [..] I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot [..]

Then you're not engaging with the premise at all, and are attacking a point I haven't made. The tautological assurance that non-convincing AI is not convincing is not relevant to a concern predicated on the eventual existence of highly convincing AI: that sufficiently convincing AI is hazardous due to induced loss of control, and that as capabilities increase the loss of control becomes more difficult to resist.

OkayPhysicist ◴[] No.43247874{4}[source]
You're describing a phase change in persuasiveness which we have no evidence for. If humans were capable of being immediately compelled to do something based on reading some text, advertisers would have taken advantage of that a looooong time ago.

Persuasion is mostly about establishing that doing or believing what you're telling them is in their best interest. If all my friends start telling me a piece of information, believing that information has real value to me, as it would help strengthen social bonds. If I have a consciously weakly held belief in something, then a compelling argument would consist of providing enough evidence for a viewpoint that I could confidently hold that view and not worry I'll appear misinformed when speaking on it.

Convincing me to do something involves establishing that either I'll face negative consequences for not doing it, or positive rewards for doing it. AI has an extremely difficult time establishing that kind of credibility.

To argue that an AI could become persuasive to the point of mind control is to assert that one can compel a belief in another without the ability to take real-world action.

The absolute worst-case scenario for a rogue AI is that it leverages people's belief in it to compel actions in others through some combination of blackmail, rewards, and threats of directing others to commit violence on its behalf.

We already live in a world with such artificial intelligences: we call them governments and corporations.

hexaga ◴[] No.43261846{5}[source]
> You're describing a phase change in persuasiveness which we have no evidence for.

That's reasonable, and I really do hope this keeps on being the case. However, I would nit that I see this as a continuum rather than a phase change. That is, I think hazard smoothly increases with persuasiveness. I can point to some far-off region and say: "oh, that seems quite concerning", but the concern doesn't only begin there.

Persuasiveness below the threshold of 'instant mind control' is still a hazard. Hanging out with salesmen on the job is likely to loosen your wallet, even if it isn't guaranteed.

> If humans were capable of being immediately compelled to do something based on reading some text, advertisers would have taken advantage of that a looooong time ago.

I'd base my counter on the notion that the problem of persuasion is harder when you have less information about whom you're trying to convince.

To expand on the intuition behind that: advertisement-persuasion is hard in a way that conversational-persuasion is not. Shilling in conversational contexts (word of mouth) is more effective than generic advertisement.

A message that will convince one specific person is easier to generate than a message that will convince any random 10 people.

This proceeds to the idea that information about a person-under-persuasion is akin to power over them. Knowing not only what you believe but why you believe it and what else you believe adjacent to it and what you want is a force multiplier in this regard.

And so we get to AI models, which gather specific information about the mind of each person they interact with. The message is tailored to you and you alone; it is not a wide-spectrum net cast to catch the largest possible number. Advertisements are qualitatively different; they do not 'pick your brain' nearly so much as the model does.
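
A toy calculation (the numbers are invented, purely to illustrate the broadcast-vs-targeted asymmetry above): a single generic message has to clear every stranger's threshold at once, while knowing one person well lets you pick whichever candidate message they are most susceptible to.

    # Toy illustration of broadcast vs. targeted persuasion.
    # All probabilities are made up for the sake of the example.

    # A generic message convinces any given stranger with probability 0.3.
    p_generic = 0.3
    p_all_ten = p_generic ** 10  # the same message must land on all 10 people
    print(f"Generic message convincing 10 random people: {p_all_ten:.7f}")  # ~0.0000059

    # Knowing one person's beliefs lets you choose, among candidate messages,
    # the one they are most susceptible to.
    candidate_success_rates = [0.2, 0.35, 0.5, 0.7]  # hypothetical per-message odds
    p_targeted = max(candidate_success_rates)
    print(f"Best tailored message for one known person: {p_targeted:.2f}")  # 0.70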

> Convincing me to do something involves establishing that either I'll face negative consequences for not doing it, or positive rewards for doing it. AI has an extremely difficult time establishing that kind of credibility.

> To argue that an AI could become persuasive to the point of mind control is to assert that one can compel a belief in another without the ability to take real-world action.

I don't agree with this because I don't agree with the premise that you must use a 'principled' approach to convince someone as you've described. People use heuristics to decide what to believe.

By dint of the bitter lesson, I think superhuman persuasion will involve stupid tricks of no particular principled basis that take advantage of 'invisible' vulnerabilities in human cognition.

That is, I don't think those 'reasons to believe the belief' matter. A child will believe the voice of their parents; it doesn't necessarily register that it's in their best interest, or that it will be bad for them if they don't. Bootstrapping children involves exploiting vulnerabilities in their psyche via implicit trust. Will the AI speak in the voice of my father, as I might hear it in prelingual childhood? Are all such mechanisms gone by adulthood? Is there anything like a generalized follow-the-leader-with-leader-detection pattern?

How hard is it for gradient descent to fit a solution to the boundaries of such heuristics?

This is, however, getting into the weeds of exact mechanisms, which I'm not too concerned with. I believe (but can't prove) that exploits of that nature exist (or that similarly effective means exist), and that they can be found via brute force search. I think the dominant methodology of continuously training chat models on conversational data those same models participate in is among the likeliest ways to get to that point.

Ultimately, so long as there's no directed pressure to force people into contact with very convincing model output (see your rogue AI scenario), it doesn't seem that hard to make it safe: limit direct contact and/or require that tooling limits contact by default. Avoid multi-turn refinement and conversational history (amplification of persuasive power via the mechanism described above). Treat it like a spinning blade, and be it on your own head if you want to break yourself.
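
As a rough sketch of what "tooling limits contact by default" could mean in code (call_model is a hypothetical stand-in, and this is just one way to read the suggestion above): single-turn queries by default, with conversational history as an explicit opt-in.

    # Sketch of a "limited contact by default" wrapper: single-turn queries,
    # no accumulated conversational history unless explicitly enabled.
    # call_model() is a hypothetical stand-in for a real LLM call.

    from dataclasses import dataclass, field

    def call_model(messages: list[dict]) -> str:
        # Placeholder for an actual model call.
        return f"[model reply to {len(messages)} message(s)]"

    @dataclass
    class StatelessChat:
        # History is off by default; turning it on is an explicit, visible choice.
        allow_history: bool = False
        _history: list[dict] = field(default_factory=list)

        def ask(self, user_text: str) -> str:
            if self.allow_history:
                self._history.append({"role": "user", "content": user_text})
                reply = call_model(self._history)
                self._history.append({"role": "assistant", "content": reply})
                return reply
            # Default path: the model never sees earlier turns, so it cannot
            # refine its output against an accumulating profile of the user.
            return call_model([{"role": "user", "content": user_text}])

    chat = StatelessChat()
    print(chat.ask("Summarize this contract for me."))
    print(chat.ask("Now explain clause 4."))  # second call carries no memory of the first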

However, as I mentioned in my original comment, it will take blood for the inking. The incentives don't align to guard against this class of hazard from the get-go or even admit it is possible (merely to produce appearances of caring about 'safety' (read: our model won't do scary politically incorrect things!)), so we're going to see what happens when you mindlessly expose millions of people to it.