That's not even considering tool use!
If you hook up a chatbot to a chat interface, or add tool use, it is probable that it will eventually output something that it should not, and that output will cause a problem. Preventing that is an unsolved problem, just as preventing people from abusing computers is an unsolved problem.
(1) Execute yes (with or without arguments, whatever you desire).
(2) Let the program run as long as you desire.
(3) When you stop desiring the program to spit out your argument,
(4) Stop the program.
Between (3) and (4) some time must pass. During this time the program is behaving in an undesired way. Ergo, yes is not a counterexample to the GP's claim.
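To make the timing argument concrete, here is a minimal sketch in Python. It only imitates the behavior of yes with a generator; the names `run`, `wanted_lines`, and `reaction_lines` are made up for illustration, and the "reaction" gap between steps (3) and (4) is modeled as a fixed number of extra lines, which is an assumption.

```python
from itertools import islice

def yes(argument="y"):
    """Imitate the Unix `yes` utility: emit the same line forever."""
    while True:
        yield argument

def run(wanted_lines, reaction_lines):
    """Count output emitted before and after the user stops wanting it.

    wanted_lines: lines produced while the output is still desired (step 2).
    reaction_lines: hypothetical number of lines produced between deciding
    to stop (step 3) and the stop taking effect (step 4).
    """
    emitted = list(islice(yes(), wanted_lines + reaction_lines))
    desired = emitted[:wanted_lines]
    undesired = emitted[wanted_lines:]
    return len(desired), len(undesired)

desired, undesired = run(wanted_lines=1000, reaction_lines=3)
print(desired, undesired)  # prints "1000 3"
```

Under this reading, any nonzero reaction time leaves some undesired output, which is the whole of the argument. Whether that counts as the program misbehaving, rather than the user being slow, is the part people disagree on.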
That said, I suspect the other person was actually agreeing with me, and tried to state that software incorporating LLMs would eventually malfunction by stating that this is true for all software. The yes program was an obvious counterexample. It is almost certain that every LLM will eventually generate some undesired output, given that it chooses each next token based on probabilities. I say almost only because I do not know how to prove the conjecture. There is also some ambiguity in what counts as an LLM, as the first L means large and nobody has given a precise definition of large. If you look at literature from several years ago, you will find people saying 100 million parameters is large, while some people these days will refuse to use the term LLM to describe a model of that size.
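The intuition behind "almost certain" can be sketched with a toy calculation. This is not a proof of the conjecture: it assumes each generated token independently has the same small probability `p` of being undesired, which real models do not satisfy, and the value of `p` below is invented purely for illustration.

```python
def prob_at_least_one_undesired(p, n):
    """Probability of at least one undesired token in n tokens.

    Toy model (an assumption, not a property of any real LLM): each token
    is independently undesired with probability p, so the chance that all
    n tokens are fine is (1 - p) ** n.
    """
    return 1.0 - (1.0 - p) ** n

p = 1e-6  # hypothetical per-token probability of an undesired output
for n in (10**3, 10**6, 10**9):
    print(n, prob_at_least_one_undesired(p, n))
```

In this toy model the probability approaches 1 as the number of generated tokens grows, for any p greater than zero. The hard part of the actual conjecture is showing that something like a nonzero p holds for a real model and a real definition of "undesired".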