That comment made sense 3 years ago. LLMs
already solved "intuitive understanding", and the realtime multimodal variants (e.g. the thing behind "Advanced Voice" in ChatGPT app) handle tone of voice in both directions. As for nonverbal cues, I don't know yet - I got live video enabled in ChatGPT only few days ago and didn't have time to test it, but I would be surprised if it couldn't read the basics of body language at this point.
Talking to a computer still sucks as an user interface - not because a computer can't communicate on multiple channels the way people do, as it can do it now too. It sucks for the same reason talking to people sucks as an user interface - because the kind of tasks we use computers for (and that aren't just talking with/to/at other people via electronic means) are better handle by doing than by talking about them. We need an interface to operate a tool, not an interface to an agent that operates a tool for us.
As an example, consider driving (as in, realtime control - not just "getting from point A to B"): a chat interface to driving would suck just as badly as being a backseat driver sucks for both people in the car. In contrast, a steering wheel, instead of being a bandwidth-limiting indirection, is an anti-indirection - not only it lets you control the machine with your body, the control is direct enough that over time your brain learns to abstract it away, and the car becomes an extension of your body. We need more of tangible interfaces like that with computers.
The steering wheel case, of course, would fail with "AI-level smarts" - but that still doesn't mean we should embrace talking to computers. A good analogy is dance - it's an interaction between two independently smart agents exploring an activity together, and as they do it enough, it becomes fluid.
So dance, IMO, is the steering wheel analogy for AI-powered interfaces, and that is the space we need to explore more.