The main issue I'm facing with realtime responses (speech output) is how to separate non-diegetic outputs (e.g., thinking, structured outputs) from the output that is meant to be heard by the end user.
I'm curious how others have solved this.
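To make the problem concrete, here is a minimal sketch of the separation in question, assuming the model's output already arrives as chunks tagged with a channel. The `Chunk` type, the channel names, and the `route` function are all hypothetical and do not correspond to any particular realtime API. The sketch does not address how to get reliable channel tags out of a speech model, which is the open part of the question.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

# Hypothetical channel labels; a real realtime API may name or expose these differently.
SPOKEN = "spoken"
THINKING = "thinking"
STRUCTURED = "structured"


@dataclass
class Chunk:
    channel: str  # which channel the chunk belongs to (assumed to be provided upstream)
    text: str     # the chunk's text content


def route(chunks: Iterable[Chunk], side_channel: List[Chunk]) -> Iterator[str]:
    """Yield only text meant to be spoken; collect every other chunk in side_channel."""
    for chunk in chunks:
        if chunk.channel == SPOKEN:
            yield chunk.text
        else:
            # Thinking and structured output never reach the speech path.
            side_channel.append(chunk)


# Made-up example stream mixing all three channels.
stream = [
    Chunk(THINKING, "User wants the weather; call the tool."),
    Chunk(SPOKEN, "Let me check that for you."),
    Chunk(STRUCTURED, '{"tool": "weather", "city": "Oslo"}'),
    Chunk(SPOKEN, "It is sunny in Oslo."),
]
hidden: List[Chunk] = []
to_speak = list(route(stream, hidden))
```

Here `to_speak` would be what gets handed to speech synthesis, while `hidden` holds the thinking and structured chunks for the application to handle separately.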
replies(1):
I'm curious how anyone has solved this.