
314 points by pretext | 2 comments
sim04ful:
The main issue I'm facing with real-time responses (speech output) is how to separate non-diegetic outputs (e.g. thinking, structured outputs) from outputs meant to be heard by the end user.

I'm curious how anyone has solved this.

artur44:
A simple way is to split the model's output stream before TTS. Reasoning/structured tokens go into one bucket and user-facing text into another; only the second bucket is synthesized. Most "thinking out loud" issues come from feeding the whole stream straight into the audio pipeline.
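
A minimal sketch of that split, assuming the reasoning is wrapped in <think>...</think> markers and that send_to_tts()/log_reasoning() are whatever sinks your pipeline already has (the markers and both callbacks are illustrative, not tied to any particular model or API):

    # Route a streamed model output so only user-facing text reaches TTS.
    # <think>...</think>, send_to_tts() and log_reasoning() are assumptions
    # for illustration, not a real API.
    THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

    def route_stream(chunks, send_to_tts, log_reasoning):
        in_think = False
        buf = ""
        for chunk in chunks:
            buf += chunk
            while True:
                if in_think:
                    end = buf.find(THINK_CLOSE)
                    if end == -1:
                        break  # wait for the rest of the think block
                    if end:
                        log_reasoning(buf[:end])
                    buf = buf[end + len(THINK_CLOSE):]
                    in_think = False
                else:
                    start = buf.find(THINK_OPEN)
                    if start == -1:
                        # hold back a tail that might be a partial marker
                        safe = max(0, len(buf) - len(THINK_OPEN))
                        if safe:
                            send_to_tts(buf[:safe])
                            buf = buf[safe:]
                        break
                    if start:
                        send_to_tts(buf[:start])
                    buf = buf[start + len(THINK_OPEN):]
                    in_think = True
        if buf:
            (log_reasoning if in_think else send_to_tts)(buf)

The same routing works when the non-diegetic part is a JSON tool call rather than a think block: detect its delimiter, divert it, and only let conversational text reach the synthesizer.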
pugio:
There is no TTS here. It's a native audio-output model that emits audio tokens directly. (At least, that's how the other real-time models work. Maybe I've misunderstood the Qwen-Omni architecture.)
artur44:
True, but even with native audio-token models you still need to split the model's output channels. Reasoning/internal tokens shouldn't go into the audio stream; only user-facing content should be emitted as audio. The principle is the same whether the last step is TTS or audio-token generation.
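
For a native audio-output model the split happens at the event/channel level rather than on text. A rough sketch, assuming the serving layer tags each streamed delta with a type field and base64 audio payloads (the event shape is made up for illustration, not any specific vendor schema):

    # Channel splitting for a native audio-output model. The event schema
    # ("audio" / "text" / "reasoning" types) is an assumption.
    import base64

    def handle_stream(events, audio_player, transcript, reasoning_log):
        for event in events:
            kind = event.get("type")
            if kind == "audio":
                # user-facing speech: decode and play immediately
                audio_player.write(base64.b64decode(event["data"]))
            elif kind == "text":
                # user-facing transcript that mirrors the spoken output
                transcript.append(event["text"])
            elif kind == "reasoning":
                # internal tokens: logged, never routed to the speaker
                reasoning_log.append(event["text"])
            else:
                # tool calls / structured output also stay out of the audio path
                reasoning_log.append(str(event))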
regularfry:
There's an assumption there that the audio stream contains an equivalent of the <think>/</think> tokens. Every reason to think it should, but without seeing the tokeniser config it's a bit of a guess.
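
One way to turn the guess into a check is to pull the tokenizer config from the model repo and look for think-style delimiters among the registered special tokens. A sketch; the repo id is a placeholder to swap for the actual Qwen-Omni checkpoint:

    # Check whether think-style markers exist as registered special tokens.
    # The repo id below is a placeholder, not a real checkpoint name.
    import json
    from huggingface_hub import hf_hub_download

    path = hf_hub_download("Qwen/Qwen3-Omni-PLACEHOLDER", "tokenizer_config.json")
    with open(path) as f:
        cfg = json.load(f)

    specials = [t.get("content") for t in cfg.get("added_tokens_decoder", {}).values()]
    print([s for s in specials if s and "think" in s.lower()])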