
314 points by pretext | 1 comment | source
sosodev ◴[] No.46220123[source]
Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at their documentation, it doesn't seem like it does.

Are there any open-weight models that do? Not talking about speech-to-text -> LLM -> text-to-speech, btw; I mean a real voice <-> language model.

edit:

It does support real-time conversation! Has anybody here gotten that to work on local hardware? I'm particularly curious whether anybody has run it on a non-NVIDIA setup.

replies(4): >>46220228 #>>46222544 #>>46223129 #>>46224919 #
red2awn ◴[] No.46222544[source]
None of the inference frameworks (vLLM/SGLang) support the full model, let alone on non-NVIDIA hardware.
replies(3): >>46223310 #>>46223630 #>>46226911 #
whimsicalism ◴[] No.46226911[source]
Makes sense; I think streaming audio->audio inference is a relatively big lift.
replies(1): >>46229292 #
red2awn ◴[] No.46229292[source]
Correct, it breaks the single-prompt, single-completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for a low-latency response you have to do streaming KV-cache prefill behind a websocket server, roughly as sketched below.
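
A minimal Python sketch of the shape of it. The prefill_chunk/decode_stream interface is hypothetical, and that's exactly the problem: neither vLLM nor SGLang exposes anything like it today.

    import asyncio

    import websockets  # pip install websockets


    class StreamingSession:
        """Per-connection state: a KV cache that grows as audio arrives."""

        def __init__(self):
            self.kv_cache = []  # stand-in for the model's real KV cache

        def prefill_chunk(self, audio_chunk):
            # Run the prompt-side forward pass on just this chunk,
            # appending to the existing cache instead of re-encoding
            # the whole prompt from scratch.
            self.kv_cache.append(audio_chunk)

        async def decode_stream(self):
            # By end-of-speech the cache is already warm, so time-to-
            # first-audio is roughly one decode step. Fake frames stand
            # in for real decoded audio here.
            for i in range(3):
                yield b"frame-%d" % i
                await asyncio.sleep(0)


    async def handle(ws):
        session = StreamingSession()
        async for msg in ws:
            if msg == "<end-of-turn>":  # hypothetical end-of-speech marker
                async for frame in session.decode_stream():
                    await ws.send(frame)
            else:
                session.prefill_chunk(msg)  # prefill while the user speaks


    async def main():
        async with websockets.serve(handle, "localhost", 8765):
            await asyncio.Future()  # serve forever


    asyncio.run(main())

The point is that the prompt-side compute happens while audio is still arriving, so after end-of-speech the first response token costs roughly one decode step rather than a full prefill.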
replies(1): >>46234434 #
whimsicalism ◴[] No.46234434{3}[source]
I imagine you have to start decoding many speculative completions in parallel to have true low latency.
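
Something like this illustrative asyncio sketch (decode_from and the endpoint bookkeeping are made up for the example, not any real framework's API): fork a decode branch per plausible end-of-speech point, then commit whichever branch guessed right.

    import asyncio


    async def decode_from(endpoint: int) -> str:
        # Placeholder for running the decoder against the KV cache
        # truncated at `endpoint` audio chunks; real branches would
        # share the common cache prefix rather than recompute it.
        await asyncio.sleep(0.01)
        return f"reply assuming speech ended at chunk {endpoint}"


    async def speculative_decode(candidates: list[int], actual_end: int) -> str:
        # Start one decode per plausible end-of-speech point, in
        # parallel, *before* we know where the user actually stopped.
        tasks = {ep: asyncio.create_task(decode_from(ep)) for ep in candidates}
        replies = dict(zip(tasks, await asyncio.gather(*tasks.values())))
        # Commit the branch whose guess matched reality; the rest are
        # wasted compute, which is the latency/throughput trade-off.
        if actual_end in replies:
            return replies[actual_end]
        return await decode_from(actual_end)  # no branch guessed right


    print(asyncio.run(speculative_decode([4, 6, 8], actual_end=6)))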