314 points by pretext | 27 comments
1. sosodev ◴[] No.46220123[source]
Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at their documentation it doesn't seem like it does.

Are there any open weight models that do? Not talking about speech-to-text -> LLM -> text-to-speech, btw; I mean a real voice <-> language model.

edit:

It does support real-time conversation! Has anybody here gotten that to work on local hardware? I'm particularly curious if anybody has run it with a non-NVIDIA setup.

replies(4): >>46220228 #>>46222544 #>>46223129 #>>46224919 #
2. dsrtslnd23 ◴[] No.46220228[source]
It seems to be able to do native speech-to-speech.
replies(1): >>46220381 #
3. sosodev ◴[] No.46220381[source]
It does for sure. I did some more digging and it does real-time too. That's fascinating.
4. red2awn ◴[] No.46222544[source]
None of the inference frameworks (vLLM/SGLang) support the full model, let alone on non-NVIDIA hardware.
replies(3): >>46223310 #>>46223630 #>>46226911 #
5. ivape ◴[] No.46223129[source]
That's exciting. I doubt there are any polished local voice chat apps yet that you can easily plug this into (I doubt the user experience is "there" yet). Even stuff like Silly Tavern is near unusable; lots of work to be done on the local front. Local voice models are what's going to enable that whole Minority Report workflow soon enough, especially if commands and intent are determined at the local level and the meat of the prompt is handled by a larger remote model (rough sketch below).

This is the part of programming that I think is the new field. There will be tons of work for those who can build the new workflows, which will need to be primarily natural-language driven.
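
A rough sketch of that local/remote split. Everything here is hypothetical: the command set, the keyword matcher standing in for a small on-device model, and the remote endpoint are all placeholders, just to illustrate the shape of the workflow.

    # Hypothetical sketch: a small local step decides intent; simple commands are
    # executed on-device, anything heavier is forwarded to a larger remote model.
    import json
    import urllib.request

    LOCAL_COMMANDS = {"lights_on", "lights_off", "play_music"}

    def classify_intent_locally(transcript: str) -> str:
        # Stand-in for a small on-device model; plain keyword matching here.
        t = transcript.lower()
        if "light" in t:
            return "lights_on" if " on" in t else "lights_off"
        if "play" in t:
            return "play_music"
        return "general_query"

    def handle_utterance(transcript: str) -> str:
        intent = classify_intent_locally(transcript)
        if intent in LOCAL_COMMANDS:
            # Simple command: handled entirely locally, no network round trip.
            return f"[local] executing {intent}"
        # The "meat of the prompt" goes to a bigger remote model
        # (OpenAI-compatible endpoint assumed purely for illustration).
        payload = json.dumps({
            "model": "big-remote-model",
            "messages": [{"role": "user", "content": transcript}],
        }).encode()
        req = urllib.request.Request(
            "https://example.invalid/v1/chat/completions", data=payload,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]

    print(handle_utterance("turn the lights on"))  # stays local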

replies(1): >>46223285 #
6. sosodev ◴[] No.46223285[source]
I did find this app: https://github.com/gabber-dev/gabber

The creator posted a little demo of it working with Qwen3 Omni that is quite impressive: https://www.youtube.com/watch?v=5DBFVe3cLto

He didn't include any details about how the model was running, though.

7. sosodev ◴[] No.46223310[source]
That's unfortunate but not too surprising. This type of model is very new to the local hosting space.
8. AndreSlavescu ◴[] No.46223630[source]
We actually deployed working speech-to-speech inference that builds on top of vLLM as the backbone. The main thing was supporting the "Talker" module, which is currently not supported on the qwen3-omni branch of vLLM.

Check it out here: https://models.hathora.dev/model/qwen3-omni

replies(2): >>46223885 #>>46224354 #
9. red2awn ◴[] No.46223885{3}[source]
Nice work. Are you working on streaming input/output?
replies(1): >>46223956 #
10. AndreSlavescu ◴[] No.46223956{4}[source]
Yeah, that's something we currently support. Feel free to try the platform out! No cost to you for now, you just need a valid email to sign up on the platform.
replies(1): >>46229560 #
11. sosodev ◴[] No.46224354{3}[source]
Is your work open source?
replies(1): >>46278997 #
12. potatoman22 ◴[] No.46224919[source]
From what I can tell, their official chat site doesn't have a native audio -> audio model yet. I like to test this through homophones (e.g. record and record) and asking it to change its pitch or produce sounds.
replies(3): >>46225836 #>>46227448 #>>46227486 #
13. sosodev ◴[] No.46225836[source]
Huh, you're right. I tried your test and it clearly can't understand the difference between homophones. That seems to imply they're using some sort of speech-to-text -> LLM -> TTS pipeline, which is really weird because Qwen3-Omni claims to support direct audio input into the model. Maybe it's a cost-saving measure?
replies(2): >>46227943 #>>46238306 #
14. whimsicalism ◴[] No.46226911[source]
Makes sense, I think streaming audio->audio inference is a relatively big lift.
replies(1): >>46229292 #
15. djtango ◴[] No.46227448[source]
Is record a homophone? At least in the UK we use different pronunciations for the meanings. Re-cord for the verb, rec-ord for the noun.
replies(1): >>46238269 #
16. dragonwriter ◴[] No.46227486[source]
“Record and record”, if you mean the verb for persisting something and the noun for the thing persisted, are heteronyms (homographs which are not homophones), which incidentally is also what you would probably want to test for what you're describing here: distinguishing homophones would test the use of context to understand meaning, but wouldn't test anything about whether the logic is working directly on audio or only on text processed from audio, whereas failing to distinguish heteronyms is suggestive of processing occurring on text, not audio directly.
replies(2): >>46227622 #>>46238285 #
17. bakeman ◴[] No.46227622{3}[source]
There are homophones of “record”, such as:

“He’s on record saying he broke the record for spinning a record.”

replies(1): >>46227911 #
18. dragonwriter ◴[] No.46227911{4}[source]
True.

OTOH, my point still stands: the thing being suggested isn't testable by seeing whether or not the system can distinguish homophones, but it might be by seeing whether or not it distinguishes heteronyms. (The speculation that the record/record distinction intended was actually a pair of heteronyms, and that the error was merely the use of the word “homophone” in place of “heteronym” rather than in the basic logic of the comment, is somewhat tangential to the main point.)

19. sosodev ◴[] No.46227943{3}[source]
Weirdly, I just tried it again and it seems to understand the difference between record and record just fine. Perhaps when there's heavy demand for voice chat, like right after a new release, they shed load by falling back to a text pipeline with a smaller model.

However, it still doesn't seem capable of producing any of the sounds, like laughter, that I would expect from a native voice model.

20. red2awn ◴[] No.46229292{3}[source]
Correct, it breaks the single-prompt, single-completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for low-latency responses you have to do streaming KV cache prefill with a websocket server.
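
Roughly, the shape of it looks like the sketch below; load_model, model.prefill, and model.decode_stream are stand-ins rather than any real framework's API. The idea is that the server keeps appending incoming audio chunks to the prompt's KV cache while the user is still speaking, then starts decoding the moment the turn ends:

    # Hypothetical sketch only: the model calls are placeholders for whatever the
    # serving framework exposes. The prompt's KV cache is built chunk by chunk
    # during the user's turn, so decoding can start almost immediately afterwards.
    import asyncio
    import websockets  # pip install websockets

    async def handle(ws):
        model = load_model()              # placeholder for the real model loader
        kv_cache = model.new_kv_cache()   # reusable per-connection cache (assumed API)
        async for message in ws:
            if message == b"<end_of_turn>":
                # User stopped speaking: the prompt is already prefilled, start decoding.
                for frame in model.decode_stream(past=kv_cache):
                    await ws.send(frame)  # stream text/audio frames back as they decode
            else:
                # Append this audio chunk to the existing KV cache instead of
                # recomputing the whole prompt when the turn ends.
                kv_cache = model.prefill(message, past=kv_cache)

    async def main():
        async with websockets.serve(handle, "0.0.0.0", 8765):
            await asyncio.Future()        # serve forever

    if __name__ == "__main__":
        asyncio.run(main())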
replies(1): >>46234434 #
21. valleyer ◴[] No.46229560{5}[source]
I tried this out, and it's not passing the record (n.) vs. record (v.) test mentioned elsewhere in this thread. (I can ask it to repeat one, and it often repeats the other.) Am I not enabling the speech-to-speech-ness somehow?
replies(1): >>46278969 #
22. whimsicalism ◴[] No.46234434{4}[source]
I imagine you have to start decoding many speculative completions in parallel to have true low latency.
23. potatoman22 ◴[] No.46238269{3}[source]
I was mistaken about what homophone means!
24. potatoman22 ◴[] No.46238285{3}[source]
Ah I meant heteronyms. Thanks!
25. potatoman22 ◴[] No.46238306{3}[source]
To be fair, discerning heteronyms might just be a gap in its training.
26. AndreSlavescu ◴[] No.46278969{6}[source]
From my understanding of the above problem, this would be something to do with the model weights. Have you tested this with the transformers inference baseline that is shown on huggingface?

In our deployment, we do not actually tune the model in any way; this is all just the base instruct model provided on Hugging Face:

https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
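
For reference, the transformers baseline looks roughly like the sketch below; the class and helper names are as I recall them from the model card, so double-check the card's snippet before running, and the .wav path is just a hypothetical clip for the record (n.) vs record (v.) test from this thread:

    # Rough sketch of the HF transformers baseline for Qwen3-Omni (names from
    # memory of the model card; may need adjusting). Feed one audio turn in,
    # get text token ids plus a waveform back.
    import soundfile as sf
    from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
    from qwen_omni_utils import process_mm_info  # pip install qwen-omni-utils

    MODEL = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
        MODEL, dtype="auto", device_map="auto")
    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL)

    conversation = [{"role": "user", "content": [
        {"type": "audio", "audio": "record_noun_vs_verb.wav"}]}]  # hypothetical test clip
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
    inputs = processor(text=text, audio=audios, images=images, videos=videos,
                       return_tensors="pt", padding=True).to(model.device)

    # Generates both text token ids and a waveform when the Talker is enabled.
    text_ids, audio = model.generate(**inputs)
    print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
    sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)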

As for the potential concern around conversation turns: our platform is designed for one-off record -> response flows, but via the API you can build your own conversation agent that uses the model.

27. AndreSlavescu ◴[] No.46278997{4}[source]
At the moment, no, unfortunately. However, as far as open source alternatives go, the vLLM team has now published a separate repository for omni models:

https://github.com/vllm-project/vllm-omni

I have not yet tested whether this does full speech-to-speech, but it seems like a promising workspace for omni-modal models.