(qwen.ai)

314 points pretext | 1 comments | 10 Dec 25 16:13 UTC

sosodev ◴[10 Dec 25 16:55 UTC] No.46220123[source]▶

>>46219538 (OP) #

Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at their documentation it doesn't seem like it does.

Are there any open-weight models that do? I'm not talking about speech to text -> LLM -> text to speech, btw; I mean a real voice <-> language model.

edit:

It does support real-time conversation! Has anybody here gotten that to work on local hardware? I'm particularly curious if anybody has run it with a non-nvidia setup.

replies(4): >>46220228 #>>46222544 #>>46223129 #>>46224919 #

potatoman22 ◴[10 Dec 25 22:34 UTC] No.46224919[source]▶

>>46220123 #

From what I can tell, their official chat site doesn't have a native audio -> audio model yet. I like to test this through homophones (e.g. record and record) and asking it to change its pitch or produce sounds.

replies(3): >>46225836 #>>46227448 #>>46227486 #

sosodev ◴[10 Dec 25 23:58 UTC] No.46225836[source]▶

>>46224919 #

Huh, you're right. I tried your test and it clearly can't understand the difference between homophones. That seems to imply they're using some sort of TTS mechanism, which is really weird because Qwen3-Omni claims to support direct audio input into the model. Maybe it's a cost-saving measure?

replies(2): >>46227943 #>>46238306 #

1. sosodev ◴[11 Dec 25 05:19 UTC] No.46227943[source]▶

>>46225836 #

Weirdly, I just tried it again and it seems to understand the difference between record and record just fine. Perhaps if there's heavy demand for voice chat, like after a new release, they load shed by using TTS to a smaller model.

However, it still doesn't seem capable of producing any of the sounds, like laughter, that I would expect from a native voice model.

Qwen3-Omni-Flash-2025-12-01：a next-generation native multimodal large model