
314 points | pretext | 1 comment | HN request time: 0.24s
sosodev ◴[] No.46220123[source]
Does Qwen3-Omni support real-time conversation like GPT-4o? From their documentation, it doesn't seem like it does.

Are there any open-weight models that do? To be clear, I'm not talking about a speech-to-text -> LLM -> text-to-speech pipeline; I mean a real voice <-> language model.

edit:

It does support real-time conversation! Has anybody here gotten that to work on local hardware? I'm particularly curious if anybody has run it with a non-nvidia setup.

replies(4): >>46220228 #>>46222544 #>>46223129 #>>46224919 #
potatoman22 ◴[] No.46224919[source]
From what I can tell, their official chat site doesn't have a native audio -> audio model yet. I like to test this with heteronyms (e.g. "record" and "record", which are spelled the same but pronounced differently) and by asking it to change its pitch or produce sounds.
replies(3): >>46225836 #>>46227448 #>>46227486 #
sosodev ◴[] No.46225836[source]
Huh, you're right. I tried your test and it clearly can't tell the heteronyms apart. That seems to imply there's a speech-to-text transcription step in front of the model, which is really weird, because Qwen3-Omni claims to support direct audio input into the model. Maybe it's a cost-saving measure?
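To make the reasoning behind the test concrete, here's a toy sketch (hypothetical data and a stand-in `transcribe` function, not Qwen's actual pipeline): a speech-to-text front end keeps only the spelling of a word and drops the prosody, so both spoken forms of a heteronym collapse to the same token before the LLM ever sees them. A native audio -> audio model, which consumes the waveform or audio tokens directly, would retain the stress difference the test probes.

```python
# Toy model of a cascaded pipeline's input stage. The two dicts stand in
# for two spoken forms of the heteronym "record" (noun vs. verb stress).
noun_audio = {"word": "record", "stress": "RE-cord"}   # "a record"
verb_audio = {"word": "record", "stress": "re-CORD"}   # "to record"

def transcribe(audio: dict) -> str:
    """Stand-in STT step: keeps only the spelling, discards prosody."""
    return audio["word"]

# After transcription the two inputs are indistinguishable, so a
# text-only LLM downstream cannot react to which one was spoken.
assert transcribe(noun_audio) == transcribe(verb_audio) == "record"
```

This is why failing the test suggests a transcription step rather than direct audio input: the information needed to answer correctly is already gone before the language model runs.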
replies(2): >>46227943 #>>46238306 #
potatoman22 ◴[] No.46238306[source]
To be fair, discerning heteronyms might just be a gap in its training.