sosodev No.46220123
Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at their documentation, it doesn't seem like it does.

Are there any open-weight models that do? I'm not talking about a speech-to-text -> LLM -> text-to-speech pipeline, btw; I mean a true voice <-> language model.

edit:

It does support real-time conversation! Has anybody here gotten that to work on local hardware? I'm particularly curious whether anybody has run it on a non-NVIDIA setup.
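
edit 2: for anyone else poking at this, turn-based local inference through transformers looks roughly like the sketch below. Untested; the class names, voice name, and generate() kwargs are from my reading of the Qwen3-Omni model card and may not match the current transformers release exactly.

    import soundfile as sf
    from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
    from qwen_omni_utils import process_mm_info  # helper shipped in the Qwen repo

    MODEL = "Qwen/Qwen3-Omni-30B-A3B-Instruct"  # assumed checkpoint id

    model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
        MODEL, dtype="auto", device_map="auto"
    )
    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL)

    conversation = [
        {"role": "user", "content": [{"type": "audio", "audio": "question.wav"}]},
    ]

    # Build the text prompt and pull out the audio the way the model card does
    text = processor.apply_chat_template(
        conversation, add_generation_prompt=True, tokenize=False
    )
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
    inputs = processor(
        text=text, audio=audios, images=images, videos=videos,
        return_tensors="pt", padding=True,
    ).to(model.device)

    # generate() returns text token ids plus a waveform synthesized by the Talker
    text_ids, audio = model.generate(**inputs, speaker="Ethan")

    print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
    if audio is not None:
        sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)

The real-time conversational mode streams these stages chunk-by-chunk instead of waiting for a full turn, which seems to be exactly the part the inference frameworks are still wrestling with.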

red2awn No.46222544
None of the inference frameworks (vLLM/SGLang) support the full model, let alone on non-NVIDIA hardware.
AndreSlavescu No.46223630
We actually deployed working speech-to-speech inference built on top of vLLM as the backbone. The main thing was supporting the "Talker" module, which is currently not supported on vLLM's qwen3-omni branch.

Check it out here: https://models.hathora.dev/model/qwen3-omni
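
For a sense of what the Talker adds, here's an illustrative sketch of the overall pipeline. The names here are hypothetical, not vLLM's API or our actual integration: Qwen3-Omni splits into a "Thinker" (the multimodal LLM producing text tokens and hidden states) and a "Talker" (an autoregressive speech-token model whose output a codec decoder turns back into a waveform). Stock vLLM runs the Thinker; the Talker and codec stages are the part that has to be added.

    from typing import Callable, Iterator
    import numpy as np

    def speech_to_speech(
        mic_chunks: Iterator[np.ndarray],
        thinker: Callable[[np.ndarray], tuple[list[int], np.ndarray]],
        talker: Callable[[list[int], np.ndarray], list[int]],
        codec_decode: Callable[[list[int]], np.ndarray],
    ) -> Iterator[np.ndarray]:
        """Stream audio out while audio is still coming in (hypothetical sketch)."""
        for chunk in mic_chunks:
            text_tokens, hidden = thinker(chunk)         # audio in -> text + states
            speech_tokens = talker(text_tokens, hidden)  # states -> codec tokens
            if speech_tokens:
                yield codec_decode(speech_tokens)        # codec tokens -> waveform out

The hard part in practice isn't this loop; it's wiring the Talker into the serving framework's scheduling and KV-cache machinery so all three stages batch efficiently.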

sosodev No.46224354
Is your work open source?
AndreSlavescu No.46278997
At the moment, no, unfortunately. However, on the open-source side, the vLLM team has since published a separate repository for omni models:

https://github.com/vllm-project/vllm-omni

I haven't yet tested whether it does full speech-to-speech, but it looks like a promising codebase for omni-modal models.
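
In the meantime, stock vLLM can already serve the Thinker for text-only responses to audio input. A rough, untested sketch; the model id and the multimodal chat schema below are assumptions, so check the current vLLM docs for qwen3-omni support:

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-Omni-30B-A3B-Instruct",  # assumed checkpoint id
        max_model_len=32768,
        limit_mm_per_prompt={"audio": 1},
    )

    out = llm.chat(
        [{
            "role": "user",
            "content": [
                {"type": "audio_url", "audio_url": {"url": "file:///tmp/question.wav"}},
                {"type": "text", "text": "Answer the question in the clip."},
            ],
        }],
        SamplingParams(max_tokens=256),
    )
    # Text only: without the Talker there is no audio reply
    print(out[0].outputs[0].text)

The missing piece relative to the deployment above is the Talker plus codec stage that turns the response back into audio, which is presumably what vllm-omni is meant to cover.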