Most active commenters

red2awn(9)
sosodev(8)
esafak(5)
plipt(4)
vessenes(4)
AndreSlavescu(4)
potatoman22(4)
gardnr(3)
(3)

Popular/hot comments

>>46220742 #
>>46219993 #
>>46220168 #
>>46220123 #
>>46220026 #
>>46220493 #
>>46221096 #
>>46222544 #
>>46224919 #

Qwen3-Omni-Flash-2025-12-01：a next-generation native multimodal large model

(qwen.ai)

1. dvh ◴[10 Dec 25 16:45 UTC] No.46219993[source]▶

>>46219538 (OP) #

I asked: "How many resistors are used in fuzzhugger phantom octave guitar pedal?". It replied 29 resistors and provided a long list. Answer is 2 resistors: https://tagboardeffects.blogspot.com/2013/04/fuzzhugger-phan...

replies(5): >>46220026 #>>46220132 #>>46220178 #>>46222417 #>>46226884 #

2. iFire ◴[10 Dec 25 16:48 UTC] No.46220026[source]▶

>>46219993 #

> How many resistors are used in fuzzhugger phantom octave guitar pedal?

Weird, as someone not having a database of the web, I wouldn't be able to calculate either result.

replies(3): >>46220036 #>>46220141 #>>46220721 #

3. iFire ◴[10 Dec 25 16:49 UTC] No.46220036{3}[source]▶

>>46220026 #

I tend to pick things where I think the answer is in the introduction material like exams that test what was taught.

4. mettamage ◴[10 Dec 25 16:49 UTC] No.46220038[source]▶

>>46219538 (OP) #

I wonder if with that music analysis mode, you can also make your own synths

5. sosodev ◴[10 Dec 25 16:55 UTC] No.46220123[source]▶

>>46219538 (OP) #

Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at their documentation it doesn't seem like it does.

Are there any open weight models that do? Not talking about speech to text -> LLM -> text to speech btw I mean a real voice <-> language model.

edit:

It does support real-time conversation! Has anybody here gotten that to work on local hardware? I'm particularly curious if anybody has run it with a non-nvidia setup.

replies(4): >>46220228 #>>46222544 #>>46223129 #>>46224919 #

6. esafak ◴[10 Dec 25 16:56 UTC] No.46220132[source]▶

>>46219993 #

This is just trivia. I would not use it to test computers -- or humans.

replies(2): >>46220344 #>>46221915 #

7. dvh ◴[10 Dec 25 16:57 UTC] No.46220141{3}[source]▶

>>46220026 #

"I don't know" would be perfectly reasonable answer

replies(1): >>46222088 #

8. binsquare ◴[10 Dec 25 16:58 UTC] No.46220168[source]▶

>>46219538 (OP) #

Does anyone else find that there's hard to pin down reason of life-lessness in the speech of these voice models?

Especially in the fruit pricing portion of the video for this model. Sounds completely normal but I can immediately tell it is ai. Maybe it's intonation or the overly stable rate of speech?

replies(5): >>46220275 #>>46220301 #>>46220340 #>>46220359 #>>46222792 #

9. brookst ◴[10 Dec 25 16:58 UTC] No.46220178[source]▶

>>46219993 #

Where did you try it? I don’t see this model listed in the linked Qwen chat.

10. dsrtslnd23 ◴[10 Dec 25 17:01 UTC] No.46220228[source]▶

>>46220123 #

it seems to be able to do native speech-speech

replies(1): >>46220381 #

11. rarisma ◴[10 Dec 25 17:01 UTC] No.46220235[source]▶

>>46219538 (OP) #

GPT4o in the charts is crazy.

replies(1): >>46220365 #

12. colechristensen ◴[10 Dec 25 17:04 UTC] No.46220275[source]▶

>>46220168 #

I'm perfectly ok with and would prefer an AI "accent".

13. esafak ◴[10 Dec 25 17:06 UTC] No.46220301[source]▶

>>46220168 #

> Sounds completely normal but I can immediately tell it is ai.

Maybe that's a good thing?

14. sosodev ◴[10 Dec 25 17:09 UTC] No.46220340[source]▶

>>46220168 #

I think it's because they've crammed vision, audio, multiple voices, prosody control, multiple languages, etc into just 30 billion parameters.

I think ChatGPT has the most lifelike speech with their voice models. They seem to have invested heavily in that area while other labs focused elsewhere.

15. parineum ◴[10 Dec 25 17:09 UTC] No.46220344{3}[source]▶

>>46220132 #

Everything is just trivia until you have a use for the answer.

OP provided a we link with the answer, aren't these models supposed to be trained on all of that data?

replies(2): >>46220437 #>>46221103 #

16. Lapel2742 ◴[10 Dec 25 17:10 UTC] No.46220359[source]▶

>>46220168 #

IMHO it's not lifeless. It's just not overly emotional. I definitely prefer it that way. I do not want the AI to be excited. It feels so contrived.

On the video itself: Interesting, but "ideal" was pronounced wrong in German. For a promotional video, they should have checked that with native speakers. On the other hand its at least honest.

replies(1): >>46221482 #

17. BoorishBears ◴[10 Dec 25 17:11 UTC] No.46220365[source]▶

>>46220235 #

Why? gpt-realtime is finalized gpt-4o. Gemini Live is still 2.5.

Not their fault frontier labs are letting their speech to speech offerings languish.

18. sosodev ◴[10 Dec 25 17:12 UTC] No.46220381{3}[source]▶

>>46220228 #

It does for sure. I did some more digging and it does real-time too. That's fascinating.

19. esafak ◴[10 Dec 25 17:16 UTC] No.46220437{4}[source]▶

>>46220344 #

There is nothing useful you can do with this information. You might as well memorize the phone book.

The model has a certain capacity -- quite limited in this case -- so there is an opportunity cost in learning one thing over another. That's why it is important to train on quality data; things you can build on top of.

replies(1): >>46226858 #

20. banjoe ◴[10 Dec 25 17:20 UTC] No.46220493[source]▶

>>46219538 (OP) #

Wow, crushing 2.5 Flash on every benchmark is huge. Time to move all of my LLM workloads to a local GPU rig.

replies(3): >>46220593 #>>46223561 #>>46229791 #

21. embedding-shape ◴[10 Dec 25 17:27 UTC] No.46220593[source]▶

>>46220493 #

Just remember to benchmark it yourself first with you private task collection, so you can actually measure them against each other. Pretty much any public benchmark is unreliable at this moment, and making model choices based on other's benchmarks is bound to leave you disappointed.

replies(1): >>46222124 #

22. kaoD ◴[10 Dec 25 17:36 UTC] No.46220721{3}[source]▶

>>46220026 #

> as someone not having a database of the web, I wouldn't be able to calculate either result

And that's how I know you're not an LLM!

23. gardnr ◴[10 Dec 25 17:37 UTC] No.46220742[source]▶

>>46219538 (OP) #

This is a 30B parameter MoE with 3B active parameters and is the successor to their previous 7B omni model. [1]

You can expect this model to have similar performance to the non-omni version. [2]

There aren't many open-weights omni models so I consider this a big deal. I would use this model to replace the keyboard and monitor in an application while doing the heavy lifting with other tech behind the scenes. There is also a reasoning version, which might be a bit amusing in an interactive voice chat if it pronounces the thinking tokens while working through to a final answer.

1. https://huggingface.co/Qwen/Qwen2.5-Omni-7B

2. https://artificialanalysis.ai/models/qwen3-30b-a3b-instruct

replies(7): >>46221003 #>>46221130 #>>46221363 #>>46221587 #>>46222499 #>>46222576 #>>46229484 #

24. ◴[10 Dec 25 17:38 UTC] No.46220745[source]▶

>>46219538 (OP) #

25. Aissen ◴[10 Dec 25 17:43 UTC] No.46220817[source]▶

>>46219538 (OP) #

Is this a new proprietary model?

replies(1): >>46221113 #

26. sim04ful ◴[10 Dec 25 17:51 UTC] No.46220935[source]▶

>>46219538 (OP) #

The main issue I'm facing with realtime responses (speech output) is how to separate non-diegetic outputs (e.g thinking, structured outputs) from outputs meant to be heard by the end user.

I'm curious how anyone has solved this

replies(1): >>46222239 #

27. stevenhuang ◴[10 Dec 25 17:53 UTC] No.46220968[source]▶

>>46219538 (OP) #

Wayback for those that can't reach https://web.archive.org/web/20251210164048/https://qwen.ai/b...

28. terhechte ◴[10 Dec 25 17:54 UTC] No.46220981[source]▶

>>46219538 (OP) #

Is there a way to run these Omni models on a Macbook quantized via GGUF or MLX? I know I can run it in LMStudio or Llama.cpp but they don't have streaming microphone support or streaming webcam support.

Qwen usually provides example code in Python that requires Cuda and a non-quantized model. I wonder if there is by now a good open source project to support this use case?

replies(2): >>46222558 #>>46222569 #

29. gardnr ◴[10 Dec 25 17:56 UTC] No.46221003[source]▶

>>46220742 #

I can't find the weights for this new version anywhere. I checked modelscope and huggingface. It looks like they may have extended the context window to 200K+ tokens but I can't find the actual weights.

replies(1): >>46221071 #

30. pythux ◴[10 Dec 25 18:01 UTC] No.46221071{3}[source]▶

>>46221003 #

They link to: https://huggingface.co/collections/Qwen/qwen3-omni-68d100a86... from the blog post but it does seem like this redirects to their main space on HF so maybe they didn't yet make the model public?

31. aschobel ◴[10 Dec 25 18:03 UTC] No.46221096[source]▶

>>46219538 (OP) #

Looks to be API only. Bummer.

replies(3): >>46221222 #>>46222976 #>>46262907 #

32. DennisP ◴[10 Dec 25 18:03 UTC] No.46221103{4}[source]▶

>>46220344 #

Just because it's in the training data doesn't mean the model can remember it. The parameters total 60 gigabytes, there's only so much trivia that can fit in there so it has to do lossy compression.

33. ◴[10 Dec 25 18:04 UTC] No.46221113[source]▶

>>46220817 #

34. olafura ◴[10 Dec 25 18:05 UTC] No.46221130[source]▶

>>46220742 #

Looks like it's not open source: https://www.alibabacloud.com/help/en/model-studio/qwen-omni#...

replies(1): >>46221209 #

35. coder543 ◴[10 Dec 25 18:11 UTC] No.46221209{3}[source]▶

>>46221130 #

No... that website is not helpful. If you take it at face value, it is claiming that the previous Qwen3-Omni-Flash wasn't open either, but that seems wrong? It is very common for these blog posts to get published before the model weights are uploaded.

replies(1): >>46222530 #

36. ◴[10 Dec 25 18:12 UTC] No.46221222[source]▶

>>46221096 #

37. tensegrist ◴[10 Dec 25 18:22 UTC] No.46221363[source]▶

>>46220742 #

> There is also a reasoning version, which might be a bit amusing in an interactive voice chat if it pronounces the thinking tokens while working through to a final answer.

last i checked (months ago) claude used to do this

38. nunodonato ◴[10 Dec 25 18:29 UTC] No.46221482{3}[source]▶

>>46220359 #

I hate with a passion the over-americanized "accent" of chatgpt voices. Give me a bland one any day of the week

replies(1): >>46228648 #

39. andy_xor_andrew ◴[10 Dec 25 18:37 UTC] No.46221587[source]▶

>>46220742 #

> This is a 30B parameter MoE with 3B active parameters

Where are you finding that info? Not saying you're wrong; just saying that I didn't see that specified anywhere in the linked page, or on their HF.

replies(2): >>46222506 #>>46256154 #

40. devinprater ◴[10 Dec 25 18:57 UTC] No.46221906[source]▶

>>46219538 (OP) #

Wow, just 32B? This could almost run on a good device with 64 GB RAM. Once it gets to Ollama I'll have to see just what I can get out of this.

replies(2): >>46222308 #>>46222391 #

41. littlestymaar ◴[10 Dec 25 18:58 UTC] No.46221915{3}[source]▶

>>46220132 #

It's good way to assess the model with respect to hallucinations though.

I don't think a model should know the answer, but it must be able to know that it doesn't know if you want to use it reliably.

replies(1): >>46222070 #

42. esafak ◴[10 Dec 25 19:07 UTC] No.46222070{4}[source]▶

>>46221915 #

No model is good at this yet. I'd expect the flagships to solve the first.

43. MaxikCZ ◴[10 Dec 25 19:08 UTC] No.46222088{4}[source]▶

>>46220141 #

I feel like theres a time in near future where LLMs will be too cautious to answer any questions they arent sure about, and most of the human effort will go into pleading the LLM to at least try to give an answer, which will almost always be correct anyways.

replies(2): >>46223537 #>>46224362 #

44. MaxikCZ ◴[10 Dec 25 19:10 UTC] No.46222124{3}[source]▶

>>46220593 #

This. Last benchmarks of DSv3.2spe hinted at beating basically everything, yet in my testing even sonnet is miles ahead both in terms of speed and accuracy

45. artur44 ◴[10 Dec 25 19:19 UTC] No.46222239[source]▶

>>46220935 #

A simple way is to split the model’s output stream before TTS. Reasoning/structured tokens go into one bucket, actual user-facing text into another. Only the second bucket is synthesized. Most thinking out loud issues come from feeding the whole stream directly into audio.

replies(1): >>46223160 #

46. plipt ◴[10 Dec 25 19:23 UTC] No.46222308[source]▶

>>46221906 #

I see that their HuggingFace link goes to some Qwen3-Omni-30B-A3B models that show a last updated date of September

The benchmark table in their article shows Qwen3-Omni-Flash-2025-12-01 (and the previous Flash) as beating Qwen3-235B-A22B. How is that possible if this is only a 30B-A3B model? Also confusing how that comparison column starts out with one model but changes them as you descend down the table.

I don't see any FLASH variant listed on their Hugginface. Am i just missing it or are these specifying a model only used for their API service and there are no open weights to download?

47. apexalpha ◴[10 Dec 25 19:28 UTC] No.46222391[source]▶

>>46221906 #

I run these on a 48gb Mac because of the universal ram.

48. strangattractor ◴[10 Dec 25 19:30 UTC] No.46222417[source]▶

>>46219993 #

Maybe it thinks some of those 29 are in series:)

49. red2awn ◴[10 Dec 25 19:35 UTC] No.46222499[source]▶

>>46220742 #

This is a stack of models:

- 650M Audio Encoder

- 540M Vision Encoder

- 30B-A3B LLM

- 3B-A0.3B Audio LLM

- 80M Transformer/200M ConvNet audio token to waveform

This is a closed source weight update to their Qwen3-Omni model. They had a previous open weight release Qwen/Qwen3-Omni-30B-A3B-Instruct and a closed version Qwen3-Omni-Flash.

You basically can't use this model right now since none of the open source inference framework have the model fully implemented. It works on transformers but it's extremely slow.

50. plipt ◴[10 Dec 25 19:36 UTC] No.46222506{3}[source]▶

>>46221587 #

The link[1] at the top of their article to HuggingFace goes to some models named Qwen3-Omni-30B-A3B that were last updated in September. None of them have "Flash" in the name.

The benchmark table shows this Flash model beating their Qwen3-235B-A22B. I dont see how that is possible if it is a 30B-A3B model.

I don't see a mention of a parameter count anywhere in the article. Do you? This may not be an open weights model.

This article feels a bit deceptive

1: https://huggingface.co/collections/Qwen/qwen3-omni

51. red2awn ◴[10 Dec 25 19:37 UTC] No.46222530{4}[source]▶

>>46221209 #

The previous -Flash weight is closed source. They do have weights for the original model that is slightly behind in performance https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct

replies(1): >>46222889 #

52. red2awn ◴[10 Dec 25 19:38 UTC] No.46222544[source]▶

>>46220123 #

None of inference frameworks (vLLM/SGLang) supports the full model, let alone non-nvidia.

replies(3): >>46223310 #>>46223630 #>>46226911 #

53. mobilio ◴[10 Dec 25 19:39 UTC] No.46222558[source]▶

>>46220981 #

Yes - there is a way: https://github.com/ggml-org/whisper.cpp

replies(1): >>46223198 #

54. tgtweak ◴[10 Dec 25 19:39 UTC] No.46222569[source]▶

>>46220981 #

You can probably follow the vLLM instructions for omni here, then use the included voice demo html to interface with it:

https://github.com/QwenLM/Qwen3-Omni#vllm-usage

https://github.com/QwenLM/Qwen3-Omni?tab=readme-ov-file#laun...

55. plipt ◴[10 Dec 25 19:40 UTC] No.46222576[source]▶

>>46220742 #

I dont think the Flash model discussed in the article is 30B

Their benchmark table shows it beating Qwen3-235B-A22B

Does "Flash" in the name of a Qwen model indicate a model-as-a-service and not open weights?

replies(1): >>46222718 #

56. red2awn ◴[10 Dec 25 19:49 UTC] No.46222718{3}[source]▶

>>46222576 #

Flash is a closed weight version of https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct (it is 30B but with addtional training on top of the open weight release). They deploy the flash version on Qwen's own chat.

replies(1): >>46222880 #

57. vessenes ◴[10 Dec 25 19:52 UTC] No.46222768[source]▶

>>46219538 (OP) #

Interesting - when I asked the omni model at qwen.com what version it was, I got a testy "I don't have a version" and then was told my chat was blocked for inappropriate content. A second try asking for knowledge cutoff got me the more equivocal "2024, but I know stuff after that date, too".

No idea how to check if this is actually deployed on qwen.com right now.

replies(2): >>46222912 #>>46224358 #

58. vessenes ◴[10 Dec 25 19:54 UTC] No.46222792[source]▶

>>46220168 #

I'm not convinced its end-to-end multimodal - in that case, you'll have a speech synthesis section and this will be some of the result. You could test by having it sing or do some accents, or have it talk back to you in an accent you give it.

59. plipt ◴[10 Dec 25 19:59 UTC] No.46222880{4}[source]▶

>>46222718 #

Thanks

Was it being closed weight obvious to you from the article? Trying to understand why I was confused. Had not seen the "Flash" designation before

Also 30B models can beat a semi-recent 235B with just some additional training?

replies(1): >>46223169 #

60. coder543 ◴[10 Dec 25 20:00 UTC] No.46222889{5}[source]▶

>>46222530 #

Based on things I had read over the past several months, Qwen3-Flash seemed to just be a weird marketing term for the Qwen3-Omni-30B-A3B series, not a different model. If they are not the same, then that is interesting/confusing.

replies(1): >>46223113 #

61. zamadatix ◴[10 Dec 25 20:01 UTC] No.46222912[source]▶

>>46222768 #

> No idea how to check if this is actually deployed on qwen.com right now.

Assuming you mean qwen.ai, when you run a query it should take you to chat.qwen.ai with the list of models in the top left. None of the options appear to be the -Omni variant (at least when anonymously accessing it).

replies(1): >>46222942 #

62. vessenes ◴[10 Dec 25 20:03 UTC] No.46222942{3}[source]▶

>>46222912 #

Thanks - yes - I did. The blog post suggests clicking the 'voice' icon on the bottom right - that's what I did.

63. readyplayeremma ◴[10 Dec 25 20:05 UTC] No.46222976[source]▶

>>46221096 #

The models are right here, one of the first links in the post: https://huggingface.co/collections/Qwen/qwen3-omni

edit: Nevermind, in spite of them linking it at the top, they are the old models. Also, the HF demo is calling their API and not using HF for compute.

replies(1): >>46224501 #

64. red2awn ◴[10 Dec 25 20:15 UTC] No.46223113{6}[source]▶

>>46222889 #

It is an in-house closed weight model for their own chat platform, mentioned in Section 5 of the original paper: https://arxiv.org/pdf/2509.17765

I've seen it in their online materials too but can't seem to find it now.

65. ivape ◴[10 Dec 25 20:16 UTC] No.46223129[source]▶

>>46220123 #

That's exciting. I doubt there are any polished voice chat local apps yet that you can easily plug this into (I doubt the user experience is "there" yet). Even stuff like Silly Tavern is near unusable, lots of work to be done on the local front. Local voice models are what's going to enable that whole Minority Report workflow soon enough (especially if commands and intent are determined at the local level, and the meat of the prompt is handled by a larger remote model).

This is part of programming that I think is the new field. There will be tons of work for those that can build the new workflows which will need to be primarily natural language driven.

replies(1): >>46223285 #

66. pugio ◴[10 Dec 25 20:19 UTC] No.46223160{3}[source]▶

>>46222239 #

There is no TTS here. It's a native audio output model which outputs audio tokens directly. (At least, that's how the other real-time models work. Maybe I've misunderstood the Qwen-Omni architecture.)

replies(1): >>46223385 #

67. red2awn ◴[10 Dec 25 20:19 UTC] No.46223169{5}[source]▶

>>46222880 #

They had a Flash variant released alongside the original open weight release. It is also mentioned in Section 5 of the paper: https://arxiv.org/pdf/2509.17765

For the evals it's probably just trained on a lot of the benchmark adjacent datasets compared to the 235B model. Similar thing happened on other model today: https://x.com/NousResearch/status/1998536543565127968 (a 30B model trained specifically to do well in maths get near SOTA scores)

68. novaray ◴[10 Dec 25 20:21 UTC] No.46223198{3}[source]▶

>>46222558 #

Whisper and Qwen Omni models have completely different architectures as far as I know

69. sosodev ◴[10 Dec 25 20:27 UTC] No.46223285{3}[source]▶

>>46223129 #

I did find this app: https://github.com/gabber-dev/gabber

The creator posted a little demo of it working with Qwen3 Omni that is quite impressive: https://www.youtube.com/watch?v=5DBFVe3cLto

He didn't include any details regarding how the model was running though

70. sosodev ◴[10 Dec 25 20:29 UTC] No.46223310{3}[source]▶

>>46222544 #

That's unfortunate but not too surprising. This type of model is very new to the local hosting space.

71. artur44 ◴[10 Dec 25 20:35 UTC] No.46223385{4}[source]▶

>>46223160 #

True, but even with native audio-token models you still need to split the model’s output channels. Reasoning/internal tokens shouldn't go into the audio stream only user-facing content should be emitted as audio. The principle is the same, whether the last step is TTS or audio token generation.

replies(1): >>46232625 #

72. plufz ◴[10 Dec 25 20:44 UTC] No.46223537{5}[source]▶

>>46222088 #

That would be a great if you could have a setting like temperature 0.0-1.0 (Only answer if you are 100% to guess as much as you like).

73. red2awn ◴[10 Dec 25 20:45 UTC] No.46223561[source]▶

>>46220493 #

Why would you use an Omni model for text only workload... There is Qwen3-30B-A3B.

74. AndreSlavescu ◴[10 Dec 25 20:50 UTC] No.46223630{3}[source]▶

>>46222544 #

We actually deployed working speech to speech inference that builds on top of vLLM as the backbone. The main thing was to support the "Talker" module, which is currently not supported on the qwen3-omni branch for vLLM.

Check it out here: https://models.hathora.dev/model/qwen3-omni

replies(2): >>46223885 #>>46224354 #

75. red2awn ◴[10 Dec 25 21:08 UTC] No.46223885{4}[source]▶

>>46223630 #

Nice work. Are you working on streaming input/output?

replies(1): >>46223956 #

76. AndreSlavescu ◴[10 Dec 25 21:14 UTC] No.46223956{5}[source]▶

>>46223885 #

Yeah, that's something we currently support. Feel free to try the platform out! No cost to you for now, you just need a valid email to sign up on the platform.

replies(1): >>46229560 #

77. sosodev ◴[10 Dec 25 21:49 UTC] No.46224354{4}[source]▶

>>46223630 #

Is your work open source?

replies(1): >>46278997 #

78. mh- ◴[10 Dec 25 21:49 UTC] No.46224358[source]▶

>>46222768 #

For what it's worth, that's not a reliable way to check what model you're interacting with.

replies(1): >>46234830 #

79. littlestymaar ◴[10 Dec 25 21:50 UTC] No.46224362{5}[source]▶

>>46222088 #

It's not going to happen as the user would just leave the platform.

It would be better for most API usage though, as for business doing just a fraction of the job with 100% accuracy is often much preferable than claiming to do 100% but 20% is garbage.

80. aschobel ◴[10 Dec 25 22:00 UTC] No.46224501{3}[source]▶

>>46222976 #

It is super confusing. I also thought this initially was open weights.

81. potatoman22 ◴[10 Dec 25 22:34 UTC] No.46224919[source]▶

>>46220123 #

From what I can tell, their official chat site doesn't have a native audio -> audio model yet. I like to test this through homophones (e.g. record and record) and asking it to change its pitch or produce sounds.

replies(3): >>46225836 #>>46227448 #>>46227486 #

82. sosodev ◴[10 Dec 25 23:58 UTC] No.46225836{3}[source]▶

>>46224919 #

Huh, you're right. I tried your test and it clearly can't understand the difference between homophones. That seems to imply they're using some sort of TTS mechanism. Which is really weird because Qwen3-Omni claims to support direct audio input into the model. Maybe it's a cost saving measure?

replies(2): >>46227943 #>>46238306 #

83. forgingahead ◴[11 Dec 25 01:34 UTC] No.46226594[source]▶

>>46219538 (OP) #

I truly enjoy how the naming conventions seem to follow how I did homework assignments back in the day: finalpaper-1-dec2nd, finalpaper-2-dec4th, etc etc.

84. parineum ◴[11 Dec 25 02:14 UTC] No.46226858{5}[source]▶

>>46220437 #

What if you are trying to fix one of these things and needed a list of replacement parts?

replies(1): >>46226936 #

85. bongodongobob ◴[11 Dec 25 02:19 UTC] No.46226884[source]▶

>>46219993 #

Lol I asked it how many rooms I have in my house and it got that wrong. Llms are useless amirite

replies(1): >>46226905 #

86. whimsicalism ◴[11 Dec 25 02:23 UTC] No.46226911{3}[source]▶

>>46222544 #

Makes sense, I think streaming audio->audio inference is a relatively big lift.

replies(1): >>46229292 #

87. esafak ◴[11 Dec 25 02:28 UTC] No.46226936{6}[source]▶

>>46226858 #

Not the right problem for this model. Any RAG-backed SLM would do; the important part is being backed by a search engine, like https://google.com/ai

88. djtango ◴[11 Dec 25 03:53 UTC] No.46227448{3}[source]▶

>>46224919 #

Is record a homophone? At least in the UK we use different pronunciations for the meanings. Re-cord for the verb, rec-ord for the noun.

replies(1): >>46238269 #

89. dragonwriter ◴[11 Dec 25 03:59 UTC] No.46227486{3}[source]▶

>>46224919 #

“record and record”, if you mean the verb for persisting something and the noun for the thing persisted, are heteronyms (homographs which are not homophones), which incidentally is also what you would probably want to test what you are talking about here (distinguishing homophones would test use of context to understand meaning, but wouldn’t test anything about whether or not logic was working directly on audio or only working on text processed from audio, failing to distinguish heteronyms is suggestive of processing occurring on text, not audio directly.)

replies(2): >>46227622 #>>46238285 #

90. bakeman ◴[11 Dec 25 04:21 UTC] No.46227622{4}[source]▶

>>46227486 #

There are homophones of “record”, such as:

“He’s on record saying he broke the record for spinning a record.”

replies(1): >>46227911 #

91. dragonwriter ◴[11 Dec 25 05:13 UTC] No.46227911{5}[source]▶

>>46227622 #

True.

OTOH my point that the thing being suggested to be tested is not testable by seeing whether or not the system is capable of distinguishing homophones, but might be by seeing whether or not it distingishes heteronyms still stands. (The speculation that the record/record distinction intended was one that is actually a pair of heteronyms and that the error was merely the use of the word “homophone" in place of “heteronym”, rather than the basic logic of the comment is somewhat tangential to the main point.)

92. sosodev ◴[11 Dec 25 05:19 UTC] No.46227943{4}[source]▶

>>46225836 #

Weirdly, I just tried it again and it seems to understand the difference between record and record just fine. Perhaps if there's heavy demand for voice chat, like after a new release, they load shed by using TTS to a smaller model.

However, It still doesn't seem capable of producing any of the sounds, like laughter, that I would expect from a native voice model.

93. wkat4242 ◴[11 Dec 25 07:31 UTC] No.46228648{4}[source]▶

>>46221482 #

Yeah that overly fake-excited voice type. Doesn't work for Europe at all. But indeed common in American customer service scenarios.

94. mohsen1 ◴[11 Dec 25 08:15 UTC] No.46228886[source]▶

>>46219538 (OP) #

Having lots of success with Gemini Flash Live 2.5. I am hoping 3.0 to come out soon. Benchmarks here claim better results that Gemini Live but have to test it. In past I've always been disappointed with Qwen Omni models in my English-first case...

95. red2awn ◴[11 Dec 25 09:21 UTC] No.46229292{4}[source]▶

>>46226911 #

Correct, it's breaks the single prompt, single completion assumption baked into the frameworks. Conceptually it's still prompt/completion but for low latency response you have to do streaming KV cache prefill with a websocket server.

replies(1): >>46234434 #

96. andy_ppp ◴[11 Dec 25 09:54 UTC] No.46229484[source]▶

>>46220742 #

Haha, you could hear how it’s mind thinks, maybe by putting a lot of reverb on the thinking tokens or some other effect…

97. valleyer ◴[11 Dec 25 10:07 UTC] No.46229560{6}[source]▶

>>46223956 #

I tried this out, and it's not passing the record (n.) vs. record (v.) test mentioned elsewhere in this thread. (I can ask it to repeat one, and it often repeats the other.) Am I not enabling the speech-to-speech-ness somehow?

replies(1): >>46278969 #

98. skrunch ◴[11 Dec 25 10:41 UTC] No.46229791[source]▶

>>46220493 #

Except the image benchmarks are compared against 2.0, which seems suspicious that they would casually drop to an older model for those.

99. andy_ppp ◴[11 Dec 25 14:20 UTC] No.46231694[source]▶

>>46219538 (OP) #

Qwen seem to be deliberately confusing about if they are releasing models open weight or not. I think largely not any more and you can go on quite a wild goose chase looking for different things that are implied they are released but are actually only available via API.

100. regularfry ◴[11 Dec 25 15:32 UTC] No.46232625{5}[source]▶

>>46223385 #

There's an assumption there that the audio stream contains an equivalent of the <think>/</think> tokens. Every reason to think it should, but without seeing the tokeniser config it's a bit of a guess.

101. whimsicalism ◴[11 Dec 25 17:37 UTC] No.46234434{5}[source]▶

>>46229292 #

I imagine you have to start decoding many speculative completions in parallel to have true low latency.

102. vessenes ◴[11 Dec 25 18:08 UTC] No.46234830{3}[source]▶

>>46224358 #

It’s a good positive signal, but not a good negative one.

It would be convincing if it said “I’m qwen-2025-12-whatever”. I agree it’s not dispositive if it refuses or claims to be llama 3 say. Generally most models I talk to do not hallucinate future versions of themselves, in fact it can be quite difficult to get them to use recent model designations; they will often autocorrect to older models silently.

103. potatoman22 ◴[11 Dec 25 22:36 UTC] No.46238269{4}[source]▶

>>46227448 #

I was mistaken about what homophone means!

104. potatoman22 ◴[11 Dec 25 22:37 UTC] No.46238285{4}[source]▶

>>46227486 #

Ah I meant heteronyms. Thanks!

105. potatoman22 ◴[11 Dec 25 22:38 UTC] No.46238306{4}[source]▶

>>46225836 #

To be fair, discerning heteronyms might just be a gap in its training.

106. gardnr ◴[13 Dec 25 17:20 UTC] No.46256154{3}[source]▶

>>46221587 #

I was wrong. I confused this with their open model. Looking at it more closely, it is likely an omni version of Qwen3-235B-A22B. I wonder why they benchmarked it against Qwen2.5-Omni-7B instead of Qwen3-Omni-30B-A3B.

I wish I could delete the comment.

107. Alifatisk ◴[14 Dec 25 13:34 UTC] No.46262907[source]▶

>>46221096 #

It seems to be available on Qwen chat? https://chat.qwen.ai/settings/model?id=qwen3-omni-flash-2025...

108. AndreSlavescu ◴[15 Dec 25 19:14 UTC] No.46278969{7}[source]▶

>>46229560 #

From my understanding of the above problem, this would be something to do with the model weights. Have you tested this with the transformers inference baseline that is shown on huggingface?

In our deployment, we do not actually tune the model in any way, this is all just using the base instruct model provided on huggingface:

https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct

And with the potential concern around conversation turns, our platform is designed for one-off record -> response flows. But via the API, you can build your own conversation agent to use the model.

109. AndreSlavescu ◴[15 Dec 25 19:15 UTC] No.46278997{5}[source]▶

>>46224354 #

At the moment, no unfortunately. However, to my recent knowledge of open source alternatives, the vLLM team published a separate repository for omni models now:

https://github.com/vllm-project/vllm-omni

I have not yet tested out if this does full speech to speech, but this seems like a promising workspace for omni-modal models.

↑