
361 points by mseri | 13 comments
1. thot_experiment No.46002543
Qwen3-30B-VL is going to be fucking hard to beat as a daily driver; it's so good for the base 80% of tasks I want an AI for, and holy fuck is it fast. 90 tok/s on my machine, so I pretty much keep it in VRAM permanently. I think this sort of work is important and I'm really glad it's being done, but as something I want to use every day there's no way a dense model can compete unless it's smart as fuck. Even dumb models like Qwen3-30B get a lot of stuff right, and not having to wait is amazing.
replies(3): >>46002752, >>46005422, >>46005940
2. psychoslave No.46002752
Thanks for the hint. I just tried it on a brand-new Mac laptop, and it's very slow here. But it led me to test qwen2.5:14b, which looks like it can give an instant feedback loop.

It can even interact in fluent Esperanto, very nice.

replies(1): >>46002821
3. thot_experiment No.46002821
I'm specifically talking about qwen3-30b-a3b, the MoE model (this also applies to the big one). It's very, very fast and pretty good, and speed matters when you're using it to replace basic Google searches and do quick text manipulation.
replies(1): >>46003208
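For a sense of what "fast enough to replace a search" means in practice, here is a minimal sketch of timing a local model through the OpenAI-compatible endpoint that llama.cpp's llama-server exposes. The port, the model name, and the presence of a "usage" block in the response are assumptions taken from the server command quoted later in the thread; adjust for your own setup.

    # Rough tokens-per-second check against a local llama.cpp server.
    # Assumes llama-server is listening on localhost:8080 with the
    # OpenAI-compatible /v1/chat/completions endpoint and that the
    # response includes a "usage" block (verify against your build).
    import time
    import requests

    def measure_tps(prompt: str, url: str = "http://localhost:8080/v1/chat/completions") -> float:
        start = time.time()
        resp = requests.post(url, json={
            "model": "qwen3-vl-30b-a3b",  # llama-server largely ignores this name
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        })
        resp.raise_for_status()
        elapsed = time.time() - start  # includes prompt processing, so this understates pure generation speed
        return resp.json()["usage"]["completion_tokens"] / elapsed

    print(f"{measure_tps('Who landed on the moon in 1969?'):.1f} tok/s")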
4. a96 No.46003208
I'm only superficially familiar with these, but curious. Your comment above mentioned the VL model. Isn't that a different model or is there an a3b with vision? Would it be better to have both if I'd like vision or does the vision model have the same abilities as the text models?
replies(2): >>46003560, >>46004494
5. solarkraft No.46003560
Looks like it: https://ollama.com/library/qwen3-vl:30b-a3b
replies(1): >>46008969
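If you go the Ollama route from that page, a minimal sketch of querying it over Ollama's local REST API follows. The model tag comes from the link above; the default port 11434 and the /api/chat request shape are standard Ollama behavior, but worth double-checking against your install.

    # Minimal chat request against a local Ollama instance.
    # Assumes `ollama pull qwen3-vl:30b-a3b` has already been run and
    # the daemon is listening on its default port.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "qwen3-vl:30b-a3b",
            "messages": [{"role": "user", "content": "In two sentences, what is a mixture-of-experts model?"}],
            "stream": False,  # return one JSON object instead of a stream of chunks
        },
    )
    resp.raise_for_status()
    print(resp.json()["message"]["content"])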
6. mark_l_watson No.46004494
This has been my question also: I spend a lot of time experimenting with local models and almost all of my use cases involve text data, but having image processing and understanding would be useful.

How much do I give up (in performance, running on my 32GB M2 Pro Mac) using the VL version of a model? For MoE models, hopefully not much.

replies(1): >>46008874
7. andai No.46005422
I'm out of the loop... so Qwen3-30B-VL is smart and Qwen3-30B is dumb... and that has to do not with size but with architecture?
replies(2): >>46005995, >>46008928
8. comp_raccoon No.46005940
Olmo author here! Qwen models are in general amazing, but the 30B is very fast because it's an MoE. MoEs are very much on the roadmap for the next Olmo.
9. comp_raccoon No.46005995
Olmo author here, but I can help! The first release of Qwen 3 left a lot of performance on the table because they had some challenges balancing thinking and non-thinking modes. The VL series has a refreshed post-training run, so they are much better!
10. thot_experiment No.46008874
all the qwen flavors have a VL version and the vision stack is a separate set of tensors; it only costs a bit of extra VRAM to keep it resident, and vision-based queries take longer to process context, but generation is still fast asf

i think the model itself is actually "smarter" because they split the thinking and instruct models, so each variant gets better at its own job

i use it almost exclusively to OCR handwritten todo lists into my todo app and i don't think it's missed yet; it does a great job of tool-calling everything
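The photograph-a-todo-list-into-tool-calls flow described above might look roughly like this against the same local OpenAI-compatible endpoint. Whether image_url data URIs and tool calling work depends on how the server was launched (llama-server with --mmproj and --jinja in this thread), and the add_todo tool below is a hypothetical stand-in for whatever a todo app actually exposes.

    # Sketch of "photograph a handwritten todo list, have the model call a
    # tool per item". Endpoint and payload shape assume a llama-server style
    # OpenAI-compatible API with vision enabled; add_todo is a made-up tool.
    import base64
    import requests

    with open("todo_photo.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    tools = [{
        "type": "function",
        "function": {
            "name": "add_todo",  # hypothetical: maps to whatever your todo app exposes
            "description": "Add a single todo item",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    }]

    resp = requests.post("http://localhost:8080/v1/chat/completions", json={
        "model": "qwen3-vl-30b-a3b",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Read this handwritten todo list and add each item with add_todo."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        "tools": tools,
    })
    resp.raise_for_status()
    for call in resp.json()["choices"][0]["message"].get("tool_calls", []):
        print(call["function"]["name"], call["function"]["arguments"])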

11. thot_experiment No.46008928
ahaha sorry, that was unclear. While I think the VL version is maybe a bit more performant, by "dumb" I meant any low-quant, small model you're going to run locally, whereas a "smart" model in my book is something like Opus 4.1 or Gemma 3.

I basically class LLM queries into two categories: there's stuff I expect most models to get, and there's stuff I expect only the smartest models to have a shot at getting right. There's some stuff in the middle ground that a quantized model running locally might not get but something dumb-but-acceptable like Sonnet 4.5 or Kimi K2 might be able to handle.

I generally just stick to the two extremes and route my queries accordingly. I've been burned by Sonnet 4.5/GPT-5 too many times to trust them.

replies(1): >>46010808
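The two-extremes routing described above is simple enough to sketch. The model names and the difficulty heuristic here are illustrative assumptions, not anything taken from the thread.

    # Toy query router: easy queries go to the fast local model, anything that
    # looks hard goes to a frontier model. The heuristic is deliberately crude;
    # in practice the split is a per-query judgment call.
    LOCAL_MODEL = "qwen3-vl-30b-a3b"   # fast, kept in VRAM
    FRONTIER_MODEL = "opus-4.1"        # placeholder name for the "smart" tier

    HARD_HINTS = ("prove", "refactor", "debug", "architecture", "tricky")

    def route(query: str) -> str:
        """Pick which model tier a query should go to."""
        looks_hard = len(query) > 500 or any(h in query.lower() for h in HARD_HINTS)
        return FRONTIER_MODEL if looks_hard else LOCAL_MODEL

    print(route("capital of mongolia?"))                          # -> qwen3-vl-30b-a3b
    print(route("Refactor this module to remove global state"))   # -> opus-4.1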
12. thot_experiment No.46008969
fwiw on my machine it's about 1.5x faster to run inference in llama.cpp than through Ollama; these are the settings I use for the qwen I just keep in VRAM permanently

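    # --mmproj loads the vision projector alongside the text weights, --jinja
    # enables the model's chat template (needed for tool calls), -ngl 99
    # offloads all layers to the GPU, and -c 65536 sets a 64K-token context: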
    llama-server --host 0.0.0.0 --model Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf --mmproj qwen3-VL-mmproj-F16.gguf --port 8080 --jinja --temp 0.7 --top-k 20 --top-p 0.8 -ngl 99 -c 65536 --repeat_penalty 1.0 --presence_penalty 1.5
13. thot_experiment No.46010808
sorry, i meant Gemini 3