Qwen2.5-VL-32B: Smarter and Lighter

(qwenlm.github.io)

544 points tosh | 1 comments | 24 Mar 25 18:35 UTC


simonw ◴[24 Mar 25 18:53 UTC] No.43464243[source]▶

32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough you can run them on a single GPU or a reasonably well specced Mac laptop (32GB or more).

replies(9): >>43464289 #>>43464380 #>>43464443 #>>43464588 #>>43464688 #>>43467991 #>>43468940 #>>43469099 #>>43470619 #

faizshah ◴[24 Mar 25 19:41 UTC] No.43464688[source]▶

>>43464243 #

I just started self-hosting as well on my local machine; I've been using https://lmstudio.ai/ locally for now.

I think the 32b models are actually good enough that I might stop paying for ChatGPT plus and Claude.

I get around 20 tok/second on my M3, and I can get 100 tok/second on smaller or quantized models. 80-100 tok/second is the best for interactive usage; if you go above that you basically can’t read as fast as it generates.

I also really like the QwQ reasoning model. I haven’t gotten around to trying locally hosted models for agents and RAG yet; coding agents especially are what I’m interested in. I feel like 20 tok/second is fine if it’s just running in the background.

Anyways, I would love to know others’ experiences; that was mine this weekend. The way it’s going I really don’t see a point in paying. I think on-device is the near future, and they should just charge a licensing fee, like a DB provider does, for enterprise support and updates.

If you were paying $20/mo for ChatGPT 1 year ago, the 32b models are basically at that level: slightly slower and slightly lower quality, but useful enough to consider cancelling your subscriptions at this point.

replies(3): >>43464710 #>>43465059 #>>43470007 #

wetwater ◴[24 Mar 25 19:44 UTC] No.43464710[source]▶

>>43464688 #

Are there any good sources I can read up on for estimating what hardware specs would be required for 7B, 13B, 32B, etc. model sizes if I need to run them locally? I am a grad student on a budget, but I want to host one locally and am trying to build a PC that could run one of these models.

replies(6): >>43464785 #>>43464973 #>>43464999 #>>43465270 #>>43465970 #>>43468258 #

p_l ◴[24 Mar 25 20:48 UTC] No.43465270[source]▶

>>43464710 #

Generally, for unquantized models, double the parameter count (in billions) and that's the amount of VRAM in GB you need, plus some extra. Most models use fp16 weights, so it's 2 bytes per parameter: 32B parameters = 64GB.

Typical quantization to 4-bit will cut a 32B model down to 16GB of weights plus some runtime data, which makes it possibly usable (if slow) on a 16GB GPU. You can sometimes viably use smaller quantizations, which will reduce memory use even more.
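
As a rough illustration of the rule of thumb above, here is a minimal Python sketch that turns parameter count and bits per weight into a weights-only memory estimate. The function name `weight_memory_gb` is made up for this example, and the result deliberately ignores runtime data such as context and activations, so treat it as a lower bound rather than a real requirement.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Hypothetical helper: ignores context, activations and runtime overhead.
    """
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billions * bytes_per_param


if __name__ == "__main__":
    for size in (7, 13, 32):
        fp16 = weight_memory_gb(size, 16)  # unquantized fp16: 2 bytes per parameter
        q4 = weight_memory_gb(size, 4)     # 4-bit quantization: 0.5 bytes per parameter
        print(f"{size}B: fp16 ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB (weights only)")
```

For 32B this reproduces the figures in the comment: 64 GB at fp16 and 16 GB at 4-bit, before any extra for runtime data.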

replies(1): >>43470023 #

regularfry ◴[25 Mar 25 11:34 UTC] No.43470023[source]▶

>>43465270 #

You always want a bit of headroom for context. It's a problem I keep bumping into with 32B models on a 24GB card: the decent quants fit, but the context you have available on the card isn't quite as much as I'd like.
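
To put a rough number on that headroom, here is a hedged Python sketch of the usual key/value cache estimate: each token of context stores one key and one value vector per layer. The layer count, KV head count, head dimension, quant size and overhead used below are illustrative assumptions, not confirmed specs for any particular model or runtime, and the helper names are made up for this example.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache needed for one token of context."""
    # Factor of 2 because both keys and values are cached in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_context_tokens(vram_gb: float, weights_gb: float, overhead_gb: float,
                       n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Tokens of context that fit in whatever VRAM is left after the weights."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value)
    return max(0, int(free_bytes // per_token))


if __name__ == "__main__":
    # Assumed, illustrative values: 24 GB card, 19 GB quantized weights,
    # 1 GB runtime overhead, 64 layers, 8 KV heads, head dimension 128,
    # fp16 cache entries (2 bytes each).
    tokens = max_context_tokens(24, 19, 1, n_layers=64, n_kv_heads=8, head_dim=128)
    print(f"Roughly {tokens} tokens of context fit in the remaining VRAM")
```

Under these assumptions each token costs 256 KiB of cache, so a few GB of spare VRAM goes quickly; a smaller quant or a quantized KV cache (where the runtime supports one) changes the result proportionally.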
