Qwen2.5-VL-32B: Smarter and Lighter

1. jauntywundrkind ◴[24 Mar 25 18:47 UTC] No.43464180[source]▶

Wish I knew better how to estimate what sized video card one needs. HuggingFace link says this is bfloat16, so at least 64GB?

I guess the -7B might run on my 16GB AMD card?

replies(4): >>43464207 #>>43464240 #>>43464303 #>>43464853 #

2. wgd ◴[24 Mar 25 18:50 UTC] No.43464207[source]▶

>>43464180 (TP) #

You can run a 4-bit quantized version at a small (though nonzero) cost to output quality, so you would only need 16GB for that.

Also, it's entirely possible to run a model that doesn't fit in available GPU memory; it will just be slower.
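As a rough sketch of the arithmetic behind these numbers: weight memory is approximately parameter count times bits per weight, divided by 8. This counts weights only; it ignores the KV cache, activations, and runtime overhead, and real quantized formats store some extra per-block metadata, so treat the results as lower bounds. The function name is made up for illustration.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width and
    nothing else (KV cache, activations, overhead) is counted.
    """
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # A 32B-parameter model at bfloat16, 8-bit, and 4-bit.
    for bits in (16, 8, 4):
        print(f"32B @ {bits}-bit: ~{weight_memory_gb(32, bits):.0f} GB")
```

By this estimate a 32B model needs about 64 GB at bfloat16 and about 16 GB at 4 bits, which matches the figures quoted above before any context is allocated.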

3. xiphias2 ◴[24 Mar 25 18:53 UTC] No.43464240[source]▶

>>43464180 (TP) #

I wish they would start producing graphs that include the performance of quantized versions as well. What matters is RAM/bandwidth vs performance, not number of parameters.

4. clear_view ◴[24 Mar 25 19:00 UTC] No.43464303[source]▶

>>43464180 (TP) #

deepseek-r1:14b/mistral-small:24b/qwen2.5-coder:14b fit in 16GB VRAM with fast generation. 32b versions bleed into RAM and take a serious performance hit but are still usable.
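To get a feel for how much of a model "bleeds into RAM", here is a minimal sketch that estimates how many transformer layers fit on the GPU. It assumes all layers are the same size and reserves a fixed amount of VRAM for context and overhead; the function name, the 2 GB reserve, and the example sizes are hypothetical, not measurements of any of the models named above.

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 2.0) -> int:
    """Estimate how many layers fit in VRAM; the rest run from system RAM.

    Assumptions: layers are equally sized, and reserve_gb of VRAM is
    set aside for the KV cache and runtime overhead.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # Hypothetical 20 GB quantized model with 64 layers on a 16 GB card.
    on_gpu = layers_on_gpu(model_gb=20.0, n_layers=64, vram_gb=16.0)
    print(f"{on_gpu} of 64 layers on GPU, {64 - on_gpu} offloaded to RAM")
```

The layers left in system RAM are the ones that cause the slowdown, since they are limited by CPU memory bandwidth rather than GPU bandwidth.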

5. zamadatix ◴[24 Mar 25 20:01 UTC] No.43464853[source]▶

>>43464180 (TP) #

https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...

That will help you quickly calculate the model VRAM usage as well as the VRAM usage of the context length you want to use. You can put "Qwen/Qwen2.5-VL-32B-Instruct" in the "Model (unquantized)" field. Funnily enough, the calculator lacks an option to see the usage without quantizing the model, presumably because nobody worried about VRAM bothers running >8-bit quants.
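For the context-length part, a common back-of-the-envelope estimate for the KV cache of a standard transformer with grouped-query attention is 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per element. The sketch below assumes that formula and is not how the linked calculator is implemented. The layer and head counts are illustrative placeholders, not confirmed values for Qwen2.5-VL-32B; read the real ones from the model's config.json.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes for one sequence.

    Assumption: one key and one value vector of size head_dim is stored
    per layer, per KV head, per token, at bytes_per_elem bytes each.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


if __name__ == "__main__":
    # Placeholder architecture values, chosen only for illustration.
    size = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128,
                          context_len=32768, bytes_per_elem=2)
    print(f"KV cache: ~{size / 2**30:.1f} GiB")
```

The cache grows linearly with context length, so halving the context halves this term, while the weight memory stays fixed.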

replies(1): >>43465510 #

6. azinman2 ◴[24 Mar 25 21:20 UTC] No.43465510[source]▶

>>43464853 #

Except when it comes to deepseek

replies(1): >>43466518 #

7. zamadatix ◴[24 Mar 25 23:30 UTC] No.43466518{3}[source]▶

>>43465510 #

For others not as familiar, this is pointing out that DeepSeek-v3/DeepSeek-R1 are natively FP8, so selecting "Q8_0" is equivalent to not selecting quantization for that model (though you'll need ~1 TB of memory to use these models unquantized at full context). Importantly, this does not apply to the "DeepSeek" distills of other models, which retain the native precision of the base model they are built on.

I expect more and more worthwhile models to natively have <16-bit weights as time goes on, but for the moment it's pretty much "8-bit DeepSeek and some research/testing models of various parameter widths".

replies(1): >>43472502 #

8. azinman2 ◴[25 Mar 25 15:27 UTC] No.43472502{4}[source]▶

>>43466518 #

I wish deepseek distills were somehow branded differently. The amount of confusion I’ve come across from otherwise technical folk, or simply mislabeling (“I’m running r1 on my MacBook!”), is shocking. It’s my new pet peeve.