
544 points by tosh | 1 comment
simonw No.43464243
32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough that you can run it on a single GPU or a reasonably well-specced Mac laptop (32GB of RAM or more).
clear_view No.43464289
A 32B model doesn't fully fit in 16GB of VRAM. Still fine for higher-quality answers; worth the extra wait in some cases.
abraxas No.43464849
Would a 48GB A6000 fully accommodate a 32B model? I assume fp16 precision is still necessary?
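A back-of-the-envelope sketch of the VRAM question (weights only; KV cache and runtime overhead come on top, so real usage is higher):

```python
# Rough weights-only memory estimate for an LLM at a given precision.
# Ignores KV cache, activations, and framework overhead.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Decimal GB needed to hold the weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"32B @ {bits}-bit: {weights_gb(32, bits):.0f} GB")
# 32B @ 16-bit: 64 GB  -> does not fit a 48GB card
# 32B @ 8-bit:  32 GB  -> fits, with headroom for the KV cache
# 32B @ 4-bit:  16 GB  -> fits comfortably
```

So at fp16 the weights alone exceed a 48GB card, while 8-bit fits with room to spare.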
elorant No.43466059
You don't need 16-bit precision. In most models, the accuracy drop from 8-bit quantization is less than 5%.
int_19h No.43469069
Even 4-bit is fine.

To be more precise, it's not that there's no decrease in quality; it's that with the RAM savings you can fit a much better model. E.g. with LLaMA, if you start with a 70B model and quantize it increasingly aggressively, it still performs considerably better at 3-bit than LLaMA 33B running at 8-bit.

elorant No.43470297
True. The one problem with heavier quantization, though, is that the model starts to struggle with long prompts.