
544 points by tosh | 1 comment
simonw No.43464243
32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough that you can run it on a single GPU or a reasonably well-specced Mac laptop (32GB of RAM or more).
clear_view No.43464289
A 32B model doesn't fully fit in 16GB of VRAM. Still fine for higher-quality answers; worth the extra wait in some cases.
abraxas No.43464849
Would a 48GB A6000 fully accommodate a 32B model? I assume fp16 precision is still necessary?
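A back-of-the-envelope sketch of the VRAM question (weights only; KV cache and runtime overhead come on top, so real usage is higher):

```python
# Rough weights-only memory estimate for an LLM at a given precision.
# Ignores KV cache, activations, and framework overhead.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Decimal GB needed to hold the weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"32B @ {bits}-bit: {weights_gb(32, bits):.0f} GB")
# 32B @ 16-bit: 64 GB  -> does not fit a 48GB card
# 32B @ 8-bit:  32 GB  -> fits, with headroom for the KV cache
# 32B @ 4-bit:  16 GB  -> fits comfortably
```

So at fp16 the weights alone exceed a 48GB card, while 8-bit fits with room to spare.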
elorant No.43466059
You don't need 16-bit precision. In most models, the accuracy drop from 8-bit quantization is less than 5%.
int_19h No.43469069
Even 4-bit is fine.

To be more precise, it's not that there's no decrease in quality; it's that with the RAM savings you can fit a much better model. E.g. with LLaMA, if you start with a 70B model and quantize it increasingly aggressively, it still performs considerably better at 3-bit than LLaMA 33B running at 8-bit.

elorant No.43470297
True. The one problem with heavier quantization, though, is that the model starts to struggle with long prompts.