544 points tosh | 2 comments | source
simonw ◴[] No.43464243[source]
32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough you can run them on a single GPU or a reasonably well specced Mac laptop (32GB or more).
replies(9): >>43464289 #>>43464380 #>>43464443 #>>43464588 #>>43464688 #>>43467991 #>>43468940 #>>43469099 #>>43470619 #
clear_view ◴[] No.43464289[source]
A 32B model doesn't fully fit in 16GB of VRAM. Still fine for higher-quality answers, worth the extra wait in some cases.
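The "doesn't fit in 16GB" claim is easy to sanity-check with weight-only arithmetic. A sketch, assuming typical llama.cpp-style effective bits per weight (Q8_0 ≈ 8.5 bits, Q4_K_M ≈ 4.8 bits are approximate figures, not exact):

```python
def model_weight_bytes(params_billions, bits_per_weight):
    """Approximate weight-only memory footprint (ignores KV cache and activations)."""
    return params_billions * 1e9 * bits_per_weight / 8

# Rough effective bits per weight for common formats (approximate)
for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    gb = model_weight_bytes(32, bits) / 1e9
    print(f"{name}: ~{gb:.1f} GB")
```

Even at ~4.8 bits/weight a 32B model is around 19GB of weights alone, so it spills out of a 16GB card before you account for context.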
replies(1): >>43464849 #
abraxas ◴[] No.43464849[source]
Would a 48GB A6000 fully accommodate a 32B model? I assume quantization below fp16 is still necessary?
replies(2): >>43465037 #>>43466059 #
manmal ◴[] No.43465037[source]
At FP16 you'd need 64GB just for the weights, and it'd be about 2x as slow as a Q8 version, likely with little quality improvement. You'll also need space for the KV cache, activations, context, etc., so 80-100GB (or even more) of VRAM would be better.

Many people "just" use 4x consumer GPUs like the 3090 (24GB each), which scales well. They'd typically buy a mining rig, an EPYC CPU, a mainboard with sufficient PCIe lanes, PCIe risers, a 1600W PSU (might need to limit the GPUs to 300W), and 128GB RAM. Depending on what you pay for the GPUs, that'll be 3.5-4.5k.
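The "space for attention and context" part can be estimated too. A sketch of KV-cache sizing for a hypothetical 32B-class model; the layer count, KV-head count, and head dimension below are illustrative assumptions, not the specs of any particular model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """K and V each store layers * kv_heads * head_dim values per token (FP16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical 32B-class config: 64 layers, 8 KV heads (GQA), head_dim 128, 32k context
gb = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128, ctx_len=32768) / 1e9
print(f"KV cache at 32k context: ~{gb:.1f} GB")
```

Under these assumptions a full 32k context adds several GB on top of the weights, which is why the headroom above 64GB matters.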

replies(2): >>43465327 #>>43465589 #
abraxas ◴[] No.43465589[source]
Would it be better for energy efficiency and overall performance to use workstation cards like the A5000 or A4000? Those can be found on eBay.
replies(1): >>43466176 #
manmal ◴[] No.43466176[source]
Looks like the A4000 has low memory bandwidth (about 50% of a 4090?), which is usually the limiting factor for inference. But they are efficient; if you can get them cheap, probably a good entry setup. If you like running models that need a lot of VRAM, you'll likely run out of PCIe slots before you're done upgrading.
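Why bandwidth is the limiting factor: during single-stream decoding, every generated token streams roughly the whole model through memory once, so bandwidth divided by model size gives an upper bound on tokens/sec. A sketch; the bandwidth numbers are ballpark spec-sheet figures (A4000 ~448 GB/s, RTX 4090 ~1008 GB/s) and should be checked against vendor documentation:

```python
def decode_tps_upper_bound(bandwidth_gb_per_s, model_gb):
    """Memory-bound decode: each token reads ~all weights once."""
    return bandwidth_gb_per_s / model_gb

MODEL_GB = 19  # ~32B model at ~4.8 bits/weight (assumption)
for gpu, bw in [("A4000", 448), ("RTX 4090", 1008)]:
    tps = decode_tps_upper_bound(bw, MODEL_GB)
    print(f"{gpu}: ~{tps:.0f} tok/s upper bound")
```

Real throughput lands below this bound, but the ratio between cards tracks their bandwidth ratio, which is the "50% of a 4090" point.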