
dale_glass | No.44442315
How big are those in terms of size on disk and VRAM size?

Something like 1.61B just doesn't mean much to me since I don't know much about the guts of LLMs. But I'm curious about how that translates to computer hardware -- what specs would I need to run these? What could I run now, what would require spending some money, and what might I hope to be able to run in a decade?

loudmax | No.44442714
Most of these models have been trained using 16-bit weights. So a 1 billion parameter model takes up 2 gigabytes.

In practice, models can be quantized to smaller weights for inference. Usually, the performance loss going from 16-bit weights to 8-bit weights is very minor, so a 1 billion parameter model can take 1 gigabyte. Thinking about these models in terms of 8-bit quantized weights has the added benefit of making the math really easy: a 20B model needs 20G of memory. Simple.
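As a rough sketch of that arithmetic (weights only; the parameter counts below are just examples, and real checkpoint files plus the KV cache at inference time add some overhead):

    # Back-of-envelope memory footprint for model weights at different precisions.
    # Weights only: real files (and the KV cache at inference time) add overhead.

    def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
        """Approximate size of the weights alone, in gigabytes."""
        return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for params in (1, 7, 20, 70):  # hypothetical model sizes, for illustration
        line = ", ".join(f"{bits}-bit ~ {weight_memory_gb(params, bits):.1f} GB"
                         for bits in (16, 8, 4))
        print(f"{params:>2}B params: {line}")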

Of course, models can be quantized down even further, at a greater cost to inference quality. Depending on what you're doing, 5-bit weights or even lower might be perfectly acceptable. There's some indication that models trained natively at lower precision can perform better than models of the same size that were trained at higher precision and then quantized down. For example, a model trained using 4-bit weights might outperform a model that was trained at 16 bits and then quantized to 4 bits.
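As a hedged illustration of where that quality loss comes from, here's a minimal sketch of post-training quantization: plain symmetric round-to-nearest over a made-up weight tensor. Real schemes (GPTQ, AWQ, llama.cpp's K-quants) use per-group scales and are considerably smarter, but the basic trade-off is the same.

    # Minimal sketch of post-training quantization: symmetric round-to-nearest
    # to n bits over a toy weight tensor. Shows the rounding error that grows
    # as the bit width drops; real quantizers mitigate it with per-group scales.
    import numpy as np

    def quantize_dequantize(weights: np.ndarray, bits: int) -> np.ndarray:
        qmax = 2 ** (bits - 1) - 1                # e.g. 7 for 4-bit signed ints
        scale = np.abs(weights).max() / qmax      # one scale for the whole tensor
        q = np.clip(np.round(weights / scale), -qmax - 1, qmax)
        return q * scale                          # dequantized approximation

    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.02, size=100_000).astype(np.float32)  # toy "layer"
    for bits in (8, 5, 4, 2):
        err = np.abs(quantize_dequantize(w, bits) - w).mean()
        print(f"{bits}-bit: mean absolute rounding error {err:.6f}")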

When running models, much of the performance bottleneck is memory bandwidth. This is why LLM enthusiasts look for GPUs with as much VRAM as possible. Your computer might have 128G of RAM, but if a model spills out of VRAM into system RAM, the GPU's access to that memory is so constrained by bandwidth that you might as well run the model on the CPU. Running a model on the CPU can be done, it's just much slower, because the workload is massively parallel and CPUs have fewer cores and lower memory bandwidth than GPUs.

Today's higher-end consumer-grade GPUs top out at 24-32G of dedicated VRAM (an Nvidia RTX 4090 has 24G; an RTX 5090 has 32G and runs around $2k). The dedicated VRAM on a GPU has a memory bandwidth of roughly 1 TB/s. Apple's M-series ARM chips offer unified memory with bandwidth up to roughly 500 GB/s on the Max parts (about 800 GB/s on the Ultra), and they're one of the most popular ways to run larger LLMs on consumer hardware. AMD's new "Strix Halo" CPU+GPU chips have up to 128G of unified memory, with a memory bandwidth of about 256 GB/s.
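To make those bandwidth figures concrete, here's a back-of-envelope sketch: during single-stream generation, each token requires reading roughly all of the weights once, so memory bandwidth divided by model size gives a ceiling on tokens per second. This ignores compute, KV-cache traffic, and batching, and the bandwidth numbers are just the approximate figures above.

    # Rule of thumb: tokens/sec is bounded by memory bandwidth / model size,
    # because every generated token reads (roughly) all the weights once.
    # Ignores compute, KV-cache traffic, and batching; treat as a ceiling.

    def tokens_per_second_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    model_gb = 20  # e.g. a 20B-parameter model at 8-bit weights

    for name, bw in [("High-end GPU VRAM (~1,000 GB/s)", 1000),
                     ("Apple M-series Max (~500 GB/s)", 500),
                     ("AMD Strix Halo (~256 GB/s)", 256)]:
        print(f"{name}: ~{tokens_per_second_ceiling(bw, model_gb):.0f} tokens/sec")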

Reddit's r/LocalLLaMA is a reasonable place to look to see what people are doing with consumer grade hardware. Of course, some of what they're doing is bonkers so don't take everything you see there as a guide.

And as far as a decade from now, who knows. Currently, the top silicon fabs of TSMC, Samsung, and Intel are all working flat-out to meet the GPU demand from hyperscalers rolling out capacity (Microsoft Azure, AWS, Google, etc). Silicon chip manufacturing has traditionally followed a boom/bust cycle. But with geopolitical tensions, global trade barriers, AI-driven advances, and whatever other black swan events, what the next few years will look like is anyone's guess.
