Congrats to Apple and Meta; it makes sense that they did the research, since this will go toward efficient serving of LLMs on phones. And it's very easy to implement.
"zero-shot accuracy retention at 4- and 3-bit compression to be on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines."
My reading of that is FP16-level accuracy at Q3/Q4 size and memory bandwidth, which is a huge advantage.
* LLaMA 3 8B: baseline 72.26, 4-bit 71.31, 3-bit 62.79
* LLaMA 3 70B: baseline 79.51, 4-bit 78.06, 3-bit 74.68
These results seem comparable to modern quantization methods—for example, the ~4-bit results for smaller LLaMA models listed here: https://ai.meta.com/blog/meta-llama-quantized-lightweight-mo...
Also seems like the techniques may be possible to combine.
This technique seems a bit similar to lossy image compression that replaces exact pixels with a combination of pre-defined patterns (the DCT in JPEG), except here the patterns come not from a cosine function but from a pseudo-random one.
It may also beat simple quantization because the added noise acts as dithering, breaking up the bands created by combinations of quantized numbers.
This is my understanding as a non-expert.
LLM activations tend to be relatively sparse with large outliers. With linear quantization, this means you either have to clip off the outliers or you have to stretch your range to include the outliers, which wastes precious bits. Neither of these works well, so essentially all LLM quantization research is using various heuristics to get around these outliers. For example, you can do linear quantization but split the activations up into smaller blocks to make it less likely that any given block contains an outlier.
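Here's a minimal sketch (mine, not from the paper) of per-block symmetric quantization in NumPy, just to show why smaller blocks help: a single outlier only inflates the scale of its own block instead of the whole tensor. Block size and bit width are arbitrary choices for illustration.

```python
# Per-block symmetric int4-style quantization: one large outlier only hurts
# the block that contains it, not the scale of the entire tensor.
import numpy as np

def quantize_blocks(x, block_size=64, bits=4):
    """Quantize a 1-D tensor per block; returns int codes and per-block scales."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4 bits
    x = x.reshape(-1, block_size)
    scales = np.abs(x).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                       # avoid divide-by-zero
    q = np.clip(np.round(x / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)
x[123] = 50.0                                       # a single outlier

for bs in (4096, 64):                               # whole tensor vs small blocks
    q, s = quantize_blocks(x, block_size=bs)
    err = np.abs(dequantize_blocks(q, s) - x).mean()
    print(f"block_size={bs:5d}  mean abs error={err:.4f}")
```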
Another trick people have discovered (predates LLMs) is applying a random rotation/projection to the embeddings. This has the effect of making sure no one dimension in the vector dominates the others (which again hurts quantization). This works because in order for a single dimension to dominate, all the others have to "conspire" to be near zero. When you have 10,000+ dimensions, that's very unlikely.
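And a tiny illustration of the rotation trick (again my own sketch, not anything from the paper): a random orthogonal matrix spreads the outlier's energy across all dimensions, so no single coordinate dominates and a linear quantizer wastes fewer bits on range.

```python
# A random orthogonal rotation preserves the vector's norm but flattens out
# any single dominant coordinate.
import numpy as np

rng = np.random.default_rng(0)
d = 1024
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix

x = rng.normal(size=d)
x[0] = 100.0                                   # one coordinate dominates

y = Q @ x                                      # rotated vector, same L2 norm
print("max|x| / rms(x):", np.abs(x).max() / np.sqrt((x**2).mean()))
print("max|y| / rms(y):", np.abs(y).max() / np.sqrt((y**2).mean()))
# At inference you undo the rotation with Q.T after dequantizing.
```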
This paper applies the latter trick. Instead of pre-generating the random projection matrices, they generate them on the fly on the accelerator from a seed that is fixed for each block. The seed is chosen from an offline brute-force search that needs only the weights of the network. This separates it from a lot of other quantization methods that either require calibration data or have to be simulated at training time so the network learns the quantization parameters itself.
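Roughly, the encode side looks like the sketch below. This is my own illustration, not the paper's exact algorithm: NumPy's generator stands in for whatever hardware-friendly PRNG they actually use, the real method also quantizes the coefficients, and `block_size`, `rank`, and `num_seeds` are made-up parameters. The point is just that you brute-force the seed whose pseudo-random basis best reconstructs each weight block, then store only the seed plus a few coefficients, and no calibration data is needed.

```python
# Brute-force search for the seed whose pseudo-random basis reconstructs a
# weight block with the lowest error. Only the seed and the coefficients
# need to be stored; the basis is regenerated from the seed at decode time.
import numpy as np

def random_basis(seed, block_size=64, rank=4):
    return np.random.default_rng(seed).normal(size=(block_size, rank))

def encode_block(w, num_seeds=1024, block_size=64, rank=4):
    best = None
    for seed in range(num_seeds):
        U = random_basis(seed, block_size, rank)
        t, *_ = np.linalg.lstsq(U, w, rcond=None)   # least-squares coefficients
        err = np.linalg.norm(U @ t - w)
        if best is None or err < best[0]:
            best = (err, seed, t)
    return best                                     # (error, seed, coefficients)

def decode_block(seed, t, block_size=64, rank=4):
    return random_basis(seed, block_size, rank) @ t

w = np.random.default_rng(42).normal(size=64)
err, seed, t = encode_block(w)
w_hat = decode_block(seed, t)
print("best seed:", seed,
      "relative error:", np.linalg.norm(w_hat - w) / np.linalg.norm(w))
```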
You might think this is wasteful/might hurt performance, but it turns out that LLM inference is heavily memory-bound as it involves streaming a very large neural network into the accelerator (GPU/TPU/NPU/whatever) to operate on a relatively small amount of data, so there are lots of "free cycles" to generate these random numbers. Of course, if you care about power usage that might not be a great idea...
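A back-of-the-envelope calculation for why there are free cycles (the 8B-parameter / FP16 numbers below are my own assumptions, not measurements from the paper):

```python
# Batch-1 decoding streams essentially all the weights once per generated
# token, while the matmul work per token is tiny by comparison.
params = 8e9
bytes_per_param = 2                       # FP16
flops_per_token = 2 * params              # ~2 FLOPs per parameter (multiply + add)
bytes_per_token = params * bytes_per_param

arithmetic_intensity = flops_per_token / bytes_per_token
print(f"FLOPs per byte moved: {arithmetic_intensity:.1f}")
# ~1 FLOP/byte, while modern accelerators can sustain hundreds of FLOPs per
# byte of memory bandwidth -- leaving plenty of idle compute to regenerate
# the pseudo-random matrices from their seeds.
```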
For technical documentation, I'm experimenting with a similar concept: instead of exhaustively documenting every implementation detail, I define a minimal set of principles and architectural decisions that allow "regenerating" the complete understanding.
Current LLMs excel at expanding compressed concepts, but we're still far from finding the optimal balance between explicit knowledge (detailed documentation) and implicit knowledge (patterns and principles). Is anyone working on systems applying similar ideas to technical knowledge management?
We need advances like this if we want on-device AI to work well. It's the kind of thing Apple Silicon especially needs, since it's weak relative to Nvidia consumer chips.
It covers some experiments on weight tying, one of which is essentially LoRA with random weights.
IIUC they're transforming the data before compressing it. Also IIUC this is an established method.
Because of the nature of the data and the transform involved, you can get reasonable results with random numbers. That's already been done, but this work brute forces seeds to optimize the compression ratio and then derives the transform on the fly from the seed in order to save on memory bandwidth.
I feel like (again, non-expert) there are much deeper implications about current ML models here. The fact that a randomized transform can have this sort of impact seems to imply that there's much less information encoded by the data than we otherwise might expect given its sheer size.
Regarding Pi: you can't encode arbitrary data using arbitrary sequences and expect to come out ahead on average. But you can encode specific data using algorithms that exhibit specific behavior.