Congrats to Apple and Meta; it makes sense that they did the research, since this will go toward efficient serving of LLMs on phones. And it's very easy to implement.
"zero-shot accuracy retention at 4- and 3-bit compression to be on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines."
My reading of that is FP16-level accuracy at Q3/Q4 size and memory bandwidth, which is a huge advantage.
* LLaMA 3 8B: baseline 72.26, 4-bit 71.31, 3-bit 62.79
* LLaMA 3 70B: baseline 79.51, 4-bit 78.06, 3-bit 74.68
These results seem comparable to modern quantization methods—for example, the ~4-bit results for smaller LLaMA models listed here: https://ai.meta.com/blog/meta-llama-quantized-lightweight-mo...
Also seems like the techniques may be possible to combine.
This technique seems a bit similar to lossy image compression that replaces exact pixels with a combination of pre-defined patterns (the DCT in JPEG), except here the patterns come not from a cosine function but from a pseudo-random one.
It may also beat simple quantization because the added noise acts as dithering, breaking up the bands created by combinations of quantized numbers.
This is my understanding as a non-expert.
LLM activations tend to be relatively sparse with large outliers. With linear quantization, this means you either have to clip off the outliers or you have to stretch your range to include the outliers, which wastes precious bits. Neither of these works well, so essentially all LLM quantization research is using various heuristics to get around these outliers. For example, you can do linear quantization but split the activations up into smaller blocks to make it less likely that any given block contains an outlier.
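Here's a minimal sketch (mine, not from the paper) of per-block symmetric quantization in NumPy, just to show why smaller blocks help: a single outlier only inflates the scale of its own block instead of the whole tensor. Block size and bit width are arbitrary choices for illustration.

```python
# Per-block symmetric int4-style quantization: one large outlier only hurts
# the block that contains it, not the scale of the entire tensor.
import numpy as np

def quantize_blocks(x, block_size=64, bits=4):
    """Quantize a 1-D tensor per block; returns int codes and per-block scales."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4 bits
    x = x.reshape(-1, block_size)
    scales = np.abs(x).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                       # avoid divide-by-zero
    q = np.clip(np.round(x / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)
x[123] = 50.0                                       # a single outlier

for bs in (4096, 64):                               # whole tensor vs small blocks
    q, s = quantize_blocks(x, block_size=bs)
    err = np.abs(dequantize_blocks(q, s) - x).mean()
    print(f"block_size={bs:5d}  mean abs error={err:.4f}")
```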
Another trick people have discovered (predates LLMs) is applying a random rotation/projection to the embeddings. This has the effect of making sure no one dimension in the vector dominates the others (which again hurts quantization). This works because in order for a single dimension to dominate, all the others have to "conspire" to be near zero. When you have 10,000+ dimensions, that's very unlikely.
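And a tiny illustration of the rotation trick (again my own sketch, not anything from the paper): a random orthogonal matrix spreads the outlier's energy across all dimensions, so no single coordinate dominates and a linear quantizer wastes fewer bits on range.

```python
# A random orthogonal rotation preserves the vector's norm but flattens out
# any single dominant coordinate.
import numpy as np

rng = np.random.default_rng(0)
d = 1024
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix

x = rng.normal(size=d)
x[0] = 100.0                                   # one coordinate dominates

y = Q @ x                                      # rotated vector, same L2 norm
print("max|x| / rms(x):", np.abs(x).max() / np.sqrt((x**2).mean()))
print("max|y| / rms(y):", np.abs(y).max() / np.sqrt((y**2).mean()))
# At inference you undo the rotation with Q.T after dequantizing.
```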
This paper applies the latter trick. Instead of pre-generating the random projection matrices, they generate them on the fly on the accelerator from a seed that is fixed for each block. The seed is chosen from an offline brute-force search that needs only the weights of the network. This separates it from a lot of other quantization methods that either require calibration data or have to be simulated at training time so the network learns the quantization parameters itself.
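Roughly, the encode side looks like the sketch below. This is my own illustration, not the paper's exact algorithm: NumPy's generator stands in for whatever hardware-friendly PRNG they actually use, the real method also quantizes the coefficients, and `block_size`, `rank`, and `num_seeds` are made-up parameters. The point is just that you brute-force the seed whose pseudo-random basis best reconstructs each weight block, then store only the seed plus a few coefficients, and no calibration data is needed.

```python
# Brute-force search for the seed whose pseudo-random basis reconstructs a
# weight block with the lowest error. Only the seed and the coefficients
# need to be stored; the basis is regenerated from the seed at decode time.
import numpy as np

def random_basis(seed, block_size=64, rank=4):
    return np.random.default_rng(seed).normal(size=(block_size, rank))

def encode_block(w, num_seeds=1024, block_size=64, rank=4):
    best = None
    for seed in range(num_seeds):
        U = random_basis(seed, block_size, rank)
        t, *_ = np.linalg.lstsq(U, w, rcond=None)   # least-squares coefficients
        err = np.linalg.norm(U @ t - w)
        if best is None or err < best[0]:
            best = (err, seed, t)
    return best                                     # (error, seed, coefficients)

def decode_block(seed, t, block_size=64, rank=4):
    return random_basis(seed, block_size, rank) @ t

w = np.random.default_rng(42).normal(size=64)
err, seed, t = encode_block(w)
w_hat = decode_block(seed, t)
print("best seed:", seed,
      "relative error:", np.linalg.norm(w_hat - w) / np.linalg.norm(w))
```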
You might think this is wasteful/might hurt performance, but it turns out that LLM inference is heavily memory-bound as it involves streaming a very large neural network into the accelerator (GPU/TPU/NPU/whatever) to operate on a relatively small amount of data, so there are lots of "free cycles" to generate these random numbers. Of course, if you care about power usage that might not be a great idea...
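A back-of-the-envelope calculation for why there are free cycles (the 8B-parameter / FP16 numbers below are my own assumptions, not measurements from the paper):

```python
# Batch-1 decoding streams essentially all the weights once per generated
# token, while the matmul work per token is tiny by comparison.
params = 8e9
bytes_per_param = 2                       # FP16
flops_per_token = 2 * params              # ~2 FLOPs per parameter (multiply + add)
bytes_per_token = params * bytes_per_param

arithmetic_intensity = flops_per_token / bytes_per_token
print(f"FLOPs per byte moved: {arithmetic_intensity:.1f}")
# ~1 FLOP/byte, while modern accelerators can sustain hundreds of FLOPs per
# byte of memory bandwidth -- leaving plenty of idle compute to regenerate
# the pseudo-random matrices from their seeds.
```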
For technical documentation, I'm experimenting with a similar concept: instead of exhaustively documenting every implementation detail, I define a minimal set of principles and architectural decisions that allow "regenerating" the complete understanding.
Current LLMs excel at expanding compressed concepts, but we're still far from finding the optimal balance between explicit knowledge (detailed documentation) and implicit knowledge (patterns and principles). Is anyone working on systems applying similar ideas to technical knowledge management?
We need advances like this if we want on-device AI to work well. It's the kind of thing Apple Silicon especially needs, since it's weak relative to Nvidia consumer chips.
It covers some experiments on weight tying, one of which is essentially LoRA with random weights.
IIUC they're transforming the data before compressing it. Also IIUC this is an established method.
Because of the nature of the data and the transform involved, you can get reasonable results with random numbers. That's already been done, but this work brute forces seeds to optimize the compression ratio and then derives the transform on the fly from the seed in order to save on memory bandwidth.
I feel like (again, non-expert) there are much deeper implications about current ML models here. The fact that a randomized transform can have this sort of impact seems to imply that there's much less information encoded by the data than we otherwise might expect given its sheer size.
Regarding Pi: you can't encode arbitrary data using arbitrary sequences and expect to come out ahead on average. But you can encode specific data using algorithms that exhibit specific behavior.