
343 points by sillysaurusx | 1 comment
v64 ◴[] No.35028738[source]
If anyone is interested in running this at home, please follow the llama-int8 project [1]. LLM.int8() is a recent development allowing LLMs to run in half the memory without loss of performance [2]. Note that at the end of [2]'s abstract, the authors state "This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software." I'm very thankful we have researchers like this further democratizing access to this data and prying it out of the hands of the gatekeepers who wish to monetize it.

[1] https://github.com/tloen/llama-int8

[2] https://arxiv.org/abs/2208.07339
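
For a concrete sense of what running a model with the approach in [2] looks like in practice, here is a minimal sketch using the bitsandbytes 8-bit (LLM.int8()) integration in Hugging Face transformers. The checkpoint name is just an illustrative stand-in (the LLaMA weights aren't distributed on the Hub), and the exact flags assume a recent transformers/accelerate/bitsandbytes install.

```python
# Minimal sketch: loading a causal LM with LLM.int8() weight quantization via the
# bitsandbytes integration in Hugging Face transformers. The model name below is
# a placeholder; substitute whatever weights you have available locally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # stand-in example; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # spread layers across available GPUs/CPU
    load_in_8bit=True,   # enable LLM.int8() quantized linear layers
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```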

replies(5): >>35028950 #>>35029068 #>>35029601 #>>35030214 #>>35030868 #
downvotetruth ◴[] No.35029068[source]
Eagerly awaiting the int8 vs. int4 benchmarks. Also, it can run on CPU: https://github.com/markasoftware/llama-cpu. So an int8 patch could allow the 65B model to run on a standard 128 GB setup, assuming the 65B model's cache bursts fit, which, if I were to speculate, is why the released models stop at 65B and Meta likely already has larger unreleased internal ones.
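
For rough intuition on why 128 GB is the interesting threshold, a back-of-the-envelope estimate of weight memory alone (ignoring activations and the KV cache, which is an assumption) looks like this:

```python
# Back-of-the-envelope memory estimate for a 65B-parameter model at different
# precisions, counting only the weights themselves.
params = 65e9

for name, bytes_per_weight in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_weight / 2**30
    print(f"{name}: ~{gib:.0f} GiB of weights")

# fp16: ~121 GiB, int8: ~61 GiB, int4: ~30 GiB -- so int8 weights alone
# fit in a 128 GB system, leaving some headroom for activations.
```
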
replies(1): >>35029092 #
v64 ◴[] No.35029092[source]
Early int4 experiments seem to indicate it's possible, but you do lose some performance; see this thread: https://www.reddit.com/r/MachineLearning/comments/11i4olx/d_...

edit: to clarify, it may be possible to recover this loss, and there is reason to be optimistic

replies(1): >>35029917 #
CuriouslyC ◴[] No.35029917[source]
Probably the best method is to just train it in int4 in the first place. Fine-tuning after quantization would definitely help, though.
replies(2): >>35030116 #>>35034771 #
sp332 ◴[] No.35030116[source]
Isn't that backwards? You need fairly good resolution during training or your gradients will be pointing all over the place. Once you've found a good minimum point, moving a little away from it with reduced precision is probably OK.
replies(2): >>35030220 #>>35030326 #
rfoo ◴[] No.35030326[source]
GP may be referring to quantization-aware training, during which the weights and gradients are still computed in fp16/fp32.
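
As a rough illustration of what quantization-aware training means here, below is a minimal PyTorch sketch using fake quantization with a straight-through estimator; the bit width and layer sizes are arbitrary, and real QAT setups differ in many details.

```python
# Minimal sketch of quantization-aware training with a straight-through estimator:
# the forward pass sees "fake" int4-quantized weights, while the gradients and the
# master weights stay in full precision.
import torch
import torch.nn as nn

def fake_quant(w, bits=4):
    # Symmetric per-tensor quantization to 2**bits levels, then dequantize.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    wq = q * scale
    # Straight-through estimator: use quantized values in the forward pass,
    # but let gradients flow as if no rounding had happened.
    return w + (wq - w).detach()

class QATLinear(nn.Linear):
    def forward(self, x):
        return nn.functional.linear(x, fake_quant(self.weight, bits=4), self.bias)

layer = QATLinear(512, 512)
x = torch.randn(8, 512)
layer(x).sum().backward()       # gradients land on the fp32 master weights
print(layer.weight.grad.shape)  # torch.Size([512, 512])
```
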
replies(1): >>35037181 #
CuriouslyC ◴[] No.35037181[source]
It can go further than that; it seems like the weight gradients are the main place where precision is a bottleneck (see https://arxiv.org/abs/1805.11046).