
343 points | sillysaurusx | 2 comments

v64:
If anyone is interested in running this at home, please follow the llama-int8 project [1]. LLM.int8() is a recent development allowing LLMs to run in half the memory without loss of performance [2]. Note that at the end of [2]'s abstract, the authors state "This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software." I'm very thankful we have researchers like this further democratizing access to this data and prying it out of the hands of the gatekeepers who wish to monetize it.

[1] https://github.com/tloen/llama-int8

[2] https://arxiv.org/abs/2208.07339
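
For anyone who wants to try the int8 path, here's a minimal sketch using the bitsandbytes LLM.int8() integration in Hugging Face transformers. It assumes the weights have already been converted to HF format; the model path is a hypothetical placeholder.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "path/to/llama-7b-hf"   # hypothetical placeholder for locally converted weights

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_8bit=True,   # LLM.int8(): int8 weights with fp16 outlier handling
        device_map="auto",   # spread layers across available devices
    )

    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))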

downvotetruth:
Eagerly awaiting the int8 vs int4 benchmarks. Also, it can run on CPU: https://github.com/markasoftware/llama-cpu. An int8 patch could allow the 65B model to run on a standard 128 GB setup, assuming the 65B model's cache bursts fit. If I were to speculate, that is why the released models stop at 65B, and Meta likely already has larger unreleased internal ones.
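
For a rough sense of the numbers (weights only; the KV cache and activations add overhead on top):

    # Weights-only memory footprint of a 65B-parameter model at different precisions.
    params = 65e9
    for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name}: {params * bytes_per_param / 2**30:.0f} GiB")
    # fp32 ~242 GiB, fp16 ~121 GiB, int8 ~61 GiB, int4 ~30 GiB --
    # so int8 is what makes 65B plausible on a 128 GB box.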

v64:
Early int4 experiments seem to indicate it's possible, but you do lose performance; see this thread: https://www.reddit.com/r/MachineLearning/comments/11i4olx/d_...

edit: to clarify, it may be possible to get this loss back, and there is reason to be optimistic

CuriouslyC:
Probably the best method is to just train it on int4 in the first place. Fine-tuning after quantization would definitely help, though.

nl:
> Probably the best method is to just train it on int4 in the first place

Unclear why you think that, since experiments show the opposite.

In general the gradient seems to get too "bumpy" to do good gradient descent at lower levels of precision.

There are some papers showing that making the training loop aware of quantization can help the final quantized performance, but I'm not aware of this being implemented at large scale.
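
For reference, quantization-aware training typically means "fake quantizing" the weights in the forward pass and passing the gradient straight through the rounding in the backward pass. A toy PyTorch sketch of that mechanism (an illustration only, not taken from any of the papers mentioned):

    import torch

    class FakeQuant4(torch.autograd.Function):
        """Fake-quantize to signed 4-bit levels in the forward pass; pass the
        gradient straight through the rounding in the backward pass (STE)."""

        @staticmethod
        def forward(ctx, w, scale):
            q = torch.clamp(torch.round(w / scale), -8, 7)   # 16 int4 levels
            return q * scale                                 # dequantize back to float

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output, None                         # ignore the rounding step

    # Toy usage: master weights stay in full precision, the forward pass sees
    # the quantized values, and the gradient updates the full-precision copy.
    w = torch.randn(256, 256, requires_grad=True)
    scale = w.detach().abs().max() / 7
    loss = FakeQuant4.apply(w, scale).pow(2).sum()
    loss.backward()
    print(w.grad.abs().mean())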

bick_nyers:
What if you smooth the gradient, either by interpolating/removing data points that make the surface "jagged", or by changing the "window" of gradient descent: instead of using a tangent (the derivative, an infinitesimally small window), you use a secant (a window of specified length, likely calculated from the data space).

Forgive my lack of proper terminology here.
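
As a toy 1-D picture of the tangent-vs-secant idea (illustrative only, with a made-up bumpy loss):

    import numpy as np

    def loss(x):
        return x ** 2 + 0.1 * np.sin(50 * x)          # smooth bowl plus jagged noise

    def tangent_slope(x, eps=1e-6):                   # ~ exact local derivative
        return (loss(x + eps) - loss(x - eps)) / (2 * eps)

    def secant_slope(x, h=0.2):                       # slope over a wide window
        return (loss(x + h) - loss(x - h)) / (2 * h)

    x = 1.0
    print("tangent:", tangent_slope(x))               # ~6.8: dominated by the local wiggle
    print("secant: ", secant_slope(x))                # ~1.7: close to the smooth gradient 2x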

nl:
Sure, there are multiple ways to reduce the complexity of your loss space, but the issue is that you usually want these small gradient values because they are important. Roughly, if you smooth over what appears to be a small hole, you'll often miss a large space that needs to be explored (obviously this is multi-dimensional, but you get the idea).

However you can reduce memory by doing mixed-precision training if you are careful. See section "2.3.1. Loss Scaling To Preserve Small Gradient Magnitudes" in https://docs.nvidia.com/deeplearning/performance/mixed-preci...
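
In PyTorch, dynamic loss scaling is what torch.cuda.amp.GradScaler handles. A minimal sketch of that loop (the tiny model and random data are placeholders, and it assumes a CUDA device):

    import torch
    import torch.nn as nn

    device = "cuda"                                   # assumes a CUDA device
    model = nn.Linear(128, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()              # dynamic loss scaling

    for _ in range(10):
        x = torch.randn(32, 128, device=device)
        y = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():               # forward pass in fp16
            loss = nn.functional.mse_loss(model(x), y)
        scaler.scale(loss).backward()                 # scale up so small grads don't flush to zero
        scaler.step(optimizer)                        # unscale; skip the step if grads overflowed
        scaler.update()                               # adapt the scale factor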

bick_nyers:
So then you would need to do some kind of mesh simplification that also preserves the topology; that makes sense.

I'm not quite sure I understand what they are describing in 2.3.1. Are they scaling those small gradient magnitudes larger to try to "pull" you into those holes faster?

I was thinking a way to go about it would be to just increase the "mesh resolution" near the small hole, which in this case would mean using higher precision in the area local to the hole.

lumost:
I suspect that changing the resolution around hot points in the manifold would be a more expensive task than training the model at a higher global resolution. Optimization algorithms currently do not maintain state on the loss manifold.

bick_nyers:
My naive (and I do mean naive) thought here is that you just need a cheap detection function for when you need to swap precision. I'm pretty stuck on the geometric interpretation here, but basically, if the training step is "within a radius" of a known hot point of the manifold, then you swap precision. It's very possible, though, that I am hallucinating something that is not possible; I don't actually understand how this stuff really works yet.
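
One hypothetical version of such a "cheap detection function" (entirely made up, just to make the idea concrete): check what fraction of a gradient's nonzero entries would flush to zero at fp16 precision, and flag that tensor for higher precision if the fraction is too high.

    import torch

    FP16_TINY = 2 ** -14          # smallest normal fp16 magnitude, ~6.1e-5

    def needs_high_precision(grad: torch.Tensor, max_underflow_frac: float = 0.25) -> bool:
        """Made-up heuristic: True if too many nonzero gradient entries would
        flush to zero when stored in fp16. The threshold is arbitrary."""
        nonzero = grad != 0
        if nonzero.sum() == 0:
            return False
        underflow = (grad.abs() < FP16_TINY) & nonzero
        return (underflow.sum() / nonzero.sum()).item() > max_underflow_frac

    g = torch.randn(1000) * 1e-5                      # mostly below fp16's normal range
    print(needs_high_precision(g))                    # -> True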

lumost:
The challenge here is knowing the shape of the manifold within an epsilon-radius sphere (in 65-billion-dimensional space) around the position being evaluated. To calculate this you would need to sample points within an epsilon radius around the current point. As these points will be lower precision by default, you would have minimal knowledge of the shape of the manifold within the sphere if epsilon is smaller than the minimum representable precision.

It might be possible to work around this by estimating the gradient volatility through the nth-order derivatives, but you would then also have to deal with mixed-precision SIMD, which hardware doesn't really support.
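
One way to make the "gradient volatility via higher-order derivatives" idea concrete is a directional curvature estimate from a finite difference of gradients. A toy PyTorch sketch (illustrative only, and it sidesteps the precision question entirely):

    import torch

    def loss_fn(w):
        return (w ** 2).sum() + 0.01 * torch.sin(100 * w).sum()   # toy bumpy loss

    w = torch.randn(10, requires_grad=True)
    g = torch.autograd.grad(loss_fn(w), w)[0]          # gradient at the current point
    d = -g / g.norm()                                  # proposed step direction
    eps = 1e-3
    g_shifted = torch.autograd.grad(loss_fn(w + eps * d), w)[0]

    # Directional second derivative (curvature) along d; a large value suggests
    # the gradient changes quickly in that direction.
    curvature = (g_shifted - g).dot(d) / eps
    print(curvature.item())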