Most active commenters

bick_nyers(5)
CuriouslyC(3)
nl(3)

Popular/hot comments

>>35030868 #
>>35029601 #

←back to thread

Show HN: Llama-dl – high-speed download of LLaMA, Facebook's 65B GPT model

(github.com)

1. v64 ◴[05 Mar 23 11:44 UTC] No.35028738[source]▶

>>35026902 (OP) #

If anyone is interested in running this at home, please follow the llama-int8 project [1]. LLM.int8() is a recent development allowing LLMs to run in half the memory without loss of performance [2]. Note that at the end of [2]'s abstract, the authors state "This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software." I'm very thankful we have researchers like this further democratizing access to this data and prying it out of the hands of the gatekeepers who wish to monetize it.

[1] https://github.com/tloen/llama-int8

[2] https://arxiv.org/abs/2208.07339

replies(5): >>35028950 #>>35029068 #>>35029601 #>>35030214 #>>35030868 #

2. rnosov ◴[05 Mar 23 12:17 UTC] No.35028950[source]▶

>>35028738 (TP) #

Hmmm, the Github repo suggests that you might be able to run the 65B model on a single A100 80gb card. At the moment, the spot price on Google cloud for this card is $1.25/hour which makes it not so crazy expensive...

replies(1): >>35031058 #

3. downvotetruth ◴[05 Mar 23 12:34 UTC] No.35029068[source]▶

>>35028738 (TP) #

Eagerly awaiting the int8 vs 4 benchmarks. Also, it can run on CPU https://github.com/markasoftware/llama-cpu So, an int8 patch could allow the 65B to run on a standard 128 GB setup assuming the 65B model's cache bursts fit, which if I were to speculate is why the released models stop @ 65B & meta likely already has larger unreleased internal ones.

replies(1): >>35029092 #

4. v64 ◴[05 Mar 23 12:37 UTC] No.35029092[source]▶

>>35029068 #

early int4 experiments seem to indicate it's possible but you do lose performance, see this thread https://www.reddit.com/r/MachineLearning/comments/11i4olx/d_...

edit: to clarify, it may be possible to get this loss back and there is reason to be optimistic

replies(1): >>35029917 #

5. swyx ◴[05 Mar 23 13:54 UTC] No.35029601[source]▶

>>35028738 (TP) #

why is it that these models tend to be released as float16 and converting to int8 is left to the reader? is there something special about training that defaults you to float16?

replies(3): >>35030106 #>>35030321 #>>35033050 #

6. CuriouslyC ◴[05 Mar 23 14:36 UTC] No.35029917{3}[source]▶

>>35029092 #

Probably the best method is to just train it on int4 in the first place. Fine tuning after quantization would definitely help though.

replies(2): >>35030116 #>>35034771 #

7. sillysaurusx ◴[05 Mar 23 14:58 UTC] No.35030106[source]▶

>>35029601 #

They were trained in fp16, and researchers tend to release whatever format they trained. It’s hard enough to do a large release that it’s best not to try to have too many goals, for the same reason most software projects try not to do too much lest their schedule slip.

Still, I’m a little sad they didn’t release the optimizer weights. It would’ve given us so much valuable info about the dataset, among other benefits.

8. sp332 ◴[05 Mar 23 14:58 UTC] No.35030116{4}[source]▶

>>35029917 #

Isn't that backwards? You need fairly good resolution during training or your gradients will be pointing all over the place. Once you've found a good minimum point, moving a little away from it with reduced precision is probably OK.

replies(2): >>35030220 #>>35030326 #

9. nextaccountic ◴[05 Mar 23 15:09 UTC] No.35030214[source]▶

>>35028738 (TP) #

If the model weights are stored as int8, does this mean that the floating point capacity of the GPU is wasted? Or the int8 is converted to float in the GPU?

replies(1): >>35032440 #

10. brookst ◴[05 Mar 23 15:10 UTC] No.35030220{5}[source]▶

>>35030116 #

I have no idea what the right answer is, but I think the argument for int4 training is that the loss measurements would take the lower resolution of the model as a whole into account.

Is it better to have billions of high resolution parameters and quantize them at the end, or to train low resolution parameters where the training algorithms see the lower resolution? It’s beyond me, but I’d love to know.

replies(2): >>35031851 #>>35035673 #

11. dspillett ◴[05 Mar 23 15:20 UTC] No.35030321[source]▶

>>35029601 #

Precision, aiming those names refer to standard binary numeric types. IEEE754 16-bit floats carry 11 significant digits with absolute precision so by coverting to 8-bit integers you lose some of that. Depending on the distribution of the values in those floats you could be loosing a lot more detail then this would imply, which is the reason we use floating point numbers for anything in the first place (rather than using an int16 where you have greater precision at you maximum scale but much less at lower scales).

So if the model is computed using float16s, distribute as-is and let the end user choose to user it like that or compromise for faster processing of there system can deal with many billions of int8s more effectively.

replies(1): >>35039562 #

12. rfoo ◴[05 Mar 23 15:21 UTC] No.35030326{5}[source]▶

>>35030116 #

GP could be mentioning quantization aware training, during which the weight and gradient are still computed in fp16/fp32.

replies(1): >>35037181 #

13. causality0 ◴[05 Mar 23 16:10 UTC] No.35030868[source]▶

>>35028738 (TP) #

I feel like we're less than a decade away from being able to hook LLMs into gaming. How incredible would it be to have NPCs driven by LLM?

replies(5): >>35031124 #>>35031255 #>>35033211 #>>35034447 #>>35058462 #

14. nabla9 ◴[05 Mar 23 16:29 UTC] No.35031058[source]▶

>>35028950 #

$1.25/hour is roughly a year of GPU time until it exceeds the price of A100 80GB card.

replies(1): >>35032315 #

15. visarga ◴[05 Mar 23 16:37 UTC] No.35031124[source]▶

>>35030868 #

We'll soon have LLMs in operating systems, LLMs in browsers and you are right, probably also in games. LLMs will be the platform on which we build almost everything.

replies(1): >>35032463 #

16. SloopJon ◴[05 Mar 23 16:45 UTC] No.35031255[source]▶

>>35030868 #

There was an Ask HN post about that idea a couple of months ago:

https://news.ycombinator.com/item?id=34478503

I have long wished for less linear stories in video games, where branching narrative (a la Choose Your Own Adventure) is one possible way to give the player agency. The problem is, true branches are expensive, because you end up writing a bunch of content the player never experiences.

I see a lot of potential, but it's going to take a different kind of craftsmanship, and likely many iterations, to realize something more than a novelty.

replies(2): >>35031908 #>>35035818 #

17. Scene_Cast2 ◴[05 Mar 23 17:36 UTC] No.35031851{6}[source]▶

>>35030220 #

But by default, training algos don't see the lower resolution, your gradient just doesn't work as well. There is a body of research on how to make training aware of / adapt to the lower precision.

18. causality0 ◴[05 Mar 23 17:41 UTC] No.35031908{3}[source]▶

>>35031255 #

I much prefer handcrafted stories and quests. Characters that respond dynamically to the story and the player's actions, however, is quite tantalizing.

replies(1): >>35043708 #

19. metadat ◴[05 Mar 23 18:16 UTC] No.35032315{3}[source]▶

>>35031058 #

I think OP meant that $1.25/hr makes this accessible for people try it out themselves cost effectively, without having to spend thousands or tens of thousands up front to obtain a capable hardware rig.

Obviously $1.25/hr 24/7 does add up quickly, after one month the bill would come to $900.

replies(1): >>35032434 #

20. ◴[05 Mar 23 18:28 UTC] No.35032434{4}[source]▶

>>35032315 #

21. woodson ◴[05 Mar 23 18:28 UTC] No.35032440[source]▶

>>35030214 #

Well, tensor cores support int8 instructions (at least from Turing onwards), so the hardware is being used, if that’s your concern.

22. charcircuit ◴[05 Mar 23 19:32 UTC] No.35033050[source]▶

>>35029601 #

Quantization and other optimizations are more for productionizing models. You start with something accurate and then you start making tradeoffs to get the inference time to fit into your compute, memory, and time budgets.

23. bloaf ◴[05 Mar 23 19:48 UTC] No.35033211[source]▶

>>35030868 #

I'd be satisfied plugging a game log/history into a system that generates the epic tale of your victory/defeat.

24. pixl97 ◴[05 Mar 23 21:55 UTC] No.35034447[source]▶

>>35030868 #

Honestly I don't think it would be completely impossible now in a limited fashion.

Imagine playing a level and doing some particular feats in it. They get presented to GPT with a prompt and the story gets send to a AI voice model in game where the NPC asks/tells the player character about it.

25. nl ◴[05 Mar 23 22:22 UTC] No.35034771{4}[source]▶

>>35029917 #

> Probably the best method is to just train it on int4 in the first place

Unclear why you think that since experiments show the opposite.

In general the gradient seems to get too "bumpy" to do good gradient decent at lower levels of precision.

There are some papers showing that making the training loop aware of quantitization can help ultimate quantizied performance but I'm not aware of this being implemented at large scale.

replies(2): >>35035732 #>>35036328 #

26. bick_nyers ◴[06 Mar 23 00:01 UTC] No.35035673{6}[source]▶

>>35030220 #

I think the answer is it depends, and further, a dynamic approach may be best. Imagine you are going on a hike, and you have different maps at various resolutions (levels of detail). When planning the hike, you will want to see the zoomed out picture to get general directions, elevations and landmarks identified. Then you can zoom in to see the actual trails themselves, to identify your route, and then you zoom in even further when you are on the ground actually walking, avoiding obstacles along the way.

Different resolutions draw your attention to different types of features.

27. bick_nyers ◴[06 Mar 23 00:08 UTC] No.35035732{5}[source]▶

>>35034771 #

What if you smooth the gradient, either by interpolating/removing data points that make the surface "jagged", or maybe change the "window" of gradient descent, meaning instead of using a tangent (derivative, infinitesimally small window) you use a secant (???, window of specified length, likely calculated from the data space).

Forgive my lack of proper terminology here.

replies(1): >>35036021 #

28. bick_nyers ◴[06 Mar 23 00:16 UTC] No.35035818{3}[source]▶

>>35031255 #

In general I would say story =/= dialogue (which an LLM can much more easily be used for). I see two main "tricks" that would make the more complicated case (story) possible.

1. You bound the branching in a particular fashion, and provide overall "pressures" into certain story arcs.

2. You use generative AI in a LOT more places in the game.

What happens when you are playing a Sci-Fi game, and you get the enemy NPC to somehow hallucinate that he is the King of Dragons, but you don't have Dragon models/animations/movesets in your game files? You either bound the LLM to not hallucinate that, or you generate that dragon live. I guess a 3rd option, is your game is a comedy and the King NPC gets labeled a crazy person.

29. nl ◴[06 Mar 23 00:40 UTC] No.35036021{6}[source]▶

>>35035732 #

Sure, there are multiple ways to reduce the complexity of your loss-space, but the issue is that you usually want these small gradient values because they are important. Roughly if you "smooth over what appears to be a small hole" often you'll miss a large space that needs to be explored (obviously this is multi-dimensional but you get the idea).

However you can reduce memory by doing mixed-precision training if you are careful. See section "2.3.1. Loss Scaling To Preserve Small Gradient Magnitudes" in https://docs.nvidia.com/deeplearning/performance/mixed-preci...

replies(1): >>35036318 #

30. bick_nyers ◴[06 Mar 23 01:15 UTC] No.35036318{7}[source]▶

>>35036021 #

So then you would need to do some kind of mesh simplification that also preserves the topology, that makes sense.

I'm not quite sure I understand what they are describing in 2.3.1, are they scaling those small gradient magnitudes larger to try to "pull" you into those holes faster?

I was thinking the a way to go about it would be to just increase the "mesh resolution" near the small hole, which in this case would be use a larger precision in the area local to the hole.

replies(2): >>35038002 #>>35040827 #

31. CuriouslyC ◴[06 Mar 23 01:16 UTC] No.35036328{5}[source]▶

>>35034771 #

My take away was that the reduced performance of natively trained models was more about numerical instability related to training process than a statement about limitations of low precision models.

32. CuriouslyC ◴[06 Mar 23 03:23 UTC] No.35037181{6}[source]▶

>>35030326 #

It can go farther than that, it seems like the weight gradients are the main thing where the precision is a bottleneck (see https://arxiv.org/abs/1805.11046).

33. nl ◴[06 Mar 23 06:09 UTC] No.35038002{8}[source]▶

>>35036318 #

> are they scaling those small gradient magnitudes larger to try to "pull" you into those holes faster?

No, they are making the numbers bigger so the drop in precision doesn't lose details.

34. dspillett ◴[06 Mar 23 10:56 UTC] No.35039562{3}[source]▶

>>35030321 #

("aiming" should have been "assuming" in that second word – noticed far too late to correct, I really should stop using my phone's slide keyboard, either it or I or both are getting far less reliable)

35. lumost ◴[06 Mar 23 13:41 UTC] No.35040827{8}[source]▶

>>35036318 #

I suspect that changing the resolution around hot points in the manifold would be a more expensive task than training the model on a higher global resolution. Optimization algorithms currently do not maintain state on the loss-manifold.

replies(1): >>35042904 #

36. bick_nyers ◴[06 Mar 23 16:25 UTC] No.35042904{9}[source]▶

>>35040827 #

My naive (and I do mean naive) thought here is that you just need a cheap detection function of when you need to swap precision. I'm pretty stuck on the geometric interpretation here but basically if the training step is "within a radius" of a known hot point of the manifold then you swap precision. It's very possible though that I am hallucinating something that is not possible, I don't actually understand how this stuff really works yet.

replies(1): >>35050211 #

37. ElFitz ◴[06 Mar 23 17:17 UTC] No.35043708{4}[source]▶

>>35031908 #

We could have handcrafted stories and quests, with LLM-driven dialogues for NPCs canned responses (ie the infamous arrow and the proverbial knee).

And teams with limited resources could also still handcraft the stories and quests but use LLMs to generate or add some variety or context awareness to the dialogues, at a lower cost.

38. lumost ◴[07 Mar 23 01:13 UTC] No.35050211{10}[source]▶

>>35042904 #

The challenge here is knowing the shape of the manifold within an epsilon radius 65 Billion dimension sphere around the position being evaluated. To calculate this you would need to sample points within epsilon radius around the current point. As these points will be lower-precision by default, you would have minimal knowledge of the shape of the manifold within the sphere if epsilon is < the minimum precision.

It might be possible to work around this by estimating the gradient volatility through the n^th order derivatives, but you would then also have to deal with mixed precision SIMD which hardware doesn't really support.

39. ZunarJ5 ◴[07 Mar 23 17:45 UTC] No.35058462[source]▶

>>35030868 #

There are already several plugins for Unreal Engine. I am going to assume the same for Unity.

https://www.youtube.com/watch?v=i-Aw32rgM-w&ab_channel=Kella...

↑