Ok I answered my own question.
In other words, the groups of folks working on training models don’t necessarily have access to the sort of optimization engineers who work in other areas.
When all of this leaked into the open, it caused a lot of people knowledgeable in different areas to put their own expertise to the task. Some of those efforts (mmap-based weight loading, for instance) paid off spectacularly. Expect industry to copy the best of these improvements.
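For anyone curious why mmap helps so much, here's a minimal Python sketch of the idea, assuming a raw fp16 weight blob at a hypothetical path "weights.bin" (real formats like GGML add headers and per-tensor metadata on top of this):

    import numpy as np

    # The OS pages weights in lazily instead of copying the whole file
    # into RAM up front. Nothing is read from disk at this point.
    weights = np.memmap("weights.bin", dtype=np.float16, mode="r")

    # Touching a slice faults in only the pages that slice covers, so
    # startup is nearly instant and unused tensors never hit memory.
    first_layer = weights[:4096 * 4096].reshape(4096, 4096)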
It's important to note that all of these improvements are the kinds of things that are cheap to run on a pretrained model. The recent large language models themselves, by contrast, are the product of hundreds of thousands of dollars in rented compute time. Once you put six digits into a pile of model weights, that becomes a capital cost the business needs to either recoup or turn into a competitive advantage. So nobody who scales up to this point releases their model weights.
The model in question - LLaMA - isn't even a public model. It leaked and people copied[0] it. But because such a large model leaked, now people can actually work on iterative improvements again.
Unfortunately we don't really have a way for the FOSS community to pool together that much money to buy compute from cloud providers. Contributions-in-kind through distributed computing (e.g. a "GPT@home" project) would require significant changes to training methodology[1]. Further compounding this, the state-of-the-art is actually kind of a trade secret now. Exact training code isn't always available, and OpenAI has even gone so far as to refuse to say anything about GPT-4's architecture or training set to prevent open replication.
[0] I'm avoiding the use of the verb "stole" here, not just because I support filesharing, but because copyright law likely does not protect AI model weights alone.
[1] AI training has very high minimum requirements to get in the door. If your GPU has 12 GB of VRAM and your model and gradients require 13 GB, you can't train the model. CPUs don't have this limitation, but they are ridiculously inefficient for any training task. There are techniques like ZeRO to give pagefile-like state partitioning to GPU training, but that requires additional engineering.
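To make footnote [1] concrete, here's a back-of-envelope estimate in Python, using the standard accounting from the ZeRO paper (fp16 weights and gradients plus fp32 Adam optimizer state); the numbers are illustrative, not measurements of any particular model:

    # Rough VRAM needed for full mixed-precision training with Adam:
    # 2 bytes/param fp16 weights + 2 bytes/param fp16 gradients +
    # 12 bytes/param fp32 optimizer state (master weights + two moments).
    def training_vram_gb(params_billions: float) -> float:
        bytes_per_param = 2 + 2 + 12
        return params_billions * bytes_per_param  # 1B params * 1 byte = 1 GB

    print(training_vram_gb(7))  # ~112 GB for a 7B model -- far past a
                                # 12 GB card, before counting activations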
You can't if you have one 12 GB GPU. You can if you have a couple dozen of them, and then Petals-style training is possible. It's all very, very new and there are many unsolved hurdles, but I think it can be done.
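As a rough illustration of the idea (not Petals itself, which adds fault tolerance, quantization, and swarm routing on top), here's a naive PyTorch pipeline-parallel sketch that splits a model's layers across two hypothetical smaller devices so neither has to hold all the weights and gradients:

    import torch.nn as nn

    # Naive pipeline (model) parallelism; assumes two CUDA devices are
    # visible. Each device holds only its own stage's parameters.
    class TwoStageModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.stage1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
            self.stage2 = nn.Linear(4096, 4096).to("cuda:1")

        def forward(self, x):
            x = self.stage1(x.to("cuda:0"))     # first half runs on GPU 0
            return self.stage2(x.to("cuda:1"))  # activations hop to GPU 1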
There must be open source projects with enough money to pool into such a project. I wonder whether wikimedia or apache are considering anything.
It’s several things:
* Cutting-edge code, not overly concerned with optimization
* Code written by scientists, who aren’t known for being the world’s greatest programmers
* The obsession the research world has with using Python
Not surprising that there’s a lot of low-hanging fruit that can be optimized.
How so? Why couldn't we just start a gofundme/kickstarter to fund the training of an open-source model?
This is why I think the patent and copyright system is a failure: the idea that having laws protecting information like this would advance the progress of science.
It doesn't; just look at how much faster an illegally leaked model is advancing. The laws protecting IP merely give a moat to incumbents.
It's not infeasible; in fact, that's essentially how things were done before lots of improvements landed in the various libraries. Many corps still have poorly built pipelines that spend a lot of time in CPU land and not enough in GPU land.
Just an FYI as well: intermediate outputs of models are used in quite a bit of ML; you may see them in some form being used for hyperparameter optimization and search.
You could potentially crowdfund this, though I should point out that this was already tried and Kickstarter shut it down. The effort in question, "Unstable Diffusion", was kinda sketchy, promising a model specifically tuned for NSFW work. What you'd want is an organization that's responsible, knows how to use state of the art model architectures, and at least is willing to try and stop generative porn.
Which just so happens to be Stability AI. Except they're funded as a for-profit on venture capital, not as something you can donate to on Kickstarter or Patreon.
If they were to switch from investor subsidy to crowdfunding, however, I'm not entirely sure people would actually line up to bear the costs of training. To see why, we need to talk about motive. We can broadly subdivide the users of generative AI into a few categories:
- Companies, who view AI as a way to juice stock prices by promising a permanent capitalist revolution that will abolish the creative working class. They do not care about ownership; they care about balancing profit and loss. Inasmuch as they want AI models not controlled by OpenAI, it is a strategic play, not a moral one.
- Artists of varying degrees of competence who use generative AI to skip past creative busywork, such as assembling references, or to hack out something quickly. Inasmuch as they have critiques of how AI is owned, it is specifically that they do not want to be abolished by capitalists using their own labor as ground meat for the linear algebra data blender. So they are unlikely to crowdfund the thing they are angry is going to put them out of a job.
- No-hopers and other creatively bankrupt individuals who have been sold a promise that AI is going to fix their lack of talent by making talent obsolete. This is, of course, a lie[2]. They absolutely would prefer a model unencumbered by filters on cloud servers or morality clauses in licensing agreements, but they do not have the capital in aggregate to fund such an endeavor.
- Free Software types that hate OpenAI's about-face on open AI. Oddly enough, they also have the same hangups artists do, because much of FOSS is based on copyleft/Share-Alike clauses in the GPL, which things like GitHub Copilot are not equipped to handle. On the other hand, they probably would be OK with it if the model were trained on permissive sources and had some kind of regurgitation detector. Consider this one a wildcard.
- Evildoers. This could be people who want a cheaper version of GPT-4 that hasn't been Asimov'd by OpenAI so they can generate shittons of spam. Or people who want a Stable Diffusion model that's really good at making nonconsensual deepfake pornography so they can fuck with people's heads. This was the explicit demographic that "Unstable Diffusion" was trying to target. Problem is, cybercriminals tend to be fairly unsophisticated, because the people who actually know how to crime with impunity would rather make more money in legitimate business instead.
Out of five demographics I'm aware of, two have capital but no motive, two have motive but no capital, and one would have both - but they already have a sour taste in their mouth from the creep-tech vibes that AI gives off.
[0] In practice the only way that profit cap is being hit is if they upend the economy so much that it completely decimates all human labor, in which case they can just overthrow the government and start sending out Terminators to kill the working class[1].
[1] God damn it, why do all the best novel ideas have to come by when I'm halfway through another fucking rewrite of my current one?
[2] Getting generative AI to spit out good writing or art requires careful knowledge of the model's strengths and limitations. Like any good tool.
But a lot of people would rather only have govt or corp control of it...
The interface is designed to be easy to use (Python), and the bit that is actually doing the work is designed to be highly performant (C & CUDA, and it may even be running on a TPU).
Of course it would save them some money if they could run their models on cheaper hardware, but they've raised $11B so I don't think that's much of a concern right now. Better to spend the efforts on pushing the model forward, which some of these optimisations may make harder.
Yes. These laws are bad. We could fix this with a two-line change:
Section 1. Article I, Section 8, Clause 8 of this Constitution is hereby repealed.
Section 2. Congress shall make no law abridging the right of the people to publish information.
To fix this, you'd need to ban trade secrecy entirely. As in, if you have some kind of invention or creative work you must publish sufficient information to replicate it "in a timely manner". This would be one of those absolutely insane schemes that only a villain in an Ayn Rand book would come up with.
That'd be a 10,000-fold depreciation of an asset due to a preventable oversight. Ouchies.
The problem is: how in the world is ChatGPT so good compared to the average human being? The answer is that human beings (except for the 1%) have their left hands tied behind their backs because of copyright law.
You're completely correct that the speed-sensitive parts are written in lower-level libraries, but another way to phrase that is "Python can go really fast, as long as you don't use Python." But this also means ML is effectively hamstrung into only using methods that already exist and have been coded in C++, since anything in Python would be too slow to compete.
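A toy illustration of that phrasing: the same dot product computed in an interpreted Python loop versus one call into NumPy's C/BLAS backend. Exact timings will vary by machine; the gap is typically a couple of orders of magnitude.

    import time
    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    t0 = time.perf_counter()
    slow = sum(x * y for x, y in zip(a, b))  # interpreted: one Python object per element
    t1 = time.perf_counter()
    fast = a @ b                             # one call into optimized C/BLAS
    t2 = time.perf_counter()

    print(f"pure Python: {t1 - t0:.3f}s, NumPy: {t2 - t1:.5f}s")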
There are lots of languages that make good tradeoffs between performance and usability. Python is not one of them. It is, at best, only slightly harder to use than Julia, yet orders of magnitude slower.