And does that include Alpaca models like this? https://huggingface.co/elinas/alpaca-30b-lora-int4
If you want to run larger Alpaca models on a low VRAM GPU, try FlexGen. I think https://github.com/oobabooga/text-generation-webui/ is one of the easier ways to get that going.
Ok I answered my own question.
I would be very curious about some contrastive benchmarks between a quantized and non-quantized version of the same model.
tl;dr quantizing the 13B model gives up about 30% of the improvement you get from moving from 7B to 13B - so quantized 13B is still much better than unquantized 7B. Similar results for the larger models.
Perplexity - model options
5.5985 - 13B, q4_0
5.9565 - 7B, f16
6.3001 - 7B, q4_1
6.5949 - 7B, q4_0
6.5995 - 7B, q4_0, --memory_f16
According to this repo[1], the difference is about 3% in their implementation with the right group size. If you'd like to know more, I think you should read the GPTQ paper[2].
An anonymous HN user named L pledged $200k for llama-dl’s legal defense: https://twitter.com/theshawwn/status/1641804013791215619?s=6...
This may not seem like much vs Meta, but it’s enough to get the issue into the court system where it can be settled. The tweet chain has the details.
The takeaway for you is that you’ll soon be able to use LLaMA without worrying that Facebook will knock you offline for it. (I wouldn’t push your luck by trying to use it for commercial purposes though.)
Past discussion: https://news.ycombinator.com/item?id=35288415
I’d also like to take this opportunity to thank all of the researchers at MetaAI for their tremendous work. It’s because of them that we have access to such a wonderful model in the first place. They have no say over the legal side of things. One day we’ll all come together again, and this will just be a small speedbump in the rear view mirror.
EDIT: Please do me a favor and skip ahead to this comment: https://news.ycombinator.com/item?id=35393615
It's from jart, the author of the PR the submission points to. I really had no idea that this was a de facto Show HN, and it's terribly rude to post my comment in that context. I only meant to reassure everyone that they can freely hack on llama, not make a huge splash and detract from their moment on HN. (I feel awful about that; it's wonderful to be featured on HN, and no one should have to share their spotlight when it's a Show HN. Apologies.)
I’m grateful for the opportunity to help protect open source projects such as this one. It will at least give Huggingface a basis to resist DMCAs in the short term.
I dunno why I thought llama.cpp would support gpus. shrug
For the moment, I’m just happy to disarm corporations from using DMCAs against open source projects. The long term implications will be interesting.
In other words, the groups of folks working on training models don’t necessarily have access to the sort of optimization engineers that are working in other areas.
When all of this leaked into the open, it caused a lot of people knowledgeable in different areas to put their own expertise to the task. Some of those efforts (mmap) pay off spectacularly. Expect industry to copy the best of these improvements.
There are some benchmarks here: https://www.reddit.com/r/LocalLLaMA/comments/1248183/i_am_cu... and here: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...
Check out the original paper on quantization, which has some benchmarks: https://arxiv.org/pdf/2210.17323.pdf and this paper, which also has benchmarks and explains how they determined that 4-bit quantization is optimal compared to 3-bit: https://arxiv.org/pdf/2212.09720.pdf
I also think the discussion of that second paper here is interesting, though it doesn't have its own benchmarks: https://github.com/oobabooga/text-generation-webui/issues/17...
So it's important to note that all of these improvements are the kinds of things that are cheap to run on a pretrained model. And all of the recent developments involving large language models have been the product of hundreds of thousands of dollars in rented compute time. Once you start putting six figures into a pile of model weights, that becomes a capital cost that the business either needs to recoup or turn into a competitive advantage. So practically no one who scales up to that point releases model weights.
The model in question - LLaMA - isn't even a public model. It leaked and people copied[0] it. But because such a large model leaked, now people can actually work on iterative improvements again.
Unfortunately we don't really have a way for the FOSS community to pool together that much money to buy compute from cloud providers. Contributions-in-kind through distributed computing (e.g. a "GPT@home" project) would require significant changes to training methodology[1]. Further compounding this, the state-of-the-art is actually kind of a trade secret now. Exact training code isn't always available, and OpenAI has even gone so far as to refuse to say anything about GPT-4's architecture or training set to prevent open replication.
[0] I'm avoiding the use of the verb "stole" here, not just because I support filesharing, but because copyright law likely does not protect AI model weights alone.
[1] AI training has very high minimum requirements to get in the door. If your GPU has 12GB of VRAM and your model and gradients require 13GB, you can't train the model. CPUs don't have this limitation but they are ridiculously inefficient for any training task. There are techniques like ZeRO to give pagefile-like state partitioning to GPU training, but that requires additional engineering.
You can't if you have one 12GB GPU. You can if you have a couple dozen of them; then Petals-style training is possible. It is all very very new and there are many unsolved hurdles, but I think it can be done.
I played with Pi3141/alpaca-lora-7B-ggml two days ago and it was super disappointing. On a scale where 0% = alpaca-lora-7B-ggml and 100% = GPT-3.5, where would LLaMA 30B be positioned?
I've been able to compare 4 bit GPTQ, naive int8, LLM.int8, fp16, and fp32. LLM.int8 does impressively well but inference is 4-5x slower than native fp16.
Oddly, I recently ran a fork of the model on the ONNX runtime and I'm convinced the model performed better than pytorch/transformers. Perhaps subtle differences in floating point behavior between kernels on different hardware significantly influence performance.
The most promising next step in the quantization space IMO has to be fp8, there's a lot of hardware vendors adding support, and there's a lot of reasons to believe fp8 will outperform most current quantization schemes [1][2]. Particularly when combined with quantization aware training / fine tuning (I think OpenAI did something similar for GPT3.5 "turbo").
If anybody is interested I'm currently working on an open source fp8 emulation library for pytorch, hoping to build something equivalent to bitsandbytes. If you are interested in collaborating my email is in my profile.
1. https://arxiv.org/abs/2208.09225 2. https://arxiv.org/abs/2209.05433
>As it is their outputs are not copyrightable, it’s not a stretch to say models are public domain.
With all respect, this is kind of nonsensical. "Public domain" only applies to stuff that is copyrightable; if weights simply aren't, then it never enters the picture. And not being patentable or copyrightable doesn't mean there is any requirement to share them. If they do get out, though, that's mostly the company's own problem (depending on jurisdiction and contract, whoever did the leaking might get in trouble), and anyone else is free to figure the weights out on their own and share that, and the company can't do anything about it.
From my understanding of the issue, mmap'ing the file is showing that inference is only accessing a fraction of the weight data.
Doesn't the forward pass necessitate accessing all the weights and not a fraction of them?
My guess would be that the model is faulted into memory lazily page by page (4K or 16K chunks) as the model is used, so only the actual parts that are needed are loaded.
The kernel also removes old pages from the page cache to make room for new ones, and especially so if the computer is using a lot of its RAM. As with all performance things, this approach trades off inference speed for memory usage, but likely faster overall because you don't have to read the entire thing from disk at the start. Each input will take a different path through the model, and will require loading more of it.
The cool part is that this memory architecture should work just fine with hardware acceleration, too, as long as the computer has unified memory (anything with an integrated GPU). This approach likely won't be possible with dedicated GPUs/VRAM.
This approach _does_ still work to run a dense model with limited memory, but the time/memory savings would just be less. The GPU doesn't multiply every matrix in the file literally simultaneously, so the page cache doesn't need to contain the entire model at once.
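To make the page-fault behaviour concrete, here is a minimal sketch in Python using numpy.memmap (the file name is hypothetical; llama.cpp does the equivalent in C with mmap(), but the kernel-level behaviour is the same):

    import numpy as np

    # Mapping the file reads nothing from disk yet; it only reserves address space.
    weights = np.memmap("ggml-model-q4_0.bin", dtype=np.uint8, mode="r")

    # Touching a slice faults in only the pages that back it (4K/16K chunks).
    # Untouched regions of the file are never read, and the kernel can evict
    # clean pages at any time and re-read them from the file later.
    chunk = weights[0:1_000_000]
    print(int(chunk.sum()))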
Turnabout is fair play. I don't feel the least bit sorry for Meta.
As you may be aware, a counter-notice that meets the statutory requirements will result in reinstatement unless Meta sues over it. So the question isn't so much whether your counter-notice covers all the potential defenses as whether Meta is willing to sue.
The primary hurdle you're going to face is your argument that weights are not creative works, and not copyrightable. That argument is unlikely to succeed for the following reasons (just off the top of my head): (i) The act of selecting training data is more akin to an encyclopedia than the white pages example you used on Twitter, and encyclopedias are copyrightable as to the arrangement and specific descriptions of facts, even though the underlying facts are not; and (ii) LLaMA, GPT-N, Bard, etc, all have different weights, different numbers of parameters, different amounts of training data, and different tuning, which puts paid to the idea that there is only one way to express the underlying ideas, or that all of it is necessarily controlled by the specific math involved.
In addition, Meta has the financial wherewithal to crush you even were you legally on sound footing.
The upshot of all of this is that you may win for now if Meta doesn't want to file a rush lawsuit, but in the long run, you likely lose.
The only reason I posted it is because Facebook had been DMCAing a few repos, and I wanted to reassure everyone that they can hack freely without worry. That’s all.
I’m really sorry if I overshadowed your moment on HN, and I feel terrible about that. I’ll try to read the room a little better before posting from now on.
Please have a wonderful weekend, and thanks so much for your hard work on LLaMA!
EDIT: The mods have mercifully downweighted my comment, which is a relief. Thank you for speaking up about that, and sorry again.
If you'd like to discuss any of the topics you originally posted about, you had some great points.
Start time was also a huge issue with building anything usable, so I'm glad to see that being worked on. There's potential here, but I'm still waiting on more direct API/calling access. Context size is also a little bit of a problem. I think categorization is a potentially great use, but without additional alignment training and with the context size fairly low, I had trouble figuring out where I could make use of tagging/summarizing.
So in general, as it stands I had a lot of trouble figuring out what I could personally build with this that would be genuinely useful to run locally and where it wouldn't be preferable to build a separate tool that didn't use AI at all. But I'm very excited to see it continue to get optimized; I think locally running models are very important right now.
Seeing such clever use of mmap makes me dread to imagine how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero copy solutions.
Perhaps SWE is dead after all, but LLMs didn't kill it...
https://www.usatoday.com/story/tech/2022/09/22/facebook-meta...
[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/934034c8e...
[2] https://github.com/ggerganov/llama.cpp/tree/3525899277d2e2bd...
I have the M2 Air and can't wait for further optimisation with the Neural Engine / multicore GPU + shared RAM etc.
I find it absolutely mind boggling that GPT-3.5(4?) level quality may be within reach locally on my $1500 laptop / $800 m2 mini.
But OTOH, by preventing commercial use, they have sparked the creation of an open source ecosystem where people are building on top of it because it's fun, not because they want to build a moat to fill it with sweet VC $$$money.
It's great to see that ecosystem being built around it, and soon someone will train a fully open source model to replace Llama
I don't consider it ethical to compress the corpus of human knowledge into some NN weights and then closing those weights behind proprietary doors, and I hope that legislators will see this similarly.
My only worry is that they'll get you on some technicality, like that (some version of) your program used their servers afaik.
There must be open source projects with enough money to pool into such a project. I wonder whether wikimedia or apache are considering anything.
> LLaMA 30B appears to be a sparse model. While there's 20GB of weights, depending on your prompt I suppose only a small portion of that needs to be used at evaluation time [...]
Found the answer from the author of this amazing pull request: https://github.com/ggerganov/llama.cpp/discussions/638#discu...
This is one of the most common talking points I see brought up, especially when defending things like ai "learning" from the style of artists and then being able to replicate that style. On the surface we can say, oh it's similar to a human learning from an art style and replicating it. But that implies that the program is functioning like a human mind (as far as I know the jury is still out on that and I doubt we know exactly how a human mind actually "learns" (I'm not a neuroscientist)).
Let's say, for the sake of experiment, I ask you to cut out every word of Pride and Prejudice and keep them all sorted. Then, when asked to write a story in the style of Jane Austen, you pull from that pile of snipped-out words and arrange them in a pattern that most resembles her writing. Did you transform it? Sure, maybe; if a human did that I bet they could even copyright it. But I think that as a machine, it took those words and phrases and applied an algorithm to generating output; even with stochastic elements, the direct backwards traceability (albeit a 65B-parameter convolution of it) means that the essence of the copyrighted materials has been directly translated.
From what I can see we can't prove the human mind is strictly deterministic. But an ai very well might be in many senses. So the transference of non-deterministic material (the original) through a deterministic transform has to root back to the non-deterministic model (the human mind and therefore the original copyright holder).
It’s several things:
* Cutting-edge code, not overly concerned with optimization
* Code written by scientists, who aren’t known for being the world’s greatest programmers
* The obsession the research world has with using Python
Not surprising that there’s a lot of low-hanging fruit that can be optimized.
1. How does this compare with ChatGPT3
2. Does it mean we could eventually run a system such as ChatGPT3 on a computer
3. Could LLM eventually replace Google (in the sense that answers could be correct 99.9% of the time) or is the tech inherently flawed
Edit: looks like there's now confirmation that running it on a 10GB VM slows inference down massively, so looks like the only thing strange is the memory usage reading on some systems.
Did that metric meaningfully change when the amount of required memory dropped?
If the amount of diversity is lowered, I would expect that to lower the number of patterns to be modeled from the text. If that is the case, then the resulting model size itself would be lowered, during and after training.
Probably not all that much. All of the Python numeric computing frameworks (Numpy, PyTorch, TensorFlow, etc.) are basically just wrappers for lower level C++/C/Fortran code. Unless you’re doing something boneheaded and converting framework-native tensors to Python objects, passing tensors around within a framework essentially just passes a pointer around, which has marginal overhead even when encapsulated in a bloated Python object.
Indeed, a huge number of PyTorch operations are explicitly zero copy: https://pytorch.org/docs/stable/tensor_view.html
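A quick way to convince yourself of that zero-copy behaviour (a small PyTorch example, nothing specific to llama):

    import torch

    x = torch.arange(12)
    y = x.view(3, 4)          # a view: no data is copied
    y[0, 0] = 99              # writing through the view...
    print(x[0])               # ...shows up in the original tensor: tensor(99)
    print(x.data_ptr() == y.data_ptr())  # True: same underlying storage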
Why waste developer hours (meaning effort) if you can just scale the infra for a little cash? Do it in small enough increments and the increases only outweigh FTEs if you consider all scaling events and look at a long enough time scale.
Suddenly it takes way too much for way too little, but it took half as many overpaid developers who can't be arsed to care about performance.
Edit: in case that sounds like the opposite of intended, ggerganov and jart are the outliers, the exception.
Thank you for the amazing work. It’s so appreciated by so many on HN like me I’m sure.
This is totally the right way. Make it work, then make it right, then make it fast.
If the LLM is a specific arrangement of the copyrighted works, it's very clearly a derivative work of them
H_s := -\sum_{x \in X_s} p(x) \log p(x)
where X_s := all s-grams from the training set? That seems like it would eventually become hard to impossible to actually compute. Even if you could, what would it tell you? Or, wait... are you referring to running such an analysis on the output of the model? Yeah, that might prove interesting...
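For what it's worth, the empirical version of that quantity is cheap to estimate on a sample of text (a toy sketch; whether it tells you anything useful is the open question):

    from collections import Counter
    from math import log2

    def sgram_entropy(text: str, s: int) -> float:
        """Empirical Shannon entropy (in bits) of the s-gram distribution."""
        grams = [text[i:i + s] for i in range(len(text) - s + 1)]
        counts = Counter(grams)
        total = sum(counts.values())
        return -sum((c / total) * log2(c / total) for c in counts.values())

    print(sgram_entropy("the quick brown fox jumps over the lazy dog", 3))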
How is that possible? Is the model being compressed even more (even after converting to 4 bit) somehow? Or is most of the model unused?
How so? Why couldn't we just start a gofundme/kickstarter to fund the training of an open-source model?
"Oh, you know, like 12-18GB"
"Haha shut the fuck up, how much RAM did you shave off last week"
"12-18GB"
"Let me tell you what - you show me your commits right now, if you shaved off 12-18GB of RAM last week I quit my job right now and come work for you"
It also shows the number of impostors in this thread and the inflated titles of self-proclaimed 'seniors' who can't optimize ML code well enough to even be in the same league as Tunney (jart) and Gerganov (ggerganov).
Not even ChatGPT or Copilot could submit a change like this, or in fact completely rewrite and optimize this code the way they have.
I think that's just an accounting thing. Many UNIX variants will not "charge" read only memory mapped pages to a process, because they could be shared among many processes and evicted at will.
This is why I think the patent and copyright system is a failure: the idea that having laws protecting information like this would advance the progress of science.
It doesn't; look how an illegally leaked model sees much more progress in a shorter time. Laws protecting IP merely give a moat to incumbents.
> This formulation of this statement has been attributed to [KentBeck][0]; it has existed as part of the [UnixWay][1] for a long time.
If you try to use LLMs as a Google replacement you're going to run into problems pretty quick.
LLMs are better thought of as "calculators for words" - retrieval of facts is a by-product of how they are trained, but it's not their core competence at all.
LLaMA at 4bit on my laptop is around 3.9GB. There's no way you could compress all of human knowledge into less than 4GB of space. Even ChatGPT / GPT-4, though much bigger, couldn't possibly contain all of the information that you might want them to contain.
https://www.newyorker.com/tech/annals-of-technology/chatgpt-... "ChatGPT Is a Blurry JPEG of the Web" is a neat way of thinking about that.
But... it turns out you don't actually need a single LLM that contains all knowledge. What's much more interesting is a smaller LLM that has the ability to run tools - such as executing searches against larger indexes of data. That's what Bing and Google Bard do already, and it's a pattern we can implement ourselves pretty easily: https://til.simonwillison.net/llms/python-react-pattern
The thing that excites me is the idea of having a 4GB (or 8GB or 16GB even) model on my own computer that has enough capabilities that it can operate as a personal agent, running searches, executing calculations and generally doing really useful stuff despite not containing a great deal of detailed knowledge about the world at all.
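A toy sketch of that agent loop (the Action/Observation line format and the run_llm/search functions are placeholders, not any particular library's API):

    import re

    def run_llm(prompt: str) -> str:
        raise NotImplementedError("call your local model here (llama.cpp, etc.)")

    def search(query: str) -> str:
        raise NotImplementedError("query a larger external index here")

    TOOLS = {"search": search}

    def agent(question: str, max_steps: int = 5) -> str:
        transcript = f"Question: {question}\n"
        for _ in range(max_steps):
            reply = run_llm(transcript)
            transcript += reply + "\n"
            # Assumed convention: the model asks for a tool with "Action: search[query]"
            m = re.search(r"Action: (\w+)\[(.*)\]", reply)
            if m is None:
                return reply          # no tool call: treat the reply as the answer
            tool, arg = m.group(1), m.group(2)
            transcript += f"Observation: {TOOLS[tool](arg)}\n"
        return transcript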
Sounds like the big win is load time from the optimizations. Also, maybe llama.cpp now supports low-memory systems through mmap swapping? ... at the end of the day, 30B quantized is still 19GB...
The last part (3), that it rereads the whole file again, is an assumption; it could just be a coincidence that a new token is computed at every ~20GB read from disk. But it makes sense, as I do not think swapping would have been that inefficient.
It's not infeasible; in fact, that's how things were done before lots of improvements to the various libraries. In essence, many corps still have poorly built pipelines that spend a lot of time in CPU land and not enough in GPU land.
Just an FYI as well - intermediate outputs of models are used in quite a bit of ML, you may see them in some form being used for hyperparameter optimization and searching.
It doesn't have the easiest syntax or the best compiler support, and performance and threading are a joke. The entire language is based on hype from back when the only two mainstream languages were C++ and Java.
Someone in the GitHub comments had the same experience when using a 10GB VM to limit memory usage.
It appears the claims of memory reduction were premature. Perhaps an artifact of how memory usage is being reported by some tools.
You could potentially crowdfund this, though I should point out that this was already tried and Kickstarter shut it down. The effort in question, "Unstable Diffusion", was kinda sketchy, promising a model specifically tuned for NSFW work. What you'd want is an organization that's responsible, knows how to use state of the art model architectures, and at least is willing to try and stop generative porn.
Which just so happens to be Stability AI. Except they're funded as a for-profit on venture capital, not as a something you can donate to on Kickstarter or Patreon.
If they were to switch from investor subsidy to crowdfunding, however, I'm not entirely sure people would actually be lining up to bear the costs of training. To find out why we need to talk about motive. We can broadly subdivide the users of generative AI into a few categories:
- Companies, who view AI as a way to juice stock prices by promising a permanent capitalist revolution that will abolish the creative working class. They do not care about ownership; they care about balancing profit and loss. Inasmuch as they want AI models not controlled by OpenAI, it is a strategic play, not a moral one.
- Artists of varying degrees of competence who use generative AI to skip past creative busywork such as assembling references or to hack out something quickly. Inasmuch as they have critiques of how AI is owned, it is specifically that they do not want to be abolished by capitalists using their own labor as ground meat for the linear algebra data blender. So they are unlikely to crowdfund the thing they are angry is going to put them out of a job.
- No-hopers and other creatively bankrupt individuals who have been sold a promise that AI is going to fix their lack of talent by making talent obsolete. This is, of course, a lie[2]. They absolutely would prefer a model unencumbered by filters on cloud servers or morality clauses in licensing agreements, but they do not have the capital in aggregate to fund such an endeavor.
- Free Software types that hate OpenAI's about-face on open AI. Oddly enough, they also have the same hangups artists do, because much of FOSS is based on copyleft/Share-Alike clauses in the GPL, which things like GitHub Copilot are not equipped to handle. On the other hand, they probably would be OK with it if the model was trained on permissive sources and had some kind of regurgitation detector. Consider this one a wildcard.
- Evildoers. This could be people who want a cheaper version of GPT-4 that hasn't been Asimov'd by OpenAI so they can generate shittons of spam. Or people who want a Stable Diffusion model that's really good at making nonconsensual deepfake pornography so they can fuck with people's heads. This was the explicit demographic that "Unstable Diffusion" was trying to target. Problem is, cybercriminals tend to be fairly unsophisticated, because the people who actually know how to crime with impunity would rather make more money in legitimate business instead.
Out of five demographics I'm aware of, two have capital but no motive, two have motive but no capital, and one would have both - but they already have a sour taste in their mouth from the creep-tech vibes that AI gives off.
[0] In practice the only way that profit cap is being hit is if they upend the economy so much that it completely decimates all human labor, in which case they can just overthrow the government and start sending out Terminators to kill the working class[1].
[1] God damn it why do all the best novel ideas have to come by when I'm halfway through another fucking rewrite of my current one
[2] Getting generative AI to spit out good writing or art requires careful knowledge of the model's strengths and limitations. Like any good tool.
This leads to all sorts of bugs and breaking changes that are cool in an academic or hobbyist setting but a total headache on a large production system.
But a lot of people would rather only have govt or corp control of it...
A long story short, in the future the AI can just convert all our code to FORTH or HolyC or some "creative" combination of languages chosen by prophecy (read: hallucination) perhaps even Python — as a show of strength.
zero_buffer (CFFI, 2013) https://github.com/alex/zero_buffer/blob/master/zero_buffer....
"Buffers on the edge: Python and Rust" (2022) https://alexgaynor.net/2022/oct/23/buffers-on-the-edge/ :
> If you have a Python object and want to obtain its buffer, you can do so with memoryview in Python or PyObject_GetBuffer in C. If you’re defining a class and want to expose a buffer, you can do so in Python by… actually you can’t, only classes implemented in C can implement the buffer protocol. To implement the buffer protocol in C, you provide the bf_getbuffer and bf_releasebuffer functions which are called to obtain a buffer from an object and when that buffer is being released, respectively.
iocursor (CPython C API, ~Rust std::io::Cursor) https://github.com/althonos/iocursor
Arrow Python (C++) > On-Disk and Memory Mapped Files: https://arrow.apache.org/docs/python/memory.html#on-disk-and...
"Apache Arrow: Read DataFrame With Zero Memory" (2020) https://towardsdatascience.com/apache-arrow-read-dataframe-w...
pyarrow.Tensor: https://arrow.apache.org/docs/python/generated/pyarrow.Tenso...
ONNX is built on protocolbuffers/protobufs (google/protobufs), while Arrow is built on google/flatbuffers.
FlatBuffers https://en.wikipedia.org/wiki/FlatBuffers :
> It supports “zero-copy” deserialization, so that accessing the serialized data does not require first copying it into a separate part of memory. This makes accessing data in these formats much faster than data in formats requiring more extensive processing, such as JSON, CSV, and in many cases Protocol Buffers. Compared to other serialization formats however, the handling of FlatBuffers requires usually more code, and some operations are not possible (like some mutation operations).
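On the buffer-protocol point quoted above, the consumer side really is zero-copy in plain Python via memoryview:

    data = bytearray(b"header: " + bytes(range(16)))

    view = memoryview(data)      # exposes the bytearray's buffer, no copy
    chunk = view[8:12]           # slicing a memoryview copies nothing either
    chunk[0] = 0xFF              # writes go straight to the underlying bytearray
    print(data[8])               # 255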
Yes. When you have to try out dozens of research ideas, most of which won't pan out, you stop writing engineering-style code and switch to hacker mode. Why make it nice when you could be trying 2 more ideas in the meantime? Most research code is going to the trash anyway.
However, to address your point about derivative works directly, the consensus among copyright law experts appears to be that whether a particular model output is infringing depends on the standard copyright infringement analysis (and that's regardless of the minor and correctable issue represented by memorization/overfitting of duplicate data in training sets). Only in the most unserious legal complaint (the class action filed against Midjourney, Stability AI, etc.) is the argument being made that the models actually contain copies of the training data.
It's the easiest among the most popular languages. It uses the least amount of symbols; parentheses and braces are only for values, not for blocks.
Some people don't like the significant whitespace, but that helps readability.
Pull requests and stars on github? That might be a start.
https://madnight.github.io/githut/#/pull_requests/2022/4 https://madnight.github.io/githut/#/stars/2022/4
Though you may say, "but, but, all the private repos!" Then I challenge you to back up what you mean by relevance and show that Python was somehow more relevant 15+ years ago than it is now.
It's not like the zero-copy buzzword is going to help you during training: all your weights have to stay on the GPU, you are going to sample your training data randomly, and your data is on networked storage anyway, so mmap HURTS. You'd better just use O_DIRECT.
Similarly, as long as you run your inference on GPU it's not like you can mmap... And I have indeed worked on inference runtimes for mobile devices, and in the rare cases where we needed to use CPU only (hey, your phone has had a GPU forever) at $PREVIOUS_JOB we did have a mmap-able model format; it also helps in TEE/SGX/whatever enclave tech. Oh, and there is no Python at all.
The recent development of ggml is interesting because it catches a moment that "big ML shop infra" guys don't care about: running models on Apple Silicon. M1/M2s are expensive enough that we don't consider deploying them instead of those 1000000000 bizarre accelerators in production, yet everyone on HN seems to have one and hey, it's fast enough for LMs. They are rather unique in that they are CPU + high-bandwidth RAM + accelerators with RAM fully shared with the CPU, instead of some GPU shit.
tl;dr it's not like the "big ML shop infra" guys are stupid and leave performance on the table. They just don't run their production workload on MacBooks. That's where the community shines, right?
In fact I'd love to see that Transformer really dominates. We can then start to converge on software. And compute-wise transformers are really simple, too!
The interface is designed to be easy to use (python) and the bit that is actually doing the work is designed to be heavily performant (which is C & CUDA and may even be running on a TPU).
It doesn’t excel at anything, but anything software can do can be done in Python somehow.
So, a great pick when you’ve got no idea where you’re going, when you’re prototyping, when you don’t care about performance or perfection.
I agree that for large scale systems when you already know what you’re doing, Python shows its limits quite soon (and we should add the problems with missing/slow type checking that slows down large scale systems development).
- Can it run doom
- Inference of LLaMA model in pure C/C++
- Plain C/C++ implementation without dependencies
It really does not explain itself to the uninitiated. I infer it is some kind of language model.
Why/how it differs from any other impl/model, I do not know.
Huh? Why?
You can barely deploy it to Web.
it doesn't scale performance-wise
you can't build robust abstractions
The REPL is merely OK
You can barely ship working code without containers
the syntax is hard to manipulate programmatically
Python has inertia but it's holding us back
is there any evidence that this makes it easier?
people learn python as beginners because it has a reputation for being easy for beginners
I don't see anything about the syntax that makes it inherently easier
Of course it would save them some money if they could run their models on cheaper hardware, but they've raised $11B so I don't think that's much of a concern right now. Better to spend the efforts on pushing the model forward, which some of these optimisations may make harder.
It sounds unnecessarily weird to me that people would share Python code that simply doesn't work at all out of the box.
Compared to what? Unindented or badly indented code in other languages?
In other languages you can move code around and it still works - and nobody prevents you from adding whitespace for readability (it may even be done automatically for you).
If there was a superior alternative that covers the breadth of the Python ecosystem I’m pretty sure no one would have any scruples in using it. A programming language and its syntax is the least interesting or complex part when it comes to solving problems. Just rattling off some amazing libraries I've used over the last few years:
https://scikit-image.org - Image processing
https://imgaug.readthedocs.io - Image augmentation
https://scikit-learn.org/stable - ML
https://pymoo.org - Multi objective optimization
https://simpy.readthedocs.io/ - Discrete event simulation
https://lifelines.readthedocs.io - Survival analysis
https://bambinos.github.io/bambi - Bayesian modeling
https://unit8co.github.io/darts/ - Time series forecasting
https://abydos.readthedocs.io/en/latest/abydos.distance.html - Basically any string distance metric you can think of
The list just goes on and on.. oh yeah, some Deep Learning libraries too, which some people find useful.
Straight int8 quantization generally does not work for post training quantization of transformers. The distribution of weights includes a significant amount of outlier values that seem to be important to model performance. Apparently quantization aware training can improve things significantly but I haven't seen any developments for llama yet.
Interestingly on the 4 bit front, NVIDIA has chosen to remove int4 support from the next gen Hopper series. I'm not sure folks realize the industry has already moved on. FP8 feels like a bit of a hack, but I like it!
Having said that, I've deployed two large Django projects on the web with tons of customers and it runs and scales just fine, and it's a DREAM to maintain and develop for than for example Java.. I would go so far as to say the opposite, if you haven't used Python for web deployment you've been missing out! (you lose some efficiency I'm sure but you gain other things)
> mmap-ed memory pages backed by a file that aren't dirty aren't counted in an process's RSS usage, only kernel page cache. The mmap-ed regions of virtual memory does get counted in VSZ (virtual memory) but that is just virtual and can be larger than RAM+swap.
I liked the one way of doing most things philosophy, coming off working on a large C++ code base.
https://huggingface.co/Pi3141/alpaca-lora-30B-ggml/tree/main
If you ever come up with more hypothetical arguments in favor of NNs being copyrightable, please let me know. Or post them somewhere.
Though in practice, in many cases, mmap won't be faster, it can be even slower than open+read.
Just wanna say that mmap() is cleverly applied in this context, but it should be acknowledged as a widely accepted industry-standard practice for getting higher performance, particularly in embedded applications but also in performance-oriented apps such as digital audio workstations, video editing systems, and so on.
Tragedy of folks forgetting how to program.
This mmap() "trick" isn't a trick, its a standard practice for anyone who has cut their teeth on POSIX or embedded. See also mlock()/munlock() ..
The trope about it being the 2nd best language for everything isn't correct. It's taught in universities because it has a very short time to gratification, and the basic syntax is quite intuitive. Academics latched onto it for ML because of some excellent libraries, and it became established as a vital part of the ecosystem from there.
But it's a nightmare to support a moderate to large codebase in production, packaging continues to be a mess, and it's full of weird quirks. Great for weekend projects, but for pete's sake take a minute and port them into something more reliable before going to production with them.
Sure, but that is the thing, especially (as reflected in your examples) for machine learning. The best frameworks (PyTorch, TensorFlow, JAX) are all Python, with support for other languages being an afterthought at best.
The use of scripting languages (Python, Lua - original Torch) for ML seems to have started partly because the original users were non-developers, more from a math/stats background, and partly because an interactive REPL loop is good for a field like this that is very experimental/empirical.
Does it make sense that we're now building AGI using a scripting language? Not really, but that's where we are!
Another suggestion is that not all of the word/token embedding table might be used, which would be a function of the input used to test, but that would be easy enough to disprove as there would then be different memory usage for different inputs.
It seems possible the reported memory usage is lower than reality if that's how mmap/top work. In any case, a good use of mmap it seems, especially since for a multi-layer model layer weights will be used sequentially so paged load-on-demand will work relatively well even in a low memory situation.
Never understood why people think that indented languages are any simpler when in fact they bring all kinds of trouble for getting things done.
but to your point, until technology itself actually replaces us, deeply skilled computer people are always going to be able to squeeze more performance out of software implemented in high level languages by those who have not studied computers extensively.
A 32bit address space only means you have 4GibiAddresses, which do not need to be pointing to single bytes. In fact the natural thing to do in a 32 bit system for a structure like this is moving 32bit words, which actually means you're addressing a 16GB space, flat. And then there's segmentation.
For instance, the 286 had a 24-bit address bus, allowing for 16MB in direct addressing mode and 1GB via segmentation (what back then was usually referred to as virtual memory).
The 386 had a 32-bit address width and its MMU allowed access to 64TB in virtual mode and 4GB in protected mode. This was indeed one of the reasons Linux was not made 286-compatible: its protected mode was only 1GB and segmented rather than 4GB flat, so Linus didn't have to deal with XMS or EMS for a chip that was becoming obsolete soon anyway. But the 1GB space was there, and at the time that was plenty.
Python is more readable than C. Way better than C++. Far simpler to reason about than Java. Maybe Typescript is on a similar level, but throwing a beginner into the JS ecosystem can be daunting. Perhaps Ruby could be argued as equally simple, but it feels like that's a dead end language these days. Golang is great, but probably not as easy to get rolling with as Python.
What else? Are you going to recommend some niche language no one hires for?
Those are the outputs of convert-pth-to-ggml.py and quantize, respectively.
I had to cancel 30B as I needed to use the computer after some 12 hours; now I have to fix the ext4 filesystem of the drive where I was doing it. Fun times for the weekend.
Guess I'll settle for 13B. I was using 7B but the results are pretty lousy compared to GPT4All's LoRA, let alone GPT-3.5-turbo or better.
I'll give quantising 13B a shot; I'm on 16GB of RAM locally.
The square brackets alone make it a winner. Array, list and string indexing. Dictionary lookups. Slices and substrings. List comprehensions. The notational convenience of this alone is immense.
Built-in lists, strings, and dicts. For the 90% of code that is not performance critical, this is a godsend. Just looking at the C++ syntax for this makes me never want to use an STL data structure for anything trivial.
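For example, most of that day-to-day notation fits in a few lines:

    words = ["alpha", "beta", "gamma", "delta"]

    first, last = words[0], words[-1]              # indexing
    middle = words[1:3]                            # slicing
    lengths = {w: len(w) for w in words}           # dict comprehension
    short = [w for w in words if lengths[w] <= 4]  # list comprehension + dict lookup
    print(first, last, middle, short)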
What people sometimes fail to understand is that code is a means to an end, not an end in itself.
If you want to make code for its own sake, work on an open-source and/or personal project. If you are paid to work on something, you're paid for that something to get out, not for it to feature the best code ever.
Not sure how neutral the following link is or what benchmarks it uses, but T5 seems to sit a lot higher on this leaderboard?
The peculiarity here is that tools like htop were reporting an 8x improvement, which is interesting, because RAM use is only 2x better due to my change. The rusage.com page fault reporting was also interesting. This is not due to sparseness; it's because htop was subtracting MAP_SHARED memory. The htop docs on my computer say that purple is used to display shared memory and yellow is used to display kernel file caches, but it turned out it just uses yellow for both, even though it shouldn't, because mincore() reported that the shared memory had been loaded into the resident set size.
Yes. These laws are bad. We could fix this with a 2 line change:
Section 1. Article I, Section 8, Clause 8 of this Constitution is hereby repealed.
Section 2. Congress shall make no law abridging the right of the people to publish information.
Or hiring useless business people to install around the periphery of engineering. Which is funny because now tech is letting all those folks go.
Ocaml is very niche; I feel it's a hard sell for a general purpose language. Haskell, 3x that.
JS and TS, could be. But are they so much better than Python, if better at all?
If you just write straight c++ (without c++xx, or anything like it) you can compile the code on machines from decades ago if you want.
In the end they are mathematical models, so what would prevent someone from loading T5 onto a machine with plenty of RAM (like a server)? Would the codebase truly require that much refactoring? How difficult would it be to rewrite the model architecture as a set of mathematical equations (Einstein summation) and reimplement inference for CPU?
Anyway, T5 being available for download from Huggingface only makes my question more pertinent...
To fix this, you'd need to ban trade secrecy entirely. As in, if you have some kind of invention or creative work you must publish sufficient information to replicate it "in a timely manner". This would be one of those absolutely insane schemes that only a villain in an Ayn Rand book would come up with.
What's happened to the popularity of all of these languages since 2010? Outside of JS/TS, absolutely nothing. If anything, they've lost mindshare.
You could run notebooks entirely client side https://jupyterlite.readthedocs.io/en/latest/
The startup is slow but otherwise it is pretty functional.
https://www.youtube.com/watch?v=coIj2CU5LMU
Would this version (ggerganov) work with one of those methods?
.NET has P/Invoke which is much nicer.
JVM is getting Panama+jextract, which is the nicest yet. You can go straight from header files to pure Java bindings which don't need any extra native code at all. But it's not shipped yet :(
It looks to me that if I were planning on building a new machine capable of LLM inference, it would be possible using commodity gamer components; and if lazy weight loading is viable, then such a machine with multiple PCIe 5 NVMe drives in RAID 0 could potentially almost reach memory bandwidth.
Next on my list to investigate is inference with GPUs: could multiple smaller GPUs somehow be used with a technique similar to the OP post?
Strong disagreement. Explicit types make reasoning about Java much easier, especially when you are in an unfamiliar codebase.
Python is not quite the 'write-only' language of Perl, but it is a lot easier to write it than it is to read it.
This isn't just a matter of making the 30B model run in 6GB or whatever. You can now run the largest model, without heavy quantization, and let the OS figure it out. It won't be as fast as having "enough" memory, but it will run.
In theory you could always have done this with swap, but swap is even slower because evictions have to be written back to swap (and wear out your SSD if your swap isn't on glacially slow spinning rust) instead of just discarded because the OS knows where to read it back from the filesystem.
This should also make it much more efficient to run multiple instances at once because they can share the block cache.
(I wonder if anybody has done this with Stable Diffusion etc.)
Even if it doesn't have the best syntax now (which I doubt), the tooling and libraries make it a better choice over any language that has an edge over Python's syntax.
You can easily build a standalone binary (well, it would be GiB+ if you use CUDA... but that's the cost of statically linking cu*), had you coded your model and training loop in C++.
It then happily runs everywhere as long as an NVIDIA GPU driver is available (no need to install CUDA).
Protip: Your AI research team REALLY DON'T WANT TO DO THIS BECAUSE THEY LOVE PYTHON. Having Python, even with the dependency management shit, is a feature, not a bug.
(If you want Rust / Go and don't want to wrap libtorch/tf then you have a lot of work to do, but yeah, it's possible. Also, there are the model compiler folks [1], where the promise is model.py in, model.o out; you just link it with your code.)
[1] https://mlc.ai
You’d do better complaining about the still-nascent (compared to alternatives) async support, or the lack of a JIT in the official implementation.
The forced use of spacing to delineate blocks means you will never see a bunch of brackets eating up screen space and the common error where someone adds another line to an if statement but doesn't add braces.
Semicolons not being conventional means less screen noise and less code golf 1 liners.
The focus on imperative vs functional means you rarely ever see something like a(b(c(d(e(f(g))))))
PHP suffers greatly from poorly named standard functions on top of all of that.
Don't get me started on Ruby metaprogramming.
These are just the things I could think of off the top of my head. I do not want to spend my afternoon on this. This is just my experience looking at code for over 20 years; you either believe it or you don't. There are no scientific studies to prove that one syntax feature is superior.
I highly doubt that everyone chose Python just because Google did. Python was a giant step forward in syntax compared to the competition back then, and even if there is a new language out there right now with better syntax, it's not going to be better by much, and it is not going to have the tooling, libraries, or the community.
That'd be a 10,000 fold depreciation of an asset due to a preventable oversight. Ouchies.
This isn't really helpful for people who want open AI though, because if your strategy is to deny OpenAI data and knowledge then you aren't going to release any models either.
> Llama.cpp 30B
> LLaMA-65B
the "number B" stands for "number of billions" of parameters... trained on?
like you take 65 billion words (from paragraphs / sentences from like, Wikipedia pages or whatever) and "train" the LLM. is that the metric?
why aren't "more parameters" (higher B) always better? aka return better results
how many "B" parameters is ChatGPT on GPT3.5 vs GPT4?
GPT3: 175b
GPT3.5: ?
GPT4: ?
https://blog.accubits.com/gpt-3-vs-gpt-3-5-whats-new-in-open...
how is LLaMA with 13B parameters able to compete with GPT-3 with 175B parameters? That's more than 10x fewer. How much RAM does it take to run "a single node" of GPT-3 / GPT-3.5 / GPT-4?
Most people don't have the hardware or budget to access these specialized high vram GPUs.
The problem is how in the world is ChatGPT so good compared to the average human being? The answer is that human beings (except for the 1%), have their left hands tied behind their back because of copyright law.
I did the following:
1. Create a new working directory.
2. git clone https://github.com/ggerganov/llama.cpp
3. Download the latest release from https://github.com/ggerganov/llama.cpp/releases (note the CPU requirements in the filename) and unzip directly into the working directory's llama.cpp/ - you'll have the .exe files and .py scripts in the same directory.
4. Open PowerShell, cd to the working directory/llama.cpp, and create a new Python virtual environment: python3 -m venv env and activate the environment: .\env\Scripts\Activate.ps1
5. Obtain the LLaMA model(s) via the magnet torrent link and place them in the models directory. I used 30B and it is slow, but usable, on my system. Not even ChatGPT 3 level especially for programming questions, but impressive.
6. python3 -m pip install torch numpy sentencepiece
7. python3 convert-pth-to-ggml.py models/30B/ 1 (you may delete the original .pth model files after this step to save disk space)
8. .\quantize.exe ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q4_0.bin 2
9. I copied the examples/chat-13B.bat to a new chat-30B.bat file, updated the model directory, and changed the last line of the script to: .\main.exe
10. Run using: .\examples\chat-30B.bat
https://github.com/ggerganov/llama.cpp#usage has details, although it assumes 7B and skips a few of the above steps.
does it happen to run on CPU on a server with 96GB RAM?
Suppose you have a big piece of compute hardware (e.g. at a university) which is shared by multiple users. They all want to come in and play with these models. Each one is tens to hundreds of gigabytes. Is each user supposed to have their own copy in their home directory?
But now we encounter this broken nonsense because solved problems get unsolved by bad software.
Maybe, not sure? My point was that both the syntax and Google using it were more relevant 15 years ago than now.
(I don't have much of an opinion on the 15+ years ago thing.)
If I create a website for tracking real estate trends in my area — which is public information — should I not be able to sell that information?
Similarly if a consulting company analyzes public market macro trends are they not allowed to sell that information?
Just because the information which is being aggregated and organized is public does not necessarily mean that the output product should be in the public.
Python concrete syntax is harder to manipulate programmatically compared to Javascript concrete syntax.
For instance, to insert one statement into another block, we need to traverse the lines of that syntax and add the right amount of indentation. We can't just plant the syntax into the desired spot and be done with it.
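A small sketch of what that means in practice (hypothetical snippet names; splicing a statement into a Python block means re-indenting it first):

    import textwrap

    block = "if ready:\n    launch()\n"
    payload = "log('about to launch')\ncheck_fuel()"

    # Unlike brace-delimited syntax, the payload can't just be dropped in as text:
    # it has to be re-indented to the depth of the block it lands in.
    lines = block.splitlines(keepends=True)
    patched = lines[0] + textwrap.indent(payload, "    ") + "\n" + "".join(lines[1:])
    print(patched)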
Is python syntax worse than any brand new languages like rust or go? Absolutely not. It's still better.
Did Google stop using it? I don't think so, but I also don't think people picked it just because Google did.
Python is not the cause of dependency hell. Deep dependency trees are. The only way to deal with this is to use separate environments and to carefully specify the exact requirements.
Those who claim some language would be a magical fix clearly lack experience in multiple languages.
No, it's just the size of the network (i.e. number of learnable parameters). The 13/30/65B models were each trained on ~1.4 trillion tokens of training data (each token is around half a word).
Repeat, ad infinitum. In the cracks you'll find people re-learning things they should've known, if only they weren't slagging off the grey beards .. or, even worse .. as grey beards not paying attention to the discoveries of youth.
>Most people don't know about it. The ones who do, are reluctant to use it.
Not so sure about this. The reluctance is emotional, not technical. Nobody is killing POSIX under all of this - it is deployed. Therefore, learn it.
>so any thought devs may have had previously about using mmap() would have certainly triggered fears w.r.t. WIN32
Does not compute. Own up, you're an AI.
Because the text we write is not evenly distributed random noise, what we encode into it (by writing) is entropy.
Because LLMs model text with inference, they model all of the entropy that is present.
That would mean that the resulting size would be a measure of entropy (sum of patterns) divided by repetition (recurring patterns). In this count, I would consider each unique token alone an instance of the identity pattern.
So to answer both questions: yes.
The impression about Haskell’s nicheness compared with OCaml prevails. But Haskell has a larger userbase and a larger library ecosystem than OCaml.
Btw, I wish they would take some inspiration from Haskell's syntax.
Haskell also has significant whitespace, but it's defined as syntactic sugar for a more traditional syntax with curly braces and semicolons.
Approximately no-one uses that curly-brace syntax, but it's good for two things:
- silences the naysayers
- more importantly: allows you to copy-paste code even into forms that mess up your indentation.
Assuming you have the model file downloaded (you can use wget to download it) these are the instructions to install and run:
pkg install git
pkg install cmake
pkg install build-essential
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j
./main
I've been very _careful_ too (using pyenv/virtualenvs etc) for dependency management, but with Nvidia driver dependencies and "missing sqlite3/bz2" issues related to the underlying interpreter (not to mention issues with different Python 3.x versions), I'm lucky to be able to even run a 'hello world' ML sample after an afternoon of fighting with it.
My Ubuntu install w/ Nvidia card only seems to recognize the GPU in some circumstances even when using the same `conda` env. Often this is remedied by rebooting the machine(?).
No idea how companies manage this stuff in production. Absolute minefield that seems to catastrophically break if you sneeze at it.
I'll admit I am not an expert in managing ML envs, but I've dealt with a lot of python environments for typical CRUD stuff, and while rough at times, it was never this bad.
At the time I evaluated other languages to learn, narrowed it down to Ruby and Python, and picked Python as I felt it had a nicer syntax than Ruby. And the "one way to do things" philosophy. This was back in 2005 or so.
What other languages of that period would you say had a nicer syntax than Python?
16 cores would be about 4x faster than the default 4 cores. Eventually you hit memory bottlenecks, so 32 cores is not twice as fast as 16 cores, unfortunately.
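A toy way to picture that ceiling (the numbers are made up; only the shape of the curve matters):

    tokens_per_s_per_core = 1.0      # assumed per-core compute rate
    memory_bound_tokens_per_s = 12   # assumed ceiling imposed by RAM bandwidth

    for cores in (4, 8, 16, 32):
        rate = min(cores * tokens_per_s_per_core, memory_bound_tokens_per_s)
        print(f"{cores:>2} cores -> ~{rate:.0f} tokens/s")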
It's like Python 2 vs Python 3 except even worse.
You're completely correct that the speed-sensitive parts are written in lower-level libraries, but another way to phrase that is "Python can go really fast, as long as you don't use Python." But this also means ML is effectively hamstrung into only using methods that already exist and have been coded in C++, since anything in Python would be too slow to compete.
There's lots of languages that make good tradeoffs between performance and usability. Python is not one of those languages. It is, at best, only slightly harder to use than Julia, yet orders-of-magnitude slower.
Similar stuff is being researched under the "langchains" term.
In other words, with enough data interleaving between enough NVME SSDs, you should have SSD throughput of the same order of magnitude as the system RAM.
The weights are static, so it’s just reads.
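Back-of-the-envelope numbers (all assumed, not measured) for why that seems plausible:

    per_drive_read_gbps = 12      # rough sequential read for a PCIe 5.0 x4 NVMe drive
    drives_in_raid0 = 4
    ddr5_dual_channel_gbps = 80   # rough dual-channel DDR5 bandwidth

    array_gbps = per_drive_read_gbps * drives_in_raid0
    print(f"RAID 0 reads: ~{array_gbps} GB/s vs system RAM: ~{ddr5_dual_channel_gbps} GB/s")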