My guess would be that the model is faulted into memory lazily, page by page (4K or 16K chunks), as the model is used, so only the parts that are actually needed get loaded.
The kernel also evicts old pages from the page cache to make room for new ones, especially when the machine is using most of its RAM. As with all performance things, this approach trades inference speed for memory usage, but it is likely faster overall because you don't have to read the entire file from disk at the start. Each input takes a different path through the model and will require loading more of it.
The cool part is that this memory architecture should work just fine with hardware acceleration, too, as long as the computer has unified memory (anything with an integrated GPU). This approach likely won't be possible with dedicated GPUs/VRAM.
This approach _does_ still work to run a dense model with limited memory, but the time/memory savings would just be less. The GPU doesn't multiply every matrix in the file literally simultaneously, so the page cache doesn't need to contain the entire model at once.
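That lazy faulting is easy to see for yourself. Here's a minimal Python sketch of the idea (my own toy example, not llama.cpp's actual code; the filename is a placeholder) that maps a weights file read-only and touches only a slice of it, so the kernel pages in just what gets accessed:

    # Toy sketch of mmap-based lazy loading (not llama.cpp's actual code).
    # "model.bin" is a placeholder for whatever weights file you have.
    import mmap
    import os

    fd = os.open("model.bin", os.O_RDONLY)
    size = os.fstat(fd).st_size

    # Read-only, file-backed mapping: the kernel can evict these pages at any
    # time and re-read them from the file, so nothing ever goes to swap.
    weights = mmap.mmap(fd, size, prot=mmap.PROT_READ, flags=mmap.MAP_PRIVATE)

    # No bulk read has happened yet. Touching a slice faults in only the pages
    # (4K/16K chunks) that back those bytes, plus some kernel readahead.
    first_tensor_bytes = weights[0:1 << 20]

    # Optional hint if access really is scattered (available on most Unixes).
    if hasattr(mmap, "MADV_RANDOM"):
        weights.madvise(mmap.MADV_RANDOM)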
The only reason I posted it is because Facebook had been DMCAing a few repos, and I wanted to reassure everyone that they can hack freely without worry. That’s all.
I’m really sorry if I overshadowed your moment on HN, and I feel terrible about that. I’ll try to read the room a little better before posting from now on.
Please have a wonderful weekend, and thanks so much for your hard work on LLaMA!
EDIT: The mods have mercifully downweighted my comment, which is a relief. Thank you for speaking up about that, and sorry again.
If you'd like to discuss any of the topics you originally posted about, you had some great points.
Seeing such clever use of mmap makes me dread to imagine how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero copy solutions.
Perhaps SWE is dead after all, but LLMs didn't kill it...
Edit: looks like there's now confirmation that running it on a 10GB VM slows inference down massively, so it looks like the only strange thing is the memory usage reading on some systems.
Did that metric meaningfully change when the amount of required memory dropped?
If the amount of diversity is lowered, I would expect that to lower the number of patterns to be modeled from the text. If that is the case, then the resulting model size itself would be lower, during and after training.
30B quantized requires 19.5 GB, not 6 GB; otherwise you get severe swapping to disk.
  model   original size   quantized size (4-bit)
  7B      13 GB           3.9 GB
  13B     24 GB           7.8 GB
  30B     60 GB           19.5 GB
  65B     120 GB          38.5 GB
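As a rough sanity check on those quantized sizes (my own back-of-the-envelope numbers, assuming roughly 5 bits per weight for 4-bit values plus per-block scales, and approximate parameter counts):

    # Rough back-of-the-envelope check of the 4-bit sizes above.
    # Assumes ~5 bits/weight (4-bit values + per-block scale factors);
    # parameter counts are approximate.
    params = {"7B": 6.7e9, "13B": 13.0e9, "30B": 32.5e9, "65B": 65.2e9}
    for name, n in params.items():
        gb = n * 5 / 8 / 1e9
        print(f"{name}: ~{gb:.1f} GB")  # e.g. 30B -> ~20 GB, close to 19.5 GB above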
Probably not all that much. All of the Python numeric computing frameworks (Numpy, PyTorch, TensorFlow, etc.) are basically just wrappers for lower level C++/C/Fortran code. Unless you’re doing something boneheaded and converting framework-native tensors to Python objects, passing tensors around within a framework essentially just passes a pointer around, which has marginal overhead even when encapsulated in a bloated Python object.
Indeed, a huge number of PyTorch operations are explicitly zero copy: https://pytorch.org/docs/stable/tensor_view.html
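For example (a tiny self-contained illustration, not taken from the linked docs), reshapes, transposes, and basic slices in PyTorch are views that share storage with the original tensor:

    # These operations return views; no tensor data is copied.
    import torch

    x = torch.arange(16, dtype=torch.float32)
    a = x.view(4, 4)   # reshape -> view
    b = a.t()          # transpose -> view
    c = a[:, 1:3]      # basic slicing -> view

    # Same underlying storage, so the pointer to the first element matches.
    assert a.data_ptr() == x.data_ptr()

    # Mutating through one view is visible through the others.
    a[0, 0] = 42.0
    assert x[0] == 42.0 and b[0, 0] == 42.0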
Why waste developer hours (meaning effort) if you can just scale the infra for a little cash? Do it in small enough increments and the cost increases only outweigh FTEs if you add up all the scaling events over a long enough time scale.
Suddenly it takes way too much for way too little, but it costs half as many overpaid developers who can’t be arsed to care about performance.
Edit: in case that sounds like the opposite of what I intended: ggerganov and jart are the outliers, the exception.
Thank you for the amazing work. It’s so appreciated by so many on HN like me, I’m sure.
H_s(X) := -\sum_{x \in X_s} p(x) \log p(x)
where X_s := the set of all s-grams from the training set? That seems like it would eventually become hard to impossible to actually compute. Even if you could, what would it tell you? Or, wait... are you referring to running such an analysis on the output of the model? Yeah, that might prove interesting...
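For what it's worth, here's a toy sketch (mine, with a made-up corpus) of what that computation looks like; the reason it becomes intractable is that the table of distinct s-grams explodes combinatorially as s grows:

    # Toy s-gram entropy: H_s = -sum over distinct s-grams of p(x) * log2 p(x).
    import math
    from collections import Counter

    def sgram_entropy(tokens, s):
        grams = [tuple(tokens[i:i + s]) for i in range(len(tokens) - s + 1)]
        counts = Counter(grams)
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    tokens = "the cat sat on the mat and the cat ran".split()
    for s in (1, 2, 3):
        print(s, round(sgram_entropy(tokens, s), 3))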
How is that possible? Is the model being compressed even more (even after converting to 4 bit) somehow? Or is most of the model unused?
It also shows the number of impostors in this thread and the inflated titles of self-proclaimed 'seniors' who can't optimize ML code well enough to even be in the same league as Tunney (jart) and Gerganov (ggerganov).
Not even ChatGPT or Copilot could submit a change like this, let alone completely rewrite and optimize the code the way they have.
I think that's just an accounting thing. Many UNIX variants will not "charge" read only memory mapped pages to a process, because they could be shared among many processes and evicted at will.
It doesn't have the easiest syntax or the best compiler support, and performance and threading are a joke. The entire language is riding on hype from the time when the only two mainstream languages were C++ and Java.
This leads to all sorts of bugs and breaking changes that are cool in an academic or hobbyist setting but a total headache on a large production system.
Long story short, in the future the AI can just convert all our code to FORTH or HolyC or some "creative" combination of languages chosen by prophecy (read: hallucination), perhaps even Python, as a show of strength.
zero_buffer (CFFI, 2013) https://github.com/alex/zero_buffer/blob/master/zero_buffer....
"Buffers on the edge: Python and Rust" (2022) https://alexgaynor.net/2022/oct/23/buffers-on-the-edge/ :
> If you have a Python object and want to obtain its buffer, you can do so with memoryview in Python or PyObject_GetBuffer in C. If you’re defining a class and want to expose a buffer, you can do so in Python by… actually you can’t, only classes implemented in C can implement the buffer protocol. To implement the buffer protocol in C, you provide the bf_getbuffer and bf_releasebuffer functions which are called to obtain a buffer from an object and when that buffer is being released, respectively.
iocursor (CPython C API, ~Rust std::io::Cursor) https://github.com/althonos/iocursor
Arrow Python (C++) > On disk and MemoryMappedFile s: https://arrow.apache.org/docs/python/memory.html#on-disk-and...
"Apache Arrow: Read DataFrame With Zero Memory" (2020) https://towardsdatascience.com/apache-arrow-read-dataframe-w...
pyarrow.Tensor: https://arrow.apache.org/docs/python/generated/pyarrow.Tenso...
ONNX is built on protocolbuffers/protobufs (google/protobufs), while Arrow is built on google/flatbuffers.
FlatBuffers https://en.wikipedia.org/wiki/FlatBuffers :
> It supports “zero-copy” deserialization, so that accessing the serialized data does not require first copying it into a separate part of memory. This makes accessing data in these formats much faster than data in formats requiring more extensive processing, such as JSON, CSV, and in many cases Protocol Buffers. Compared to other serialization formats however, the handling of FlatBuffers requires usually more code, and some operations are not possible (like some mutation operations).
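Tying the buffer-protocol links above together, here's a tiny sketch (my own) of the consumer side in pure Python, where memoryview gives zero-copy access to anything that exposes a buffer (bytes, bytearray, array.array, NumPy arrays, mmap objects, ...):

    # memoryview: zero-copy access to an object exposing the buffer protocol.
    buf = bytearray(b"hello world")

    view = memoryview(buf)   # no copy
    sub = view[6:]           # slicing a memoryview copies nothing either

    sub[0:5] = b"WORLD"      # writes go straight through to the original buffer
    assert buf == bytearray(b"hello WORLD")

    # A copy only happens when you explicitly ask for one, e.g. bytes(sub).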
It's the easiest among the most popular languages. It uses the fewest symbols: parentheses and braces only for values.
Some people don't like the significant whitespace, but it helps readability.
Pull requests and stars on GitHub? That might be a start.
https://madnight.github.io/githut/#/pull_requests/2022/4 https://madnight.github.io/githut/#/stars/2022/4
Though you may say, "but what about all the private repos!" Then I challenge you to back up what you mean by relevance and prove that Python was somehow more relevant 15+ years ago than it is now.
It's not like the zero copy buzzword is going to help you during training, all your weights have to stay on GPU, you are going to sample your training data randomly and your data is on a networked storage anyway, so mmap HURTS. You'd better just O_DIRECT.
Similarly, as long as you run your inference on GPU it's not like you can mmap... And I have indeed worked on inference runtimes for mobile devices, and in the rare cases where we needed to use CPU only (hey, your phone has had a GPU forever), at $PREVIOUS_JOB we did have an mmap-able model format; it also helps in TEE/SGX/whatever enclave tech. Oh, and there is no Python at all.
The recent development of ggml is interesting in that it catches a moment the "big ML shop infra" guys don't care about: running models on Apple Silicon. M1/M2s are expensive enough that we don't consider deploying them in production instead of those 1000000000 bizarre accelerators, yet everyone on HN seems to have one, and hey, it's fast enough for LMs. They are rather unique: CPU + high-bandwidth RAM + accelerators that fully share that RAM with the CPU, instead of some GPU shit.
tl;dr it's not like the "big ML shop infra" guys are stupid and leave performance on the table. They just don't run their production workloads on MacBooks. That's where the community shines, right?
In fact I'd love to see Transformers really dominate. We can then start to converge on software. And compute-wise, transformers are really simple, too!
It doesn’t excel at anything, but anything software can do can be done in Python somehow.
So it’s a great pick when you’ve got no idea where you’re going, when you’re prototyping, or when you don’t care about performance or perfection.
I agree that for large scale systems when you already know what you’re doing, Python shows its limits quite soon (and we should add the problems with missing/slow type checking that slows down large scale systems development).
Huh? Why?
- You can barely deploy it to the web.
- It doesn't scale performance-wise.
- You can't build robust abstractions.
- The REPL is merely OK.
- You can barely ship working code without containers.
- The syntax is hard to manipulate programmatically.
Python has inertia, but it's holding us back.
Is there any evidence that this makes it easier?
People learn Python as beginners because it has a reputation for being easy for beginners. I don't see anything about the syntax that makes it inherently easier.
It sounds unnecessarily weird to me that people would share Python code that simply doesn't work at all out of the box.
Compared to what? Unindented or badly indented code in other languages?
In other languages you can move code around and it still works - and nobody prevents you from adding whitespace for readability (it may even be done automatically for you).
If there was a superior alternative that covers the breadth of the Python ecosystem I’m pretty sure no one would have any scruples in using it. A programming language and its syntax is the least interesting or complex part when it comes to solving problems. Just rattling off some amazing libraries I've used over the last few years:
https://scikit-image.org - Image processing
https://imgaug.readthedocs.io - Image augmentation
https://scikit-learn.org/stable - ML
https://pymoo.org - Multi objective optimization
https://simpy.readthedocs.io/ - Discrete event simulation
https://lifelines.readthedocs.io - Survival analysis
https://bambinos.github.io/bambi - Bayesian modeling
https://unit8co.github.io/darts/ - Time series forecasting
https://abydos.readthedocs.io/en/latest/abydos.distance.html - Basically any string distance metric you can think of
The list just goes on and on.. oh yeah, some Deep Learning libraries too, which some people find useful.
Having said that, I've deployed two large Django projects on the web with tons of customers, and they run and scale just fine; they're a DREAM to maintain and develop compared to, for example, Java. I would go so far as to say the opposite: if you haven't used Python for web deployment you've been missing out! (You lose some efficiency I'm sure, but you gain other things.)
I liked the one way of doing most things philosophy, coming off working on a large C++ code base.
Though in practice, in many cases, mmap won't be faster; it can even be slower than open+read.
Just wanna say that mmap() is used cleverly in this context, but it should be acknowledged as a widely accepted, industry-standard practice for getting higher performance, particularly in embedded applications but also in performance-oriented apps such as digital audio workstations, video editing systems, and so on.
Tragedy of folks forgetting how to program.
This mmap() "trick" isn't a trick, it's standard practice for anyone who has cut their teeth on POSIX or embedded work. See also mlock()/munlock()...
The trope about it being the 2nd best language for everything isn't correct. It's taught in universities because it has a very short time to gratification, and the basic syntax is quite intuitive. Academics latched onto it for ML because of some excellent libraries, and it became established as a vital part of the ecosystem from there.
But it's a nightmare to support a moderate to large codebase in production, packaging continues to be a mess, and it's full of weird quirks. Great for weekend projects, but for pete's sake take a minute and port them into something more reliable before going to production with them.
Sure, but that is the thing, especially (as reflected in your examples) for machine learning. The best frameworks (PyTorch, TensorFlow, JAX) are all Python, with support for other languages being an afterthought at best.
The use of scripting languages (Python, Lua in the original Torch) for ML seems to have started partly because the original users were non-developers, more from a math/stats background, and partly because an interactive REPL loop is good for a field like this that is very experimental/empirical.
Does it make sense that we're now building AGI using a scripting language? Not really, but that's where we are!
Another suggestion is that not all of the word/token embedding table might be used, which would be a function of the input used to test, but that would be easy enough to disprove as there would then be different memory usage for different inputs.
It seems possible the reported memory usage is lower than reality if that's how mmap/top work. In any case, a good use of mmap it seems, especially since for a multi-layer model layer weights will be used sequentially so paged load-on-demand will work relatively well even in a low memory situation.
Never understood why people think that indented languages are any simpler when in fact they bring all kinds of trouble for getting things done.
But to your point: until technology itself actually replaces us, deeply skilled computer people are always going to be able to squeeze more performance out of software implemented in high-level languages by those who have not studied computers extensively.
Python is more readable than C. Way better than C++. Far simpler to reason about than Java. Maybe Typescript is on a similar level, but throwing a beginner into the JS ecosystem can be daunting. Perhaps Ruby could be argued as equally simple, but it feels like that's a dead end language these days. Golang is great, but probably not as easy to get rolling with as Python.
What else? Are you going to recommend some niche language no one hires for?
The square brackets alone make it a winner. Array, list, and string indexing. Dictionary lookups. Slices and substrings. List comprehensions. The notational convenience of this alone is immense.
Built-in lists, strings, and dicts. For the 90% of code that is not performance-critical, this is a godsend. Just looking at the C++ syntax for this makes me never want to use an STL data structure for anything trivial.
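A quick toy illustration of how much mileage that one bracket notation gets you:

    # Indexing, slicing, dict lookups, and comprehensions all use one notation.
    words = "the quick brown fox".split()

    first = words[0]                  # list indexing
    tail = words[1:]                  # list slice
    sub = "quick brown"[:5]           # substring
    ages = {"ada": 36, "alan": 41}    # built-in dict literal
    ada = ages["ada"]                 # dict lookup
    lengths = [len(w) for w in words if w != "the"]   # list comprehension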
What people sometimes fail to understand is that code is a means to an end, not an end in itself.
If you want to write code for its own sake, work on an open-source and/or personal project. If you are paid to work on something, you're paid to get that something out, not for it to feature the best code ever.
The peculiarity here is that tools like htop were reporting an 8x improvement, which is interesting, because RAM use is only 2x better due to my change. The rusage.com page fault reporting was also interesting. This is not due to sparseness; it's because htop was subtracting MAP_SHARED memory. The htop docs on my computer say that the color purple is used to display shared memory and yellow is used to display kernel file caches, but it turned out it just uses yellow for both, even though it shouldn't, because mincore() reported that the shared memory had been loaded into the resident set size.
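If you want to check that accounting yourself, here's a Linux-only sketch (my own, not jart's tooling; the filename is a placeholder) that maps a file, faults its pages in, and prints the kernel's own Rss/Shared_Clean numbers for the mapping from /proc/self/smaps, which is exactly the distinction htop is trying to display:

    # Linux-only sketch: compare Rss vs Shared_Clean for a file-backed mapping.
    # "model.bin" is a placeholder for any large file you have lying around.
    import mmap
    import os

    path = os.path.abspath("model.bin")
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    buf = mmap.mmap(fd, size, prot=mmap.PROT_READ, flags=mmap.MAP_SHARED)

    # Touch one byte per page so every page is faulted into the resident set.
    for off in range(0, size, mmap.PAGESIZE):
        buf[off]

    with open("/proc/self/smaps") as f:
        smaps = f.read()

    # Print the kernel's accounting lines for our mapping's entry.
    entry = smaps[smaps.index(path):]
    for line in entry.splitlines()[1:25]:
        if line.startswith(("Size:", "Rss:", "Shared_Clean:", "Private_Clean:")):
            print(line)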
Or hiring useless business people to install around the periphery of engineering. Which is funny because now tech is letting all those folks go.
OCaml is very niche; I feel it’s a hard sell as a general-purpose language. Haskell, 3x that.
JS and TS, could be. But are they so much better than Python, if better at all?
If you just write straight C++ (without C++xx features or anything like it) you can compile the code on machines from decades ago if you want.
What's happened to the popularity of all of these languages since 2010? Outside of JS/TS, absolutely nothing. If anything, they've lost mindshare.
You could run notebooks entirely client side https://jupyterlite.readthedocs.io/en/latest/
The startup is slow but otherwise it is pretty functional.
.NET has P/Invoke which is much nicer.
JVM is getting Panama+jextract, which is the nicest yet. You can go straight from header files to pure Java bindings which don't need any extra native code at all. But it's not shipped yet :(
Strong disagreement. Explicit types make reasoning about Java much easier, especially when you are in an unfamiliar codebase.
Python is not quite the 'write-only' language of Perl, but it is a lot easier to write it than it is to read it.
This isn't just a matter of making the 30B model run in 6GB or whatever. You can now run the largest model, without heavy quantization, and let the OS figure it out. It won't be as fast as having "enough" memory, but it will run.
In theory you could always have done this with swap, but swap is even slower because evictions have to be written back to swap (and wear out your SSD if your swap isn't on glacially slow spinning rust) instead of just discarded because the OS knows where to read it back from the filesystem.
This should also make it much more efficient to run multiple instances at once because they can share the block cache.
(I wonder if anybody has done this with Stable Diffusion etc.)
Even if it doesn't have the best syntax now (which I doubt), the tooling and libraries make it a better choice than any language that has an edge over Python's syntax.
Had you coded your model and training loop in C++, you could easily build a standalone binary (well, it would be GiB+ if you use CUDA... but that's the cost of statically linking cu*).
It then happily runs everywhere as long as a NVIDIA GPU driver is available (don't need to install CUDA).
Protip: Your AI research team REALLY DON'T WANT TO DO THIS BECAUSE THEY LOVE PYTHON. Having Python, even with the dependency management shit, is a feature, not a bug.
(If you want Rust / Go and don't want to wrap libtorch/tf then you have a lot of work to do, but yeah, it's possible. Also there are the model-compiler guys [1], where the promise is model.py in, model.o out; you just link it with your code.)
[1] https://mlc.ai
You’d do better complaining about async support that is still nascent compared to alternatives, or the lack of a JIT in the official implementation.
The forced use of spacing to delineate blocks means you will never see a bunch of brackets eating up screen space and the common error where someone adds another line to an if statement but doesn't add braces.
Semicolons not being conventional means less screen noise and fewer code-golf one-liners.
The focus on imperative over functional style means you rarely ever see something like a(b(c(d(e(f(g)))))).
PHP suffers greatly from poorly named standard functions on top of all of that.
Don't get me started on Ruby metaprogramming.
These are just the things I could think of off the top of my head; I do not want to spend my afternoon on this. This is just my experience from looking at code for over 20 years; you either believe it or you don't. There are no scientific studies proving that one syntax feature is superior.
I highly doubt that everyone chose python just because Google did. Python was a giant step in syntax compared to the competition back then, and now even if there is a new language out there right now that has a better syntax, it's not going to be better by much, and it is not going to have the tooling, libraries, or the community.
Suppose you have a big piece of compute hardware (e.g. at a university) which is shared by multiple users. They all want to come in and play with these models. Each one is tens to hundreds of gigabytes. Is each user supposed to have their own copy in their home directory?
But now we encounter this broken nonsense because solved problems get unsolved by bad software.
Maybe, not sure? My point was that both the syntax and Google using it was more relevant 15 years ago than now.
(I don't have much of an opinion on the 15+ years ago thing.)
Python's concrete syntax is harder to manipulate programmatically than JavaScript's concrete syntax.
For instance, to insert one statement into another block, we need to traverse the lines of that syntax and add the right amount of indentation. We can't just plant the syntax into the desired spot and be done with it.
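A toy example of what that means in practice (my own; the function names are made up):

    # Splicing a statement into a Python block means re-indenting it to match;
    # in a brace-delimited language you could paste the text verbatim.
    import textwrap

    block = "if ready:\n    launch()\n"
    new_stmt = "log('about to launch')"

    # Match the indentation of the insertion point before planting the code.
    indented = textwrap.indent(new_stmt, "    ")
    spliced = block.replace("    launch()", indented + "\n    launch()")
    print(spliced)
    # if ready:
    #     log('about to launch')
    #     launch()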
Is python syntax worse than any brand new languages like rust or go? Absolutely not. It's still better.
Did Google stop using it? I don't think so, but I also don't think people picked it just because Google did.
Python is not the cause of dependency hell. Deep dependency trees are. The only way to deal with this is to use separate environments and to carefully specify the exact requirements.
Those who claim some language would be a magical fix clearly lack experience in multiple languages.
Repeat, ad infinitum. In the cracks you'll find people re-learning things they should have known, if only they weren't slagging off the grey beards... or, even worse, grey beards not paying attention to the discoveries of youth.
> Most people don't know about it. The ones who do, are reluctant to use it.
Not so sure about this. The reluctance is emotional, it's not technical. Nobody is killing POSIX under all of this - it is deployed. Therefore, learn it.
> so any thought devs may have had previously about using mmap() would have certainly triggered fears w.r.t. WIN32
Does not compute. Own up, you're an AI.
Because the text we write is not evenly distributed random noise, what we encode into it (by writing) is entropy.
Because LLMs model text with inference, they model all of the entropy that is present.
That would mean that the resulting size would be a measure of entropy (the sum of patterns) divided by repetition (recurring patterns). In this count, I would consider each unique token on its own to be an instance of the identity pattern.
So to answer both questions: yes.
The impression about Haskell’s nicheness compared with OCaml prevails. But Haskell has a larger userbase and a larger library ecosystem than OCaml.
Btw, I wish they would take some inspiration from Haskell's syntax.
Haskell also has significant whitespace, but it's defined as syntactic sugar for a more traditional syntax with curly braces and semicolons.
Approximately no-one uses that curly-brace syntax, but it's good for two things:
- silences the naysayers
- more importantly: allows you to copy-paste code even into forms that mess up your indentation.
Assuming you have the model file downloaded (you can use wget to download it), these are the instructions to install and run:
pkg install git
pkg install cmake
pkg install build-essential
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j
./main -m /path/to/your/model.bin -p "your prompt here"
I've been very _careful_ too (using pyenv/virtualenvs etc.) with dependency management, but between Nvidia driver dependencies and "missing sqlite3/bz2" issues related to the underlying interpreter (not to mention issues with different Python 3.x versions), I'm lucky to be able to even run a 'hello world' ML sample after an afternoon of fighting with it.
My Ubuntu install w/ Nvidia card only seems to recognize the GPU in some circumstances even when using the same `conda` env. Often this is remedied by rebooting the machine(?).
No idea how companies manage this stuff in production. Absolute minefield that seems to catastrophically break if you sneeze at it.
I'll admit I am not an expert in managing ML envs, but I've dealt with a lot of python environments for typical CRUD stuff, and while rough at times, it was never this bad.
At the time I evaluated other languages to learn, narrowed it down to Ruby and Python, and picked Python as I felt it had a nicer syntax than Ruby. And the "one way to do things" philosophy. This was back in 2005 or so.
What other languages of that period would you say had a nicer syntax than Python?
16 cores would be about 4x faster than the default 4 cores. Eventually you hit memory bottlenecks, so 32 cores is not twice as fast as 16 cores, unfortunately.
It's like Python 2 vs Python 3 except even worse.