Llama.cpp 30B runs with only 6GB of RAM now

(github.com)

1311 points msoad | 4 comments | 31 Mar 23 20:37 UTC

jart ◴[31 Mar 23 21:02 UTC] No.35393615[source]▶

Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.

replies(24): >>35393868 #>>35393942 #>>35394089 #>>35394097 #>>35394107 #>>35394203 #>>35394208 #>>35394244 #>>35394259 #>>35394288 #>>35394408 #>>35394881 #>>35395091 #>>35395249 #>>35395858 #>>35395995 #>>35397318 #>>35397499 #>>35398037 #>>35398083 #>>35398427 #>>35402974 #>>35403334 #>>35468946 #

intelVISA ◴[31 Mar 23 22:03 UTC] No.35394288[source]▶

>>35393615 #

Didn't expect to see two titans today: ggerganov AND jart. Can y'all slow down? You make us mortals look bad :')

Seeing such clever use of mmap makes me dread to imagine how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero-copy solutions.

Perhaps SWE is dead after all, but LLMs didn't kill it...

replies(11): >>35395112 #>>35395145 #>>35395165 #>>35395404 #>>35396298 #>>35397484 #>>35398972 #>>35399367 #>>35400001 #>>35400090 #>>35456064 #

1. ok123456 ◴[31 Mar 23 23:22 UTC] No.35395112[source]▶

>>35394288 #

You can mmap from Python.
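For anyone who hasn't tried it, here is a minimal sketch using the standard-library mmap module. The file name and contents are made up for illustration (a small stand-in for a weights file); this is not how llama.cpp itself loads models.

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for a large weights file (1 MiB of sample bytes).
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 4096)

with open(path, "rb") as f:
    # A length of 0 maps the whole file; ACCESS_READ makes the mapping read-only.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        mapped_size = len(mm)
        # Slicing copies out only the requested bytes; the OS pages in
        # just the parts of the file that are actually touched.
        chunk = mm[4096:4104]
```

The mapping is created without reading the file up front, which is the same property that makes mmap attractive for fast model loading.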

replies(2): >>35397073 #>>35400396 #

2. westurner ◴[01 Apr 23 04:17 UTC] No.35397073[source]▶

>>35395112 (TP) #

The CPython mmap module docs: https://docs.python.org/3/library/mmap.html

zero_buffer (CFFI, 2013) https://github.com/alex/zero_buffer/blob/master/zero_buffer....

"Buffers on the edge: Python and Rust" (2022) https://alexgaynor.net/2022/oct/23/buffers-on-the-edge/ :

> If you have a Python object and want to obtain its buffer, you can do so with memoryview in Python or PyObject_GetBuffer in C. If you’re defining a class and want to expose a buffer, you can do so in Python by… actually you can’t, only classes implemented in C can implement the buffer protocol. To implement the buffer protocol in C, you provide the bf_getbuffer and bf_releasebuffer functions which are called to obtain a buffer from an object and when that buffer is being released, respectively.
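A small sketch of the Python side of that quote, using only built-ins. The variable names are illustrative; the point is that memoryview exposes an object's buffer without copying it.

```python
# bytearray implements the buffer protocol, so memoryview can wrap it.
data = bytearray(b"hello buffer protocol")

view = memoryview(data)   # obtains the buffer; no bytes are copied
window = view[6:12]       # slicing a memoryview is also zero-copy

# Overwrite the same range in place (same length, so no resize is needed).
data[6:12] = b"BUFFER"

# The view reflects the change because it shares memory with `data`.
seen = bytes(window)
```

Because the view and the bytearray share storage, the write through `data` is visible through `window`.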

iocursor (CPython C API, ~Rust std::io::Cursor) https://github.com/althonos/iocursor

Arrow Python (C++) > On disk and MemoryMappedFiles: https://arrow.apache.org/docs/python/memory.html#on-disk-and...

"Apache Arrow: Read DataFrame With Zero Memory" (2020) https://towardsdatascience.com/apache-arrow-read-dataframe-w...

pyarrow.Tensor: https://arrow.apache.org/docs/python/generated/pyarrow.Tenso...

ONNX models are serialized with Protocol Buffers (protocolbuffers/protobuf), while Arrow uses FlatBuffers (google/flatbuffers) for its IPC metadata.

FlatBuffers https://en.wikipedia.org/wiki/FlatBuffers :

> It supports “zero-copy” deserialization, so that accessing the serialized data does not require first copying it into a separate part of memory. This makes accessing data in these formats much faster than data in formats requiring more extensive processing, such as JSON, CSV, and in many cases Protocol Buffers. Compared to other serialization formats however, the handling of FlatBuffers requires usually more code, and some operations are not possible (like some mutation operations).

3. lostmsu ◴[01 Apr 23 14:03 UTC] No.35400396[source]▶

>>35395112 (TP) #

In fact, you can mmap from PyTorch directly.

replies(1): >>35401406 #

4. ok123456 ◴[01 Apr 23 16:11 UTC] No.35401406[source]▶

>>35400396 #

and numpy
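A rough sketch of the NumPy route with numpy.memmap. The file name, dtype, and shape are assumptions for illustration; a real model file would need its own layout information.

```python
import os
import tempfile

import numpy as np

# Hypothetical raw tensor file: 12 float32 values written back to back.
path = os.path.join(tempfile.mkdtemp(), "tensor.f32")
np.arange(12, dtype=np.float32).tofile(path)

# Map the file as a read-only 3x4 array; data is paged in on access
# rather than read into memory up front.
weights = np.memmap(path, dtype=np.float32, mode="r", shape=(3, 4))

# Touching one row only needs the pages backing that row.
row_sum = float(weights[1].sum())
```

The result behaves like an ordinary ndarray for indexing and arithmetic, so downstream code usually does not need to know it is file-backed.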
