https://www.youtube.com/watch?v=coIj2CU5LMU
Would this version (ggerganov) work with one of those methods?
I did the following:
1. Create a new working directory.
2. git clone https://github.com/ggerganov/llama.cpp
3. Download the latest release from https://github.com/ggerganov/llama.cpp/releases (note the CPU requirements in the filename) and unzip it directly into the working directory's llama.cpp/ folder, so the .exe files and the .py scripts end up in the same directory.
4. Open PowerShell, cd into the working directory's llama.cpp folder, create a new Python virtual environment (python3 -m venv env), and activate it (.\env\Scripts\Activate.ps1).
5. Obtain the LLaMA model(s) via the magnet torrent link and place them in the models directory. I used 30B; it is slow but usable on my system. It is not at ChatGPT-3 level, especially for programming questions, but it is still impressive.
6. python3 -m pip install torch numpy sentencepiece
7. python3 convert-pth-to-ggml.py models/30B/ 1 (the trailing 1 selects f16 output; you may delete the original .pth model files after this step to save disk space)
8. .\quantize.exe ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q4_0.bin 2 (the trailing 2 selects q4_0 quantization)
9. I copied examples/chat-13B.bat to a new chat-30B.bat file, updated the model directory, and changed the executable call on the last line of the script to: .\main.exe
10. Run using: .\examples\chat-30B.bat
https://github.com/ggerganov/llama.cpp#usage has more details, though it assumes the 7B model and skips a few of the steps above.
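For reference, the steps above can be collected into a single PowerShell sequence (a sketch only: it assumes you have already downloaded the 30B weights into models/30B/ and extracted the release zip into llama.cpp/, and it skips the manual chat-30B.bat edit):

```powershell
# Clone the repo; the prebuilt Windows binaries go in here too (steps 2-3)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# (unzip the release .zip here so quantize.exe and main.exe sit next to the .py scripts)

# Create and activate a virtual environment, then install the conversion dependencies (steps 4, 6)
python3 -m venv env
.\env\Scripts\Activate.ps1
python3 -m pip install torch numpy sentencepiece

# Convert the .pth weights to ggml f16, then quantize to 4-bit q4_0 (steps 7-8)
python3 convert-pth-to-ggml.py models/30B/ 1
.\quantize.exe .\models\30B\ggml-model-f16.bin .\models\30B\ggml-model-q4_0.bin 2

# Launch the chat script (steps 9-10, after editing your copy of the .bat file)
.\examples\chat-30B.bat
```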