
24 points by pajeets | 2 comments

This is my first time doing colocation, and as an experiment I picked up a server with 1TB of DDR4 RAM and 24 cores (around 2.2 GHz each).

What should I do now? What stuff should I run?

I have a hello-world Flask app running, but obviously that's not enough to use the machine's full potential.

I'm thinking of running KVM and selling a few VDS to friends or companies.

I also thought of running thousands of Selenium browser tests, but I do that maybe once a year, which is not enough to keep the server busy 24/7.

Help! I might have gone overboard with server capacity. I will never have to pay for AWS again; I can run every project, API, and database I want and still have capacity left over.

evanjrowley:
Try llama.cpp with the biggest LLM you can find.
pajeets:
You'd need at least a 3090 for that.
kkielhofner:
llama.cpp and others can run purely on CPU[0]. Even production-grade serving frameworks like vLLM have CPU support[1].

There are a variety of other LLM inference implementations that can run on CPU as well.

[0] - https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#su...

[1] - https://docs.vllm.ai/en/v0.6.1/getting_started/cpu-installat...
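
A minimal CPU-only sketch, assuming the llama-cpp-python bindings (pip install llama-cpp-python) and an already-downloaded GGUF file; the model path and prompt below are placeholders:

    # CPU-only inference via the llama-cpp-python bindings.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-3.1-70b-instruct-q4_k_m.gguf",  # placeholder path
        n_ctx=4096,    # context window
        n_threads=24,  # match the physical core count
    )

    out = llm("Q: Why is the sky blue? A:", max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"])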

pajeets:
Wait, this is crazy.

What model can I run on 1TB, and how many tokens per second?

For instance, Nvidia's Nemotron (Llama 3.1) quantized: what speed would that run at? I'll get a GPU too, but I'm not sure how much VRAM gives the best value for the buck.

kkielhofner:
> What model can I run on 1TB

With 1TB of RAM you can run nearly anything available (405B models are essentially the largest at the moment). Llama 405B at FP8 precision fits on 8x H100, which is 640GB of VRAM. Quantization is a deep and involved topic in its own right (far too much for an HN comment).
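
As a rough rule of thumb (my own numbers, not a benchmark): weight memory is roughly parameter count times bytes per parameter, before KV cache and runtime overhead.

    # Back-of-the-envelope weight memory for a dense model (ignores KV cache and overhead).
    def weight_gb(params_billion: float, bits_per_weight: int) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for bits in (16, 8, 4):
        print(f"405B @ {bits}-bit ~ {weight_gb(405, bits):.0f} GB")
    # ~810 GB at FP16, ~405 GB at FP8, ~203 GB at 4-bit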

I'm aware it "works", but I don't bother with CPU, GGUF, or even llama.cpp, so I can't really speak to them. They're just not remotely usable for my applications.

> tokens per second

Sloooowwww. With 405B it could very well be seconds per token, but this is where a lot of system factors come in. You can find benchmarks out there, but you'll see things like a very high-spec bare-metal AMD EPYC system with very fast DDR4/5 and tons of memory channels doing low single-digit tokens per second with a 70B model.
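
A crude way to see why (my own estimate, not those benchmarks): single-stream decoding on a dense model is roughly memory-bandwidth bound, since each generated token has to stream all of the weights from RAM. The bandwidth and model-size figures below are illustrative assumptions.

    # Rough upper bound: tokens/s ~ memory bandwidth / bytes of weights read per token.
    def est_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    # Illustrative numbers only: ~150 GB/s for a multi-channel DDR4 system,
    # a 70B model at 8-bit is ~70 GB of weights.
    print(f"~{est_tokens_per_sec(150, 70):.1f} tok/s")  # about 2 tok/s
    # At FP16 (~140 GB) the same model would land around 1 tok/s.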

> I'll get a GPU too, but I'm not sure how much VRAM gives the best value for the buck

Most of my experience is with top-end GPUs, so I can't really speak to this. You may want to pop in at https://www.reddit.com/r/LocalLLaMA/ - there is much more expertise there for this range of hardware (CPU and/or more VRAM-limited GPU configs).