However, do I need to install the CUDA toolkit on the host?
I haven't installed the CUDA toolkit when using a containerized platform (like Docker).
The Nvidia driver + Nvidia Container Toolkit will do the job. You can check the official instructions at [0]
[0] https://docs.nvidia.com/datacenter/cloud-native/container-to...
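Once the driver and container toolkit are in place, GPU access is just a flag or a compose entry. A minimal sketch in Docker Compose syntax (the service name and image are examples, not from the thread):

```yaml
# docker-compose.yml -- request all host GPUs for one service
services:
  llm:
    image: ollama/ollama   # example image; any CUDA-using image works
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

A quick sanity check is `docker run --rm --gpus all <cuda-image> nvidia-smi`: if the driver and toolkit are set up, the GPU shows up inside the container with no CUDA toolkit installed on the host.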
Edit: I've loaded Llama 3.1 8B Instruct GGUF and got 12.61 tok/sec, and 80 tok/sec for 3.2 3B.
However I've found quality of smaller models to be quite lacking. The Llama 3.2 3B for example is much worse than Gemma2 9B, which is the one I found performs best while fitting comfortably.
Actual sentences are fine, but it doesn't follow prompts as well and it doesn't "understand" the context very well.
Quantization brings down memory cost, but there seems to be a sharp quality decline below 5 bits, so a larger but heavily quantized model usually performs worse, at least with the models I've tried so far.
So with only 6GB of GPU memory I think you either have to accept the hit on inference speed by only partially offloading, or accept fairly low model quality.
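The arithmetic behind that trade-off is straightforward. A rough sketch (the 20% overhead factor is a rule of thumb for KV cache and activations, not a measured number):

```python
# Rough VRAM estimate for a quantized model:
#   params * bits_per_weight / 8, plus ~20% overhead for
#   KV cache and activations (a rough rule of thumb).
def vram_gb(n_params_b: float, bits: float, overhead: float = 1.2) -> float:
    """Estimated memory footprint in GB for n_params_b billion parameters."""
    return n_params_b * bits / 8 * overhead

# A 9B model at 6-bit quantization needs roughly 8 GB -- too big for a
# 6 GB card, so some layers have to stay on the CPU.
print(round(vram_gb(9, 6), 1))  # → 8.1
```

By the same estimate a 3B model at 4 bits fits in under 2 GB, which is why the small models are the comfortable option on a 6 GB card.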
Doesn't mean the smaller models can't be useful, but don't expect ChatGPT 4o at home.
That said, if you've got a beefy CPU, it can be reasonable to have it handle a few of the layers.
Personally, I found Gemma2 9B quantized to 6-bit (IIRC) to be quite useful. YMMV.
Personally, I have some notes and bookmarks that I'd like to scrape, then have an LLM summarize, generate hierarchical tags, and store in a database. For the notes part at least, I wouldn't want to give them to another provider; even for the bookmarks, I wouldn't be comfortable passing my reading profile to anyone.
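As a sketch of that pipeline, assuming a local Ollama server on its default port (the model name, prompts, and table schema here are placeholders, not a tested setup):

```python
import json
import sqlite3
import urllib.request

# Local Ollama endpoint (default port); nothing leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_llm(prompt: str, model: str = "gemma2:9b") -> str:
    """Send a prompt to the local model and return its text response."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def store(conn: sqlite3.Connection, url: str,
          summary: str, tags: list[str]) -> None:
    """Persist one summarized bookmark with its hierarchical tags."""
    conn.execute("CREATE TABLE IF NOT EXISTS bookmarks"
                 " (url TEXT PRIMARY KEY, summary TEXT, tags TEXT)")
    conn.execute("INSERT OR REPLACE INTO bookmarks VALUES (?, ?, ?)",
                 (url, summary, "/".join(tags)))
    conn.commit()

# Sketch of the loop (scraping left out):
# for url, text in scraped_pages:
#     store(conn, url, ask_llm(f"Summarize:\n{text}"),
#           ask_llm(f"Give 3 hierarchical tags for:\n{text}").split())
```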
Testing performance this way, I got about 0.5-1.5 tokens per second with an 8 GB 4-bit quantized model on an old DL360 rack-mount server with 192 GB RAM and two E5-2670 CPUs. I got about 20-50 tokens per second on my laptop with a mobile RTX 4080.
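Measuring throughput like this is simple to do yourself. A small helper, assuming any streaming API that yields tokens one at a time (the generator interface here is hypothetical, not a specific library's):

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Time a streaming generation call and report throughput.
    `generate` is any callable that yields tokens one at a time,
    e.g. a streaming llama.cpp binding (hypothetical here)."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in generate(prompt))
    return n_tokens / (time.perf_counter() - start)
```

Wrapping the timer around the whole stream (rather than per token) includes prompt-processing time, so short prompts with long outputs give the most representative numbers.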
This is one of the reasons why I recently added a floating chat to https://recurse.chat/ to quickly access local LLMs.
Here's a demo: https://x.com/recursechat/status/1846309980091330815
It was so easy to get other non-AI stuff running!
You can find plenty of uncensored LLM models here:
[1]: I personally suspect that many LLMs are still trained on WebText, derivatives of WebText, or using synthetic data generated by LLMs trained on WebText. This might be why they feel so "censored":
>WebText was generated by scraping only pages linked to by Reddit posts that had received at least three upvotes prior to December 2017. The corpus was subsequently cleaned
The implications of so much AI trained on content upvoted by 2015-2017 redditors is not talked about enough.
It reminds a bit of making web sites with a page builder. Easy to install and click around to get something running without thinking too much about it fairly quickly.
Problems are quite similar also, training wheels getting stuck in the woods more easily, hehe.
That being said, I think the more straightforward approach would be to utilize an existing library like https://github.com/collabora/WhisperLive/ within a Docker container. This way, you can call it via WebSocket and integrate it with your LLM, which could also serve as a nice feature in your product.
I've actually been playing around with speech-to-text recently. Thanks for the pointer; Docker is a bit too heavy to deploy for a desktop app use case, but it's good to know about the repo. Building binaries with PyInstaller could be an option though.
Real-time transcription seems a bit complicated since it involves VAD, so a feasible path for me is to first ship simple transcription with whisper.cpp. large-v3-turbo looks fast enough :D
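For the simple-transcription path, shelling out to the whisper.cpp CLI is enough. A sketch that builds the invocation (flags follow whisper.cpp's example `main` program: `-m` model, `-f` input WAV, `-otxt` text output; the binary and model paths are placeholders for your build):

```python
import subprocess

def whisper_cmd(audio_path: str,
                model_path: str = "models/ggml-large-v3-turbo.bin",
                binary: str = "./main") -> list[str]:
    """Build a whisper.cpp CLI call that writes <audio_path>.txt.
    Paths are placeholders; point them at your own build and model."""
    return [binary, "-m", model_path, "-f", audio_path, "-otxt"]

# subprocess.run(whisper_cmd("note.wav"), check=True)  # writes note.wav.txt
```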
Given a 16 GB system with CPU inference only, I'm hosting Gemma2 9B at Q8 for LLM tasks and SDXL Turbo for image work, and besides the memory usage creeping up for a second or so while I invoke a prompt, they're basically undetectable in the background.
Self-hosting static or almost-static websites is now really easy with a Cloudflare front. I just closed my account on SmugMug and published my images locally using my NAS; this costs basically no extra money since the photos were already on the NAS, and the NAS is already powered on 24/7.
The NAS I use is an Asustor, so it's not really Linux and you can't install whatever you want on it, but it has Apache, Python, and PHP with the SQLite extension, which is more than enough for basic websites.
Cloudflare free is like magic. Response times are near instantaneous and setup is minimal. You don't even have to configure an SSL certificate locally, it's all handled for you and works for wildcard subdomains.
And of course if one puts a real server behind it, like in the post, anything's possible.
Depends on the model, but in general, no.
...but it's fine for simple one-liner commands like "how do I revert my commit?" or "rename these files to camelcase".
> How often does it fail?
Immediately and constantly if you ask anything hard.
An 8B model is not ChatGPT. The 3B model in the OP is not ChatGPT.
The capability gap compared to Sonnet/4o is like a potato versus a car.
Search for 'LLM Leaderboard' and you can see for yourself. The 8b models do not even rank. They're generally not capable enough to use as a self hosted assistant.
But I haven't yet found any "uncensored" ones (on ollama) that work. Did I miss something?
(On the contrary: when ChatGPT first came out, it was trivial to jailbreak it to make it write erotica.)
I have a VPN on a Raspberry Pi, and with that I can connect to my self-hosted cloud, dev/staging servers for projects, GitLab, etc. when I'm not on my home network.
I use a Tesla P4 for ML stuff at home; it's roughly equivalent to a 1080 Ti and has a compute capability of 6.1. A 2070 (they don't list the "Super") is a 7.5.
For reference, 4060 Ti, 4070 Ti, 4080 and 4090 are 8.9, which is the highest score for a gaming graphics card.
I tried gemma-2-27b-it-Q4_K_L but it's not as good, despite being larger.
Using llama.cpp and models from here[1].
Cloudflare is pretty strict about the HTML-to-media ratio and might suspend or terminate your account if you are serving too many images.
I've read far too many horror stories about this on HN alone, so please make sure what you're doing is allowed by their ToS.
e.g. is running a personal photography website OK?
PS: talking about Cloudflare being snappy when content is being served from an Asustor NAS made me chuckle.
Take a look at whether Cloudflare Pages + Cloudflare R2 meets the needs of your site.
I'd also recommend using Cloudflare Tunnels (under Zero Trust) rather than punching a hole in your firewall, for a number of reasons.
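With a tunnel, the NAS makes an outbound connection to Cloudflare and no inbound port is exposed at all. A minimal `cloudflared` config sketch (the hostname, port, and paths are examples):

```yaml
# ~/.cloudflared/config.yml -- route a public hostname to a local server
tunnel: <TUNNEL-UUID>                  # from `cloudflared tunnel create`
credentials-file: /home/user/.cloudflared/<TUNNEL-UUID>.json
ingress:
  - hostname: photos.example.com
    service: http://localhost:8080     # your NAS web server
  - service: http_status:404           # required catch-all rule
```

Then `cloudflared tunnel run` keeps the connection up, and DNS for the hostname is pointed at the tunnel rather than your home IP.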
The images that are published are low-res versions copied to a directory on a partition accessible to the web server.
This is not the safest solution, as it does punch a hole in the LAN... It's kind of an experiment... We'll see how it goes.
This is not true. On benchmarks, maybe, but I find the LLM Arena more accurately accounts for the subjective experience of using these things, and Llama 3.1 8B ranks relatively high, outperforming GPT-3.5 and certain iterations of 4.
Where the 8Bs do struggle is that they don't have as deep a repository of knowledge, so using them without some form of RAG won't get you as good results as using a plain larger model. But frankly I'm not convinced that RAG-free chat is the future anyway, and 8B models are extremely fast and cheap to run. Combined with good RAG they can do very well.
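The RAG pattern itself is just retrieval plus prompt assembly. A toy sketch (keyword overlap stands in for real vector embeddings, which any serious setup would use):

```python
from collections import Counter

def score(query: str, doc: str) -> int:
    """Toy keyword-overlap score -- a stand-in for embedding similarity."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Basic RAG pattern: prepend the top-k retrieved snippets so a small
    model answers from context instead of its limited built-in knowledge."""
    top = sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]
    return ("Answer using only this context:\n" + "\n".join(top)
            + f"\n\nQuestion: {query}")
```

The point is that the 8B model only has to read and synthesize the retrieved context, which plays to its strengths, rather than recall facts from its weights.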
But yes, like the guy above said, it's really only helpful for one-line commands. Like if I forgot some sort of flag that's available for a certain type of command. Or random things I don't work with often enough to memorize their little build commands, etc. It's not helpful for programming, just simple commands.
It also can help with unstructured or messy data to make it more readable, although there's potential to hallucinate if the context is at all large.
> 8B models are extremely fast and cheap to run
yes.
> Combined with good RAG they can do very well.
This is simply not true. They perform at a level which is useful for simple, trivial tasks.
If you consider that 'doing well', then sure.
However, if, like the parent post, you want to be writing scripts, which is specifically what they asked... then: heck, what 8B are you using, because llama 3.1 is shit at it out of the box.
¯\_(ツ)_/¯
A working unit test can take 6 or 7 iterations with a good prompt. Forget writing logic. Creating classes? Using RAG to execute functions from a spec? Forget it.
That's not the level that I need for an assistant.