1. taosx No.41856567
For the people who self-host LLMs at home: what use cases do you have?

Personally, I have some notes and bookmarks that I'd like to scrape, then have an LLM summarize, generate hierarchical tags, and store in a database. For the notes part at least, I wouldn't want to give them to another provider; even for the bookmarks, I wouldn't be comfortable passing my reading profile to anyone.
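
A minimal sketch of that pipeline, assuming a local Ollama server on its default port with llama3.1:8b pulled; the prompts, table layout, and model choice are illustrative, not prescriptive:

    # Summarize + tag scraped notes with a local LLM; nothing leaves the machine.
    # Assumes Ollama on localhost:11434 (its default) with llama3.1:8b pulled.
    import sqlite3
    import requests

    def ask(prompt: str) -> str:
        # /api/generate is Ollama's plain (non-chat) completion endpoint.
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
            timeout=120,
        )
        r.raise_for_status()
        return r.json()["response"].strip()

    db = sqlite3.connect("notes.db")
    db.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT, summary TEXT, tags TEXT)")

    for body in ["...one scraped note or bookmark per item..."]:
        summary = ask(f"Summarize in two sentences:\n\n{body}")
        tags = ask(f"Suggest 3-5 hierarchical tags (e.g. dev/python/asyncio), comma-separated:\n\n{body}")
        db.execute("INSERT INTO notes VALUES (?, ?, ?)", (body, summary, tags))
    db.commit()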

replies(11): >>41856653 #>>41856701 #>>41856881 #>>41856970 #>>41856992 #>>41857395 #>>41858199 #>>41858353 #>>41861443 #>>41864562 #>>41890288 #
2. xyc No.41856653
llama3.2 1b & 3b is really useful for quick tasks like creating some quick scripts from some text, then pasting them to execute as it's super fast & replaces a lot of temporary automation needs. If you don't feel like invest time into automation, sometimes you can just feed into an LLM.

This is one of the reasons why I recently added a floating chat to https://recurse.chat/ for quick access to local LLMs.

Here's a demo: https://x.com/recursechat/status/1846309980091330815

replies(2): >>41856827 #>>41857089 #
3. segalord No.41856701
I use it exclusively for users on my personal website to chat with my data. I've given the setup tools that have read access to my files and data.
replies(1): >>41856740 #
4. netdevnet No.41856740
Is this not something you can do with non-self-hosted LLMs like ChatGPT? If you expose your data, it should be able to access it, iirc.
replies(1): >>41856978 #
5. taosx No.41856827
Looks very nice, saved it for later. Last week, I worked on implementing always-on speech-to-text functionality for automating tasks. I've made significant progress, achieving decent accuracy, but I set myself the constraint of implementing certain parts from scratch so I can ship a single deployable binary, which means I still have work to do (audio processing is new territory for me). However, I'm optimistic about its potential.

That being said, I think the more straightforward approach would be to utilize an existing library like https://github.com/collabora/WhisperLive/ within a Docker container. This way, you can call it via WebSocket and integrate it with my LLM, which could also serve as a nice feature in your product.
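
For reference, the client side of that setup is small. This roughly follows the WhisperLive README, but the constructor arguments may have changed, so treat it as a sketch to check against the repo:

    # Stream the microphone to a dockerized WhisperLive server over WebSocket.
    # Argument names follow the project's README; verify against the current repo.
    from whisper_live.client import TranscriptionClient

    client = TranscriptionClient(
        "localhost", 9090,   # host/port the Docker container exposes
        lang="en",
        translate=False,
        model="small",       # server-side Whisper model size
        use_vad=True,        # let the server do voice activity detection
    )
    client()  # blocks, printing transcripts as audio arrives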

replies(1): >>41856926 #
6. TechDebtDevin No.41856881
I keep an 8b running with ollama/openwebui and ask it to format things, summarize, and generate SQL/simple bash commands and whatnot.
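
For the SQL part, that workflow is a single call; a hedged example via the ollama Python client, with a made-up schema and model name:

    # Ask a local 8B model for SQL from a plain-English description.
    # Model name and table schema are illustrative only.
    import ollama

    resp = ollama.chat(
        model="llama3.1:8b",
        messages=[{
            "role": "user",
            "content": "Table orders(id, customer_id, total, created_at). "
                       "Write SQL for total revenue per customer in 2024, highest first.",
        }],
    )
    print(resp["message"]["content"])
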
replies(1): >>41856966 #
7. xyc No.41856926{3}
Thanks! lmk when/if you wanna give it a spin; the free trial hasn't been updated with the latest, but I'll try to do that this week.

I've actually been playing around with speech-to-text recently. Thank you for the pointer. Docker is a bit too heavy to deploy for a desktop app use case, but it's good to know about the repo. Building binaries with PyInstaller could be an option, though.

Real-time transcription seems a bit complicated as it involves VAD (voice activity detection), so a feasible path for me is to first ship simple transcription with whisper.cpp. large-v3-turbo looks fast enough :D
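
The non-live path is pleasantly simple; a sketch that shells out to the whisper.cpp CLI on a finished recording (binary and model paths depend on your local build):

    # Transcribe a finished recording with whisper.cpp instead of streaming + VAD.
    import subprocess

    subprocess.run(
        [
            "./main",                                # whisper.cpp example binary ("whisper-cli" in newer builds)
            "-m", "models/ggml-large-v3-turbo.bin",  # the large-v3-turbo weights mentioned above
            "-f", "recording.wav",                   # expects 16 kHz mono WAV
            "-otxt",                                 # also writes recording.wav.txt
        ],
        check=True,
    )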

replies(1): >>41856983 #
8. worldsayshi No.41856966
So 8b is really smart enough to write scripts for you? How often does it fail?
replies(1): >>41857044 #
9. laniakean No.41856970
I mostly use it to write quick scripts or generate text that follows some pattern. Also, getting it up and running with LM Studio is pretty straightforward.
10. worldsayshi No.41856978{3}
You can absolutely do that but then you pay by the token instead of a big upfront hardware cost. It feels different I suppose. Sunk cost and all that.
11. taosx No.41856983{4}
Yes it's fast enough, especially if you don't need something live.
12. ein0p No.41856992
I run Mistral Large on 2xA6000. 9 times out of 10 the response is the same quality as GPT-4o's. My employer does not allow the use of GPT for privacy-related reasons, so I just use a private Mistral for that.
13. wokwokwok No.41857044{3}
> So 8b is really smart enough to write scripts for you?

Depends on the model, but in general, no.

...but it's fine for simple one-liner commands like "how do I revert my commit?" or "rename these files to camelCase".

> How often does it fail?

Immediately and constantly if you ask anything hard.

An 8B model is not ChatGPT. The 3B model in the OP's post is not ChatGPT.

Comparing their capability to Sonnet/4o is like comparing a potato to a car.

Search for 'LLM Leaderboard' and you can see for yourself. The 8B models do not even rank. They're generally not capable enough to use as a self-hosted assistant.

replies(2): >>41857515 #>>41859155 #
14. afro88 No.41857089
Can you list some real temporary automation needs you've fulfilled? The demo shows asking for facts about space. Lower-param models don't seem great as raw chat models, so I'm interested in what they're doing well for you in this context.
replies(1): >>41863457 #
15. archerx No.41857395
For me at least, the biggest feature of some self-hosted LLMs is that you can get them to be “uncensored”: you can get them to tell you dirty jokes, or have the bias removed on controversial and politically incorrect subjects. Basically you have a freedom you won’t get from most of the main providers.
replies(1): >>41857684 #
16. worldsayshi No.41857515{4}
I really hope we can get Sonnet-like performance down to a single consumer-level GPU sometime soon. Maybe the hardware will get there before the models.
replies(1): >>41861683 #
17. ndheebebe No.41857684
And reliability. When Azure sends you the "censored output" status code, it has basically failed, and no retry is going to help. And unless you are some corp, you won't get approved for lifting the censoring.
18. williamcotton No.41858199
I work with a lot of attorneys'-eyes-only documents, and most protective orders do not allow shipping these files off to a third party.
19. Rick76 No.41858353
I essentially use it like everyone else: to search through my personal documents, since I can control the token size and file embedding.
20. lolinder No.41859155{4}
> Search for 'LLM Leaderboard' and you can see for yourself. The 8B models do not even rank.

This is not true. On benchmarks, maybe, but I find the LLM Arena more accurately accounts for the subjective experience of using these things, and Llama 3.1 8B ranks relatively high, outperforming GPT-3.5 and certain iterations of 4.

Where the 8Bs do struggle is that they don't have as deep a repository of knowledge, so using them without some form of RAG won't get you as good results as using a plain larger model. But frankly I'm not convinced that RAG-free chat is the future anyway, and 8B models are extremely fast and cheap to run. Combined with good RAG they can do very well.
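
A minimal sketch of that combination, assuming Ollama with nomic-embed-text and llama3.1:8b pulled; every name and snippet here is illustrative:

    # Tiny RAG loop: embed snippets locally, retrieve by cosine similarity,
    # and stuff the top hits into an 8B model's prompt.
    import numpy as np
    import requests

    OLLAMA = "http://localhost:11434"

    def embed(text: str) -> np.ndarray:
        r = requests.post(f"{OLLAMA}/api/embeddings",
                          json={"model": "nomic-embed-text", "prompt": text})
        return np.array(r.json()["embedding"])

    docs = ["first note or doc snippet...", "second snippet..."]
    doc_vecs = np.stack([embed(d) for d in docs])

    def answer(question: str, k: int = 2) -> str:
        q = embed(question)
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        context = "\n---\n".join(docs[i] for i in np.argsort(sims)[::-1][:k])
        r = requests.post(f"{OLLAMA}/api/generate", json={
            "model": "llama3.1:8b",
            "prompt": f"Answer using only this context:\n{context}\n\nQ: {question}",
            "stream": False,
        })
        return r.json()["response"]

    print(answer("What did the first snippet say?"))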

replies(1): >>41899615 #
21. cma No.41861443
They are much more flexible; you can, e.g., edit the system's own responses rather than waste context on telling it a correction.
22. TechDebtDevin No.41861683{5}
Well, considering it probably takes several hundred GB of VRAM to run inference for Claude, it's going to be a while.

But yes, like the guy above said, it's really only helpful for one-line commands, like if I forgot some sort of flag that's available for a certain type of command, or random things I don't work with often enough to memorize their little build commands, etc. It's not helpful for programming, just simple commands.

It can also help make unstructured or messy data more readable, although there's potential to hallucinate if the context is at all large.

23. xyc No.41863457{3}
Things like grabbing some markdown text and asking for a pip/npm install one-liner, or quick JS scripts to paste into the console (when I didn't bother to open an editor). A fun use case was randomly drawing lucky winners for the app giveaway from Reddit usernames. Mostly it's converting unstructured text into short/one-liner executable scripts, which doesn't require much intelligence. For more complex automation/scripts that I'll save for later, I do resort to providers (Cursor with Sonnet 3.5, mostly).
24. theodric No.41864562
I've been enjoying fine-tuning various models on various data, for example 17 years of my own tweets, and then just cranking up the temperature and letting the model generate random crap that cracks me up. Is that practical? Is joy practical? I think there's a place for it.
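
The "cranking up the temperature" part is just a sampling option; e.g. against a local Ollama model, where the fine-tuned model's name below is hypothetical:

    # High-temperature sampling from a local model for deliberately unhinged output.
    import requests

    r = requests.post("http://localhost:11434/api/generate", json={
        "model": "my-tweet-tune",         # hypothetical name for a model fine-tuned on tweets
        "prompt": "tweet something",
        "options": {"temperature": 1.8},  # well above the usual ~0.7-0.8 defaults
        "stream": False,
    })
    print(r.json()["response"])
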
25. m0wer No.41890288
TabbyML! Autocompletion like GitHub Copilot, using qwen-2.5-coder 7B.
26. wokwokwok No.41899615{5}
All I can say is that, in my experience, this is the difference between wanting something to be true and it actually being true.

> 8B models are extremely fast and cheap to run

yes.

> Combined with good RAG they can do very well.

This is simply not true. They perform at a level which is useful for simple, trivial tasks.

If you consider that 'doing well', then sure.

However, if, like the parent post, you want to be writing scripts, which is specifically what they asked about... then: heck, what 8B are you using? Because llama 3.1 is shit at it out of the box.

¯\_(ツ)_/¯

A working unit test can take 6 or 7 iterations with a good prompt. Forget writing logic. Creating classes? Using RAG to execute functions from a spec? Forget it.

That's not the level that I need for an assistant.