However, do I need to install the CUDA toolkit on the host?
I haven't installed the CUDA toolkit when using a containerized platform (like Docker).
The Nvidia driver + Nvidia Container Toolkit will do the job. You can check the official instructions at [0]
[0] https://docs.nvidia.com/datacenter/cloud-native/container-to...
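Once the driver and container toolkit are in place, GPU access is just a flag or a compose entry. A minimal sketch in Docker Compose syntax (the service name and image are examples, not from the thread):

```yaml
# docker-compose.yml -- request all host GPUs for one service
services:
  llm:
    image: ollama/ollama   # example image; any CUDA-using image works
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

A quick sanity check is `docker run --rm --gpus all <cuda-image> nvidia-smi`: if the driver and toolkit are set up, the GPU shows up inside the container with no CUDA toolkit installed on the host.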
Edit: I've loaded Llama 3.1 8B Instruct GGUF and got 12.61 tok/sec, and 80 tok/sec for 3.2 3B.
However I've found quality of smaller models to be quite lacking. The Llama 3.2 3B for example is much worse than Gemma2 9B, which is the one I found performs best while fitting comfortably.
Actual sentences are fine, but it doesn't follow prompts as well and it doesn't "understand" the context very well.
Quantization brings down memory cost, but there seems to be a sharp quality decline below 5 bits, so a larger but heavily quantized model usually performs worse, at least with the models I've tried so far.
So with only 6GB of GPU memory I think you either have to accept the hit on inference speed by only partially offloading, or accept fairly low model quality.
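The arithmetic behind that trade-off is straightforward. A rough sketch (the 20% overhead factor is a rule of thumb for KV cache and activations, not a measured number):

```python
# Rough VRAM estimate for a quantized model:
#   params * bits_per_weight / 8, plus ~20% overhead for
#   KV cache and activations (a rough rule of thumb).
def vram_gb(n_params_b: float, bits: float, overhead: float = 1.2) -> float:
    """Estimated memory footprint in GB for n_params_b billion parameters."""
    return n_params_b * bits / 8 * overhead

# A 9B model at 6-bit quantization needs roughly 8 GB -- too big for a
# 6 GB card, so some layers have to stay on the CPU.
print(round(vram_gb(9, 6), 1))  # → 8.1
```

By the same estimate a 3B model at 4 bits fits in under 2 GB, which is why the small models are the comfortable option on a 6 GB card.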
Doesn't mean the smaller models can't be useful, but don't expect ChatGPT 4o at home.
That said, if you've got a beefy CPU, it can be reasonable to have it handle a few of the layers.
Personally, I found Gemma2 9B quantized to 6-bit (IIRC) to be quite useful. YMMV.
Personally, I have some notes and bookmarks that I'd like to scrape, then have an LLM summarize, generate hierarchical tags, and store in a database. For the notes part at least, I wouldn't want to give them to another provider; even for the bookmarks, I wouldn't be comfortable passing my reading profile to anyone.
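As a sketch of that pipeline, assuming a local Ollama server on its default port (the model name, prompts, and table schema here are placeholders, not a tested setup):

```python
import json
import sqlite3
import urllib.request

# Local Ollama endpoint (default port); nothing leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_llm(prompt: str, model: str = "gemma2:9b") -> str:
    """Send a prompt to the local model and return its text response."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def store(conn: sqlite3.Connection, url: str,
          summary: str, tags: list[str]) -> None:
    """Persist one summarized bookmark with its hierarchical tags."""
    conn.execute("CREATE TABLE IF NOT EXISTS bookmarks"
                 " (url TEXT PRIMARY KEY, summary TEXT, tags TEXT)")
    conn.execute("INSERT OR REPLACE INTO bookmarks VALUES (?, ?, ?)",
                 (url, summary, "/".join(tags)))
    conn.commit()

# Sketch of the loop (scraping left out):
# for url, text in scraped_pages:
#     store(conn, url, ask_llm(f"Summarize:\n{text}"),
#           ask_llm(f"Give 3 hierarchical tags for:\n{text}").split())
```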
Testing performance this way, I got about 0.5-1.5 tokens per second with an 8 GB 4-bit quantized model on an old DL360 rack-mount server with 192 GB RAM and two E5-2670 CPUs. I got about 20-50 tokens per second on my laptop with a mobile RTX 4080.
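Measuring throughput like this is simple to do yourself. A small helper, assuming any streaming API that yields tokens one at a time (the generator interface here is hypothetical, not a specific library's):

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Time a streaming generation call and report throughput.
    `generate` is any callable that yields tokens one at a time,
    e.g. a streaming llama.cpp binding (hypothetical here)."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in generate(prompt))
    return n_tokens / (time.perf_counter() - start)
```

Wrapping the timer around the whole stream (rather than per token) includes prompt-processing time, so short prompts with long outputs give the most representative numbers.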
This is one of the reasons why I recently added a floating chat to https://recurse.chat/ to quickly access local LLMs.
Here's a demo: https://x.com/recursechat/status/1846309980091330815
It was so easy to get other non-AI stuff running!
You can find plenty of uncensored LLM models here:
[1]: I personally suspect that many LLMs are still trained on WebText, derivatives of WebText, or using synthetic data generated by LLMs trained on WebText. This might be why they feel so "censored":
>WebText was generated by scraping only pages linked to by Reddit posts that had received at least three upvotes prior to December 2017. The corpus was subsequently cleaned
The implications of so much AI trained on content upvoted by 2015-2017 redditors is not talked about enough.
It reminds a bit of making web sites with a page builder. Easy to install and click around to get something running without thinking too much about it fairly quickly.
Problems are quite similar also, training wheels getting stuck in the woods more easily, hehe.
That being said, I think the more straightforward approach would be to utilize an existing library like https://github.com/collabora/WhisperLive/ within a Docker container. This way, you can call it via WebSocket and integrate it with your LLM, which could also serve as a nice feature in your product.
I've actually been playing around with speech-to-text recently. Thanks for the pointer; Docker is a bit too heavy to deploy for a desktop app use case, but it's good to know about the repo. Building binaries with PyInstaller could be an option though.
Real-time transcription seems a bit complicated since it involves VAD, so a feasible path for me is to first ship simple transcription with whisper.cpp. large-v3-turbo looks fast enough :D
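For the simple-transcription path, shelling out to the whisper.cpp CLI is enough. A sketch that builds the invocation (flags follow whisper.cpp's example `main` program: `-m` model, `-f` input WAV, `-otxt` text output; the binary and model paths are placeholders for your build):

```python
import subprocess

def whisper_cmd(audio_path: str,
                model_path: str = "models/ggml-large-v3-turbo.bin",
                binary: str = "./main") -> list[str]:
    """Build a whisper.cpp CLI call that writes <audio_path>.txt.
    Paths are placeholders; point them at your own build and model."""
    return [binary, "-m", model_path, "-f", audio_path, "-otxt"]

# subprocess.run(whisper_cmd("note.wav"), check=True)  # writes note.wav.txt
```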
Given a 16 GB system with CPU inference only, I'm hosting Gemma2 9B at Q8 for LLM tasks and SDXL Turbo for image work, and besides the memory usage creeping up for a second or so while I invoke a prompt, they're basically undetectable in the background.
Self-hosting static or almost-static websites is now really easy with a Cloudflare front. I just closed my account on SmugMug and published my images locally using my NAS; this costs basically no extra money since the photos were already on the NAS, and the NAS is already powered on 24/7.
The NAS I use is an Asustor, so it's not really Linux and you can't install whatever you want on it, but it has Apache, Python, and PHP with the SQLite extension, which is more than enough for basic websites.
Cloudflare free is like magic. Response times are near instantaneous and setup is minimal. You don't even have to configure an SSL certificate locally, it's all handled for you and works for wildcard subdomains.
And of course if one puts a real server behind it, like in the post, anything's possible.
Depends on the model, but in general, no.
...but it's fine for simple one-liner commands like "how do I revert my commit?" or "rename these files to camelcase".
> How often does it fail?
Immediately and constantly if you ask anything hard.
An 8B model is not ChatGPT. The 3B model in the OP is not ChatGPT.
The capability gap compared to Sonnet/4o is like a potato versus a car.
Search for 'LLM Leaderboard' and you can see for yourself. The 8b models do not even rank. They're generally not capable enough to use as a self hosted assistant.
But I haven't yet found any "uncensored" ones (on ollama) that work. Did I miss something?
(On the contrary: when ChatGPT first came out, it was trivial to jailbreak it to make it write erotica.)
I have a VPN on a Raspberry Pi, and with that I can connect to my self-hosted cloud, dev/staging servers for projects, GitLab, etc. when I'm not on my home network.
I use a Tesla P4 for ML stuff at home; it's roughly equivalent to a 1080 Ti and has a compute capability of 6.1. A 2070 (they don't list the "Super") is a 7.5.
For reference, 4060 Ti, 4070 Ti, 4080 and 4090 are 8.9, which is the highest score for a gaming graphics card.
I tried gemma-2-27b-it-Q4_K_L but it's not as good, despite being larger.
Using llama.cpp and models from here[1].
Cloudflare is pretty strict about the HTML-to-media ratio and might suspend or terminate your account if you are serving too many images.
I've read far too many horror stories about this on HN alone, so please make sure what you're doing is allowed by their ToS.
e.g. is running a personal photography website OK?
PS: talking about Cloudflare being snappy when content is being served from an Asustor NAS made me chuckle.
Take a look at whether Cloudflare Pages + Cloudflare R2 meets the needs of your site.
I'd also recommend using Cloudflare Tunnels (under Zero Trust) rather than punching a hole in your firewall, for a number of reasons.
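With a tunnel, the NAS makes an outbound connection to Cloudflare and no inbound port is exposed at all. A minimal `cloudflared` config sketch (the hostname, port, and paths are examples):

```yaml
# ~/.cloudflared/config.yml -- route a public hostname to a local server
tunnel: <TUNNEL-UUID>                  # from `cloudflared tunnel create`
credentials-file: /home/user/.cloudflared/<TUNNEL-UUID>.json
ingress:
  - hostname: photos.example.com
    service: http://localhost:8080     # your NAS web server
  - service: http_status:404           # required catch-all rule
```

Then `cloudflared tunnel run` keeps the connection up, and DNS for the hostname is pointed at the tunnel rather than your home IP.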
The images that are published are low-res versions copied to a directory on a partition accessible to the web server.
This is not the safest solution, as it does punch a hole in the LAN... It's kind of an experiment... We'll see how it goes.
This is not true. On benchmarks, maybe, but I find the LLM Arena more accurately accounts for the subjective experience of using these things, and Llama 3.1 8B ranks relatively high, outperforming GPT-3.5 and certain iterations of 4.
Where the 8Bs do struggle is that they don't have as deep a repository of knowledge, so using them without some form of RAG won't get you as good results as using a plain larger model. But frankly I'm not convinced that RAG-free chat is the future anyway, and 8B models are extremely fast and cheap to run. Combined with good RAG they can do very well.
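The RAG pattern itself is just retrieval plus prompt assembly. A toy sketch (keyword overlap stands in for real vector embeddings, which any serious setup would use):

```python
from collections import Counter

def score(query: str, doc: str) -> int:
    """Toy keyword-overlap score -- a stand-in for embedding similarity."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Basic RAG pattern: prepend the top-k retrieved snippets so a small
    model answers from context instead of its limited built-in knowledge."""
    top = sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]
    return ("Answer using only this context:\n" + "\n".join(top)
            + f"\n\nQuestion: {query}")
```

The point is that the 8B model only has to read and synthesize the retrieved context, which plays to its strengths, rather than recall facts from its weights.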
But yes, like the guy above said, it's really only helpful for one-line commands. Like if I forgot some sort of flag that's available for a certain type of command. Or random things I don't work with often enough to memorize their little build commands, etc. It's not helpful for programming, just simple commands.
It also can help with unstructured or messy data to make it more readable, although there's potential to hallucinate if the context is at all large.
> 8B models are extremely fast and cheap to run
yes.
> Combined with good RAG they can do very well.
This is simply not true. They perform at a level which is useful for simple, trivial tasks.
If you consider that 'doing well', then sure.
However, if, like the parent post, you want to be writing scripts, which is specifically what they asked... then: heck, what 8B are you using, because llama 3.1 is shit at it out of the box.
¯\_(ツ)_/¯
A working unit test can take 6 or 7 iterations with a good prompt. Forget writing logic. Creating classes? Using RAG to execute functions from a spec? Forget it.
That's not the level that I need for an assistant.