On both I have set up lemonade-server to start with the system. At work I use Qwen3 Coder 30B-A3B with continue.dev. It serves me well in 90% of cases.
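In case anyone wants to wire the two together, this is roughly what the continue.dev side looks like. Treat it as a sketch: the model id and the apiBase port/path are assumptions on my part, use whatever lemonade-server actually prints on startup.

```yaml
# fragment of ~/.continue/config.yaml -- sketch, assuming lemonade-server's
# OpenAI-compatible endpoint is at localhost:8000/api/v1 (check your startup log)
models:
  - name: qwen3-coder-local
    provider: openai                  # generic OpenAI-compatible provider
    model: Qwen3-Coder-30B-A3B-GGUF   # whatever model id your server exposes
    apiBase: http://localhost:8000/api/v1
    roles:
      - chat
      - edit
```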
At home I have 128 GB RAM, and I've been trying gpt-oss 120B a bit. I host Open WebUI on it and connect via HTTPS and WireGuard, so I can use it as a PWA on my phone. I love not needing to think about where my data goes. But I would like to allow parallel requests, so I need to tinker a bit more. Maybe llama-swap is enough (a config sketch below).
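For the parallel-requests part, llama-swap might indeed be enough: it sits in front as an OpenAI-compatible proxy and starts/stops llama-server instances on demand. A minimal sketch, assuming the `cmd`/`${PORT}` convention from its README (paths and model names are placeholders):

```yaml
# llama-swap config sketch -- one entry per model it may spin up
models:
  "gpt-oss-120b":
    cmd: |
      llama-server --port ${PORT}
      -m /models/gpt-oss-120b.gguf
      -c 32768 --parallel 2    # one server, two concurrent request slots
    ttl: 300                   # optional: unload after 5 min idle
  "qwen3-coder":
    cmd: |
      llama-server --port ${PORT}
      -m /models/qwen3-coder-30b-a3b-q4_k_m.gguf
```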
I just need to figure out how to deal with context length. My models stop or go into infinite loops after some messages, but then I often just start a new chat.
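If you end up talking to llama-server directly, the knobs are `-c`/`--ctx-size` and `--parallel`. One gotcha: the total context is split across the parallel slots. A sketch (model path and sizes are made up):

```sh
# 65536 total context shared by 2 slots => 32768 tokens per conversation;
# once a chat outgrows its slot, output can degrade, loop, or cut off
llama-server -m /models/qwen3-coder-30b-a3b-q4_k_m.gguf \
  --ctx-size 65536 --parallel 2 --host 0.0.0.0 --port 8080
```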
Lemonade-server runs on llama.cpp; vLLM seems to scale better, though, but it is not as easy to set up.
Unsloth GGUFs are a great resource for models.
Also, for Strix Halo, check out the kyuz0 repositories! They cover image gen too. I haven't tried those yet, but the benchmarks are awesome, and there's lots to learn from them. The Framework forum can be useful, too.
https://github.com/kyuz0/amd-strix-halo-toolboxes Also nice: https://llm-tracker.info/ It links to a benchmark site that lists models by size. I prefer such resources, since they make it easy to see which ones fit in my RAM (even though I have this silly rule of thumb: 1 billion parameters ≈ 1 GB RAM).
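That rule of thumb is basically exact at 8-bit quantization (1 byte per weight); lower quants scale down from there. A quick back-of-the-envelope, with illustrative numbers:

```python
# Rough memory estimate for a quantized model:
# params (billions) * bits per weight / 8 = GB for the weights alone.
# The KV cache comes on top and grows with context length.
def approx_weights_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8

for name, params in [("30B", 30.0), ("120B", 120.0)]:
    print(name,
          f"~{approx_weights_gb(params, 8.0):.0f} GB at Q8,",
          f"~{approx_weights_gb(params, 4.5):.0f} GB at ~Q4_K_M")
```

So "billions of parameters ≈ GB of RAM" holds at Q8, and a Q4-ish quant roughly halves it.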
Btw, even an AMD HX 370 with non-soldered RAM can get some nice t/s on smaller models. That can be helpful enough when you're disconnected from the internet and don't know how to style an SVG :)
Thanks for opening up this topic! Lots of food for thought :)