To actually use a model, you need VRAM for the context window on top of the weights. Realistically, you'll want a 20GB GPU or larger, depending on how many tokens of context you need.
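As rough math (numbers below are for a hypothetical 8B-class model with grouped-query attention; check your model's actual config), the KV cache costs 2 (K and V) x layers x KV heads x head dim x bytes per element for every token of context:

    # hypothetical 8B-class model: 32 layers, 8 KV heads, head dim 128, f16 cache
    echo $(( 2 * 32 * 8 * 128 * 2 ))        # 131072 bytes = 128 KiB per token
    echo $(( 2 * 32 * 8 * 128 * 2 * 8192 )) # ~1 GiB of KV cache at 8192 context

So 8k of context alone is around a gigabyte on top of the weights, and older models without grouped-query attention can need several times that.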
Prompt:     10 tokens in 229.089 ms (43.7 t/s)
Generation: 41 tokens in 959.412 ms (42.7 t/s)
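If you want comparable figures on your own hardware, llama.cpp ships a bench tool that measures prompt processing and generation separately; the model path and token counts here are just placeholders:

    # time 512 tokens of prompt processing and 128 tokens of generation
    ./llama-bench -m model.gguf -p 512 -n 128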
For some reason it only uses around 7GB of VRAM, probably due to how the layers are scheduled. I could probably tweak something there, but I didn't bother just for testing.
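If it's leaving layers on the CPU, the knob in llama.cpp is -ngl / --n-gpu-layers; a sketch, with the model path and counts as placeholders:

    # offload up to 99 layers (i.e. everything) to the GPU; lower it if VRAM runs out
    ./llama-cli -m model.gguf -c 8192 -ngl 99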
Obviously, perf depends on CPU, GPU and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.
Best is two or more low-end 16GB GPUs, for 32GB of total VRAM; that's enough to run most of the better local models.
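llama.cpp will split a model across multiple GPUs on its own, and --tensor-split sets the ratio if one card should take more than the other; the values below are illustrative:

    # split roughly 50/50 across two 16GB cards
    ./llama-cli -m model.gguf -ngl 99 --tensor-split 1,1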
If you want a bit more context in the same VRAM, try -ctk q8_0 -ctv q8_0 (llama.cpp's --cache-type-k / --cache-type-v options) to quantize the KV cache.
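Putting it together, something like this (context size is just an example, and on the builds I've used, quantizing the V cache also wants flash attention enabled):

    # q8_0 K/V caches take roughly half the memory of the default f16 cache
    ./llama-cli -m model.gguf -ngl 99 -fa -c 16384 -ctk q8_0 -ctv q8_0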
Also, an imatrix GGUF quant like IQ4_XS might be smaller with better quality than a plain Q4 quant of the same model.
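If you want to roll such a quant yourself, the rough flow with llama.cpp's tools is below; file names are placeholders and the tool names have moved around between versions, so check the current docs:

    # build an importance matrix from calibration text, then quantize with it
    ./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat
    ./llama-quantize --imatrix imatrix.dat model-f16.gguf model-iq4_xs.gguf IQ4_XS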