
602 points emrah | 4 comments
trebligdivad ◴[] No.43744014[source]
It seems pretty impressive - I'm running it on my CPU (16-core AMD 3950x) and it's very, very impressive at translation, and the image description is impressive as well. I'm getting about 2.3 tokens/s on it (compared to under 1/s on the Calme-3.2 I was previously using). It does tend to be a bit chatty unless you tell it not to be; it'll give you a 'breakdown' of pretty much everything unless told otherwise - so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.
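
If you want to script that prompt rather than type it each time, here's a minimal sketch against an OpenAI-compatible local endpoint; the port, model name, and example input are assumptions - point it at whatever server (e.g. llama.cpp's llama-server or Ollama) is actually hosting the model:

    from openai import OpenAI

    # Assumed local server; adjust base_url and model name to your setup.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    resp = client.chat.completions.create(
        model="gemma-3-27b-it",  # placeholder model id
        messages=[
            # The system message suppresses the chatty 'breakdown' behaviour.
            {"role": "system",
             "content": "Translate the input to English, only output the translation"},
            {"role": "user", "content": "Bonjour, comment ça va ?"},  # example input
        ],
    )
    print(resp.choices[0].message.content)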
replies(2): >>43744070 #>>43747653 #
Havoc ◴[] No.43747653[source]
The upcoming Qwen3 series is supposed to be MoE... likely to give better tok/s on CPU
replies(1): >>43748355 #
1. slekker ◴[] No.43748355[source]
What's MoE?
replies(2): >>43748381 #>>43749736 #
2. zamalek ◴[] No.43748381[source]
Mixture of Experts. Very broadly speaking, there are a bunch of mini networks (experts) which can be independently activated.
3. Havoc ◴[] No.43749736[source]
Mixture of experts, like the other guy said - everything gets loaded into memory, but not every byte is needed to generate a token (unlike classic dense LLMs like Gemma).

So for devices that have lots of memory but weaker processing power, it can get you similar output quality but faster. It tends to do better on CPU- and APU-style setups.
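
Roughly what that looks like in code - a toy sketch (shapes, sizes, and routing details are made up for illustration) where a small gating network picks the top-k experts per token, so only a fraction of the weight bytes are actually read on each forward pass:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff, n_experts, top_k = 64, 256, 8, 2

    # One weight matrix per expert; all of them sit in memory...
    experts = [rng.standard_normal((d_model, d_ff)) * 0.02 for _ in range(n_experts)]
    gate_w = rng.standard_normal((d_model, n_experts)) * 0.02

    def moe_layer(x):
        # x: (d_model,) hidden state for a single token
        logits = x @ gate_w                     # router score per expert
        chosen = np.argsort(logits)[-top_k:]    # ...but only the top_k experts run
        weights = np.exp(logits[chosen])
        weights /= weights.sum()                # softmax over the chosen experts
        out = np.zeros(d_ff)
        for w, idx in zip(weights, chosen):
            out += w * (x @ experts[idx])       # only these matmuls touch weight bytes
        return out

    print(moe_layer(rng.standard_normal(d_model)).shape)  # (256,)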

replies(1): >>43754539 #
4. trebligdivad ◴[] No.43754539[source]
I'm not even sure they're loading everything into memory for MoE; maybe they can get away with only the relevant experts being paged in.
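
That's essentially what memory-mapping the weights would buy you - a toy sketch (file name and shapes invented) where the expert tensors live in an mmap'd file and only the slices the router actually touches get paged in by the OS; whether a particular runtime's loader behaves this nicely per-expert is a separate question:

    import numpy as np

    # Write all "expert" weights to disk once (toy sizes)...
    all_experts = np.random.default_rng(0).standard_normal((8, 64, 256)).astype(np.float32)
    np.save("experts.npy", all_experts)

    # ...then memory-map instead of loading; nothing is read into RAM yet.
    experts = np.load("experts.npy", mmap_mode="r")

    # Reading only the experts the router picked means the OS only pages in
    # those slices of the file, not the whole parameter set.
    chosen = [1, 5]
    x = np.random.default_rng(1).standard_normal(64).astype(np.float32)
    outputs = [x @ np.asarray(experts[i]) for i in chosen]
    print(len(outputs), outputs[0].shape)  # 2 (256,)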