Lots of model releases are like this. We can only upvote.
We can't run the model on our personal computers.
Nor can we test their 'Efficient Attention' concept ourselves.
Honestly, it would take 24 hours just to download the 98 GB model if I wanted to try it out (assuming I even had a card with 98 GB of VRAM).
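For what it's worth, a quick back-of-envelope check of that download estimate (the connection speed is my own assumption, not anything from the release):

```python
# Rough sanity check: what sustained link speed does "98 GB in 24 hours" imply?
model_size_gb = 98               # model download size in decimal gigabytes
hours = 24                       # claimed download time

bits = model_size_gb * 8 * 1e9   # total bits to transfer
seconds = hours * 3600
mbps = bits / seconds / 1e6      # required sustained speed in Mbit/s

print(f"~{mbps:.1f} Mbit/s sustained to pull {model_size_gb} GB in {hours} h")
# -> roughly 9 Mbit/s, i.e. a fairly modest home connection
```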
With Qwen3-30B-A3B (Q8) I'm getting 10-20 t/s in KoboldAI (i.e., llama.cpp-based). That's faster than I can read, so it's good enough for hobby use. I expect this model to be significantly faster, but llama.cpp-based software probably doesn't support it yet.
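And just to justify "faster than I can read", here's the ballpark comparison; the reading speed and tokens-per-word ratio are assumptions on my part:

```python
# Compare generation speed to a typical reading speed.
tok_per_sec = (10, 20)      # generation range I'm seeing
words_per_min = 250         # assumed average reading speed
tokens_per_word = 1.3       # rough English tokenization ratio

reading_tok_per_sec = words_per_min * tokens_per_word / 60
print(f"reading ~{reading_tok_per_sec:.1f} tok/s vs. generation {tok_per_sec} tok/s")
# -> ~5.4 tok/s reading, so even the low end of 10 tok/s outpaces it
```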