    210 points blackcat201 | 11 comments
    1. textembedding No.45769528
    125 upvotes with 2 comments is kinda sus
    replies(3): >>45769778 #>>45770249 #>>45770284 #
    2. muragekibicho No.45769778
    Lots of model releases are like this. We can only upvote. We can't run the model on our personal computers, nor can we test their 'Efficient Attention' concept on them.

    Honestly, it would take 24 hours just to download the 98 GB model if I wanted to try it out (assuming I had a card with 98 GB of VRAM).
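
    A quick sanity check on that figure (rough arithmetic, taking the 98 GB and 24 hours above at face value):

        # What sustained link speed does "98 GB in 24 hours" imply?
        size_bits = 98e9 * 8          # 98 GB expressed in bits
        seconds = 24 * 3600
        print(f"~{size_bits / seconds / 1e6:.0f} Mbit/s")   # ~9 Mbit/s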

    replies(3): >>45770625 #>>45771629 #>>45771703 #
    3. WhereIsTheTruth No.45770249
    The Chinese century ain't gonna build itself /s
    4. actionfromafar No.45770284
    I'm hoping someone will explain what this release even means.
    replies(1): >>45770892 #
    5. danielbln No.45770625
    You have a 9 Mbit downlink? I'm not sure you're trying much of anything with that Internet connection, no offense.
    6. lhl No.45770892
    We do live in an age of frontier LLMs... For fun, I'll just use Kimi K2 (on Kagi Assistant).

    > Can you explain what this means and its significance? Assume that I'm a layperson with no familiarity with LLM jargon so explain all of the technical terms, references, names. https://github.com/MoonshotAI/Kimi-Linear

    Imagine your brain could only “look at” a few words at a time when you read a long letter. Today’s big language models (the AI that powers chatbots) have the same problem: the longer the letter gets, the more scratch paper they need to keep track of it all. That scratch paper is called the “KV cache,” and for a 1 000 000-word letter it can fill a small library.
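
    To make the "small library" concrete, here is a rough back-of-the-envelope in Python. The layer count, head count and head size below are made-up but typical-looking numbers, not Kimi Linear's actual configuration; the point is only that the scratch paper grows with every word you feed in:

        # Rough KV-cache sizing: each token stores a key and a value vector per layer.
        # All hyperparameters here are illustrative, NOT the model's real config.
        layers = 48           # transformer layers
        kv_heads = 8          # key/value heads
        head_dim = 128        # size of each head
        bytes_per_value = 2   # fp16/bf16

        def kv_cache_bytes(context_tokens: int) -> int:
            per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2x: keys + values
            return per_token * context_tokens

        for n in (4_000, 128_000, 1_000_000):
            print(f"{n:>9} tokens -> {kv_cache_bytes(n) / 1e9:5.1f} GB of cache")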

    Kimi Linear is a new way for the AI to read and write that throws away most of that scratch paper yet still understands the letter. It does this by replacing the usual “look at every word every time” trick (full attention) with a clever shortcut called linear attention. The shortcut is packaged into something they call Kimi Delta Attention (KDA).
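
    A toy numpy sketch of the difference (this is the generic linear-attention idea, not KDA itself; shapes and the feature map are deliberately simplistic):

        import numpy as np

        d = 64                              # head dimension (illustrative)
        phi = lambda x: np.maximum(x, 0.0)  # toy feature map; real models use better ones

        def full_attention_step(q, K, V):
            # Full attention: compare the new query against EVERY stored key.
            # K and V keep growing, so memory and work grow with the text length.
            scores = K @ q
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            return weights @ V

        class LinearAttentionState:
            # Linear attention: fold each word into a fixed-size running summary,
            # so memory stays constant no matter how long the text gets.
            def __init__(self):
                self.S = np.zeros((d, d))
                self.z = np.zeros(d)

            def update(self, k, v):
                self.S += np.outer(phi(k), v)
                self.z += phi(k)

            def read(self, q):
                return (phi(q) @ self.S) / (phi(q) @ self.z + 1e-6)

        # Toy usage: stream 5 random "words" through both mechanisms.
        rng = np.random.default_rng(0)
        K = rng.standard_normal((5, d)); V = rng.standard_normal((5, d))
        state = LinearAttentionState()
        for k, v in zip(K, V):
            state.update(k, v)
        q = rng.standard_normal(d)
        print(full_attention_step(q, K, V)[:3], state.read(q)[:3])

    DeltaNet-style variants (including KDA) use smarter update rules that can also forget stale information instead of only accumulating it, but the fixed-size running state is the core idea.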

    What the numbers mean in plain English

        51.0 on MMLU-Pro: on a 4 000-word school-test set, the shortcut scores about as well as the old, slow method.
        84.3 on RULER at 128 000 words: on a much longer test it keeps the quality high while running almost four times faster.
        6 × faster TPOT (time per output token): when the AI is writing its reply, each new word appears up to six times sooner than with the previous best shortcut (MLA, multi-head latent attention).
        75 % smaller KV cache: the scratch paper is only one-quarter the usual size, so you can fit longer conversations in the same memory.
    
    Key pieces explained

        Full attention: the old, accurate but slow “look back at every word” method.
        KV cache: the scratch paper that stores which words were already seen.
        Linear attention: a faster but traditionally weaker way of summarising what was read.
        Gated DeltaNet: an improved linear attention trick that keeps the most useful bits of the summary.
        Kimi Delta Attention (KDA): Moonshot’s even better version of Gated DeltaNet.
        Hybrid 3:1 mix: three layers use the fast KDA shortcut, one layer still uses the old reliable full attention, giving speed without losing smarts (see the sketch after this list).
        48 B total, 3 B active: the model has 48 billion total parameters but only 3 billion “turn on” for any given word, saving compute.
        Context length 1 M: it can keep track of about 1 000 000 words in one go—longer than most novels.
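
    To see where the "75 % smaller KV cache" figure comes from, here is a toy calculation based on the 3:1 layout (the layer count is made up; only the ratio matters):

        # Hypothetical layer layout: 3 KDA layers for every 1 full-attention layer.
        total_layers = 48
        pattern = ["KDA", "KDA", "KDA", "full"] * (total_layers // 4)

        full_layers = pattern.count("full")   # only these keep a per-token KV cache
        kda_layers = pattern.count("KDA")     # these keep a small fixed-size state instead

        baseline = total_layers               # pure full attention: every layer caches K/V per token
        hybrid = full_layers                  # hybrid: only 1 layer in 4 does

        print(f"{kda_layers} KDA layers, {full_layers} full-attention layers")
        print(f"per-token cache vs. baseline: {hybrid / baseline:.0%}")   # -> 25%, i.e. 75% smaller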
    
    Bottom line: Kimi Linear lets an AI read very long documents or hold very long conversations with far less memory and much less waiting time, while still giving answers as good as—or better than—the big, slow models we use today.
    7. Der_Einzige No.45771629
    People here can absolutely afford the ~$2/hour cloud rental cost of an H100, or even eight of them (OCI has cheap H100 nodes). Most people are too lazy to even try, and thank goodness for that, because I prefer my very high salary as someone who isn't too lazy to spin up a cloud instance.
    replies(1): >>45772029 #
    8. samus No.45771703
    We very much can, especially a Mixture-of-Experts model like this one with only 3B activated parameters.

    With an RTX 3070 (8 GB VRAM), 32 GB of RAM and an SSD I can run such models at speeds tolerable for casual use.
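
    For a rough idea of why that works with a 48B-total / 3B-active model (estimates only, assuming something like a 4-bit quantization):

        # Back-of-the-envelope memory math for a 48B-total / 3B-active MoE.
        total_params = 48e9
        active_params = 3e9
        bytes_per_param = 0.55   # ~4.4 bits/param, roughly a Q4-style quant

        print(f"whole model: ~{total_params * bytes_per_param / 1e9:.0f} GB (sits in RAM/SSD)")
        print(f"weights touched per token: ~{active_params * bytes_per_param / 1e9:.1f} GB")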

    replies(1): >>45772009 #
    9. embedding-shape No.45772009{3}
    How many tok/s are you getting (with any runtime) with either Kimi-Linear-Instruct or Kimi-Linear-Base on your RTX 3070?
    replies(1): >>45776165 #
    10. embedding-shape No.45772029{3}
    Not to mention some of us have enough disposable income to buy an RTX Pro 6000 so we can run our stuff locally and finally scale up our model training a little bit.
    11. samus No.45776165{4}
    With Qwen3-30B-A3B (Q8) I'm getting 10-20 t/sec on KoboldAI (i.e., llama.cpp under the hood). Faster than I can read, so good enough for hobby use. I expect this model to be significantly faster, but llama.cpp-based software probably doesn't support it yet.