Extending the context length to 1M tokens

(qwenlm.github.io)

116 points cmcconomy | 5 comments | 18 Nov 24 16:27 UTC

aliljet ◴[18 Nov 24 18:05 UTC] No.42175062[source]▶

This is fantastic news. I've been using Qwen2.5-Coder-32B-Instruct with Ollama locally and it's honestly such a breath of fresh air. I wonder if any of you have had a moment to try this newer context length locally?

BTW, I can't run this effectively on my 2080 Ti, so I've just loaded up the machine with plain system RAM. It's not going to win any races, but as they say, it's not the speed that matters, it's the quality of the effort.

replies(3): >>42175226 #>>42176314 #>>42177831 #

1. notjulianjaynes ◴[18 Nov 24 18:17 UTC] No.42175226[source]▶

>>42175062 #

Hi, are you able to use Qwen's 128k context length with Ollama? Using AnythingLLM + Ollama and a GGUF version, I kept getting an error message with prompts longer than 32,000 tokens. (I'm summarizing long transcripts.)

replies(1): >>42175335 #

2. syntaxing ◴[18 Nov 24 18:25 UTC] No.42175335[source]▶

>>42175226 (TP) #

The famous Daniel Han (the same person who made Unsloth and fixed Gemma/Llama bugs) mentioned something about this on Reddit and offered a fix. https://www.reddit.com/r/LocalLLaMA/comments/1gpw8ls/bug_fix...

replies(2): >>42175727 #>>42175742 #

3. zargon ◴[18 Nov 24 19:03 UTC] No.42175727[source]▶

>>42175335 #

After reading a lot of that thread, my understanding is that YaRN scaling is intentionally disabled by default in the GGUFs, because it would degrade outputs for contexts that do fit in 32k. So the only change is enabling YaRN scaling at 4x, which is just a configuration setting. GGUF embeds these configuration settings in the file format for ease of use, but you should be able to override them without downloading an entire duplicate set of weights (12 to 35 GB!). (It looks like in llama.cpp the --override-kv option can be used for this, but I haven't tried it yet.)
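
For anyone who wants to try this, here is a rough sketch of the arithmetic and a command line that turns YaRN on at load time. It does not use the override-kv route mentioned above; instead it assumes llama.cpp's runtime RoPE flags (`--rope-scaling`, `--rope-scale`, `--yarn-orig-ctx`, `-c`), which should be checked against `llama-cli --help` for your build. The model filename is hypothetical, and the 32768 native context is taken from the 32k figure in this thread.

```python
import shlex


def yarn_args(model_path, native_ctx=32768, factor=4):
    """Build a llama.cpp argument list that enables YaRN scaling.

    Assumption: the flag names below match your llama.cpp build.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    # 4x scaling of a 32k native window gives the advertised 128k context.
    target_ctx = native_ctx * factor
    return [
        "llama-cli",
        "-m", model_path,
        "-c", str(target_ctx),              # requested context size
        "--rope-scaling", "yarn",           # scaling method
        "--rope-scale", str(factor),        # scaling factor
        "--yarn-orig-ctx", str(native_ctx), # context the model was trained on
    ]


# Hypothetical local file name, used only for illustration.
args = yarn_args("qwen2.5-coder-32b-instruct-q5_k_m.gguf")
print(shlex.join(args))
```

Leaving the flags off keeps the default behaviour, so prompts that fit in 32k are not affected by the scaling.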

replies(1): >>42175997 #

4. notjulianjaynes ◴[18 Nov 24 19:05 UTC] No.42175742[source]▶

>>42175335 #

Yeah, unfortunately that's the exact model I'm using (Q5 version). What I've been doing is first loading the transcript into the vector database, and then giving it a prompt that's like "summarize the transcript below: <full text of transcript>". This works surprisingly well, except for one transcript of a 3-hour meeting that, per an online calculator, came to about 38,000 tokens. Cutting the text up into 3 parts and pretending each was a separate meeting* led to a bunch of hallucinations for some reason.

*In theory this shouldn't matter much for my purpose of summarizing city council meetings that follow a predictable format.
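
A minimal sketch of the splitting step, in case it helps anyone doing the same thing. It cuts on paragraph boundaries so that each piece stays under a token budget. The 4-characters-per-token ratio is only a rough assumption (a real tokenizer would be more accurate), and the function name and defaults are made up for illustration.

```python
def split_by_token_budget(text, max_tokens=8000, chars_per_token=4):
    """Split text into chunks of at most max_tokens * chars_per_token characters.

    Paragraphs (separated by blank lines) are kept whole where possible;
    a single paragraph longer than the budget is cut at the character limit.
    """
    max_chars = max_tokens * chars_per_token
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Hard-split any paragraph that alone exceeds the budget.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            added = len(piece) + (2 if current else 0)  # 2 = "\n\n" separator
            if current and size + added > max_chars:
                chunks.append("\n\n".join(current))
                current, size = [], 0
                added = len(piece)
            current.append(piece)
            size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Toy transcript: 50 paragraphs of about 500 characters each.
transcript = "\n\n".join(["word " * 100] * 50)
parts = split_by_token_budget(transcript, max_tokens=500)
for i, part in enumerate(parts, start=1):
    print(f"part {i} of {len(parts)}: {len(part)} characters")
```

One thing that might be worth trying is telling the model each piece is "part i of n" of the same meeting, then summarizing the per-part summaries, rather than presenting the parts as separate meetings. I can't say whether that fixes the hallucinations.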

5. syntaxing ◴[18 Nov 24 19:30 UTC] No.42175997{3}[source]▶

>>42175727 #

Oh super interesting, I didn’t know you can override this with a flag on llama.cpp.
