    210 points vincirufus | 15 comments
    1. chisleu ◴[] No.45145959[source]
    I've been using GLM 4.5 and GLM 4.5 Air for a while now. The Air model is light enough to run on a MacBook Pro and is useful for Cline. I can run the full GLM model on my Mac Studio, but the TPS is so slow that it's only useful for chatting. So I tried hooking up OpenRouter instead, but didn't have the same success. All of the open-weight models I try through OpenRouter give substandard results. I get better results from Qwen 3 Coder 30B A3B locally than I get from Qwen 3 Coder 480B through OpenRouter.

    I'm really concerned that some of the providers are using quantized versions of the models so they can run more models per card and larger batches of inference.

    replies(3): >>45145970 #>>45146999 #>>45149106 #
    2. vincirufus ◴[] No.45145970[source]
    Yeah, I too have heard similar concerns about open models on OpenRouter, but I haven't been able to verify them, as I don't use it a lot
    replies(1): >>45146140 #
    3. numlocked ◴[] No.45146140[source]
    (OpenRouter COO here) We are starting to test this and verify the deployments. More to come on that front -- but long story short, we don't have good evidence that providers are doing weird stuff that materially affects model accuracy. If you have data points to the contrary, we would love them.

    We are heavily incentivized to prioritize/make transparent high-quality inference and have no incentive to offer quantized/poorly-performing alternatives. We certainly hear plenty of anecdotal reports like this, but when we dig in we generally don't see it.

    An exception is when a model is first released -- for example, this terrific work by Artificial Analysis: https://x.com/ArtificialAnlys/status/1955102409044398415

    It does take providers time to learn how to run the models in a high quality way; my expectation is that the difference in quality will be (or already is) minimal over time. The large variance in that case was because GPT OSS had only been out for a couple of weeks.

    For well-established models, our (admittedly limited) testing has not revealed much variance between providers in terms of quality. There is some, but it's not like we see a couple of providers 'cheating' by secretly quantizing and clearly serving less intelligent versions of the model. We're going to get more systematic about it, though, and perhaps will uncover some surprises.

    replies(3): >>45146251 #>>45146473 #>>45147084 #
    4. chandureddyvari ◴[] No.45146251{3}[source]
    Unsolicited advice: Why doesn't OpenRouter provide hosting services for OSS models that guarantee non-quantized versions of the LLMs? Would be a win-win for everyone.
    replies(2): >>45146312 #>>45146437 #
    5. jatins ◴[] No.45146312{4}[source]
    In fact, I thought that's what OpenRouter had been doing all along -- hosting them itself
    6. jjani ◴[] No.45146437{4}[source]
    It would make very little business sense at this point -- currently they have an effective monopoly on routing. Hosting would just make them one provider among a few dozen. It would make the other providers less likely to offer their services through OpenRouter. It would come with lots of concerns that OpenRouter would favor routing towards their own offerings. It would be a huge distraction from their core business, which is still rapidly growing. It would need massive capital investment. And there are another thousand reasons I haven't thought of.
    7. indigodaddy ◴[] No.45146473{3}[source]
    So what's the deal with Chutes and all the throttling and errors? It seems like users are losing their minds over this, at least from all the Reddit threads I'm seeing
    replies(1): >>45147544 #
    8. KronisLV ◴[] No.45146999[source]
    > I get better results from Qwen 3 coder 30b a3b locally than I get from Qwen 3 Coder 480b through open router. I'm really concerned that some of the providers are using quantized versions of the models so they can run more models per card and larger batches of inference.

    This doesn't precisely match my experience, but I've definitely had cases where some of the providers had consistently worse output for the same model than others; the solution there was to figure out which ones those were and denylist them in the UI.

    As for quantized versions, you can check them for each model and provider, for example: https://openrouter.ai/qwen/qwen3-coder/providers

    You can see that these providers run FP4 versions:

      * DeepInfra (Turbo)
    
    And these providers run FP8 versions:

      * Chutes
      * GMICloud
      * NovitaAI
      * Baseten
      * Parasail
      * Nebius AI Studio
      * AtlasCloud
      * Targon
      * Together
      * Hyperbolic
      * Cerebras
    
    I will say that it's not all bad and my experience with FP8 output has been pretty decent, especially when I need something done quickly and choose to use Cerebras - provided their service isn't overloaded, their TPS is really, really good.

    You can also request a specific precision on a per-request basis: https://openrouter.ai/docs/features/provider-routing#quantiz... (or just make a custom preset)
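
    For illustration, here's a minimal sketch of what that per-request routing can look like, assuming the "quantizations" and "ignore" fields from the provider-routing docs linked above; the provider slug and prompt are just placeholders:

      import requests

      # Minimal sketch: ask OpenRouter to route this request only to providers
      # serving the model at one of the listed precisions, and skip a provider
      # you've had bad results with ("some-provider" is a hypothetical slug).
      response = requests.post(
          "https://openrouter.ai/api/v1/chat/completions",
          headers={"Authorization": "Bearer YOUR_OPENROUTER_API_KEY"},
          json={
              "model": "qwen/qwen3-coder",
              "messages": [{"role": "user", "content": "Merge two sorted lists in Python."}],
              "provider": {
                  "quantizations": ["fp8", "bf16", "fp16"],
                  "ignore": ["some-provider"],
              },
          },
          timeout=120,
      )
      print(response.json()["choices"][0]["message"]["content"])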

    replies(1): >>45147398 #
    9. blitzar ◴[] No.45147084{3}[source]
    > We ... have no incentive to offer quantized/poorly-performing alternatives

    However your providers do have such an incentive.

    10. snthpy ◴[] No.45147398[source]
    Interesting. Thanks for sharing. What about qwen3-coder on Cerebras? I'm happy to pay the $50 for the speed as long as results are good. How does it compare with glm-4.5?
    replies(1): >>45147831 #
    11. typpilol ◴[] No.45147544{4}[source]
    What's Chutes?
    replies(1): >>45147645 #
    12. arcanemachiner ◴[] No.45147645{5}[source]
    Cheap provider on OpenRouter:

    https://openrouter.ai/provider/chutes

    replies(1): >>45151585 #
    13. KronisLV ◴[] No.45147831{3}[source]
    I wish that Cerebras had a direct pay-per-use API option instead of pushing you towards OpenRouter and HuggingFace (the former sometimes throws 429s, so either the speed is great or there is no speed): https://www.cerebras.ai/pricing -- but I imagine that for most folks their subscription would be more than enough!

    As for how Qwen3 Coder performs, there's always SWE-bench: https://www.swebench.com/

    By the numbers:

      * it sits between Gemini 2.5 Pro and GPT-5 mini
      * it beats out Kimi K2 and the older Claude Sonnet 3.7
      * but loses out to Claude Sonnet 4 and GPT-5
    
    Personally, I find it sufficient for most tasks (from recommendations and questions to as close to vibe coding as I get) on a technical level. GLM 4.5 isn't on the site at the time of writing, but the two should match one another pretty closely. Feeling-wise, I still very much prefer Sonnet 4 to everything else, but it's both expensive and way slower than Cerebras (not even close).

    Update: it also seems like the Growth plan on their page says "Starting from 1500 USD / month", which is a bit silly when the new cheapest subscription is 50 USD / month.

    14. jbellis ◴[] No.45149106[source]
    Quantization matters a lot more than r/locallama wants to believe. Here's Qwen3 Coder vs Qwen3 Coder @fp8: https://brokk.ai/power-ranking?version=openround-2025-08-20&...
    15. typpilol ◴[] No.45151585{6}[source]
    Ahh. Thanks