29 points by reissbaker | 1 comment

Hey HN! We've run our privacy-focused open-source inference company for a while now, and we're launching a flat monthly subscription similar to Anthropic's. It should work with Cline, Roo, KiloCode, Aider, etc — any OpenAI-compatible API client should do. The rate limits at every tier are higher than Claude's, so even if you prefer Claude, it can be a helpful, low-priced backup for when you're rate limited. Let me know if you have any feedback!
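To make "OpenAI-compatible" concrete, here's a rough sketch using the openai Python SDK; the base URL, model ID, and API key below are placeholders, not our actual values:

    # Rough sketch: an OpenAI-compatible client only needs a base_url override.
    # The endpoint, model ID, and key here are placeholders.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.example.com/v1",  # placeholder endpoint
        api_key="YOUR_API_KEY",                 # placeholder key
    )

    resp = client.chat.completions.create(
        model="glm-4.5",  # placeholder model ID
        messages=[{"role": "user", "content": "Refactor this function for me."}],
    )
    print(resp.choices[0].message.content)

Tools like Cline, Roo, and Aider just need the same two settings (base URL and key) in their config.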
logicprog | No.45056913
I was literally just wishing there was something like this; this is perfect! Do you do prompt caching?
replies(1): >>45057067 #
reissbaker | No.45057067
Aw thanks! We don't currently, but from a cost perspective as a user it shouldn't matter much, since it's all bundled into the same subscription (we rate-limit by requests, not by tokens — our request rate limits are set to "higher than the number of messages per hour that Claude Code promises", haha). We might at some point, just to save GPUs though!
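If "rate-limit by requests" sounds hand-wavy: conceptually it's just a rolling per-user counter, no token accounting anywhere. A toy sketch (the cap below is illustrative, not our real quota):

    # Toy sketch of per-user, per-hour request limiting (no token counting).
    # The cap is illustrative, not a real quota.
    import time
    from collections import defaultdict, deque

    REQUESTS_PER_HOUR = 300       # illustrative cap
    _recent = defaultdict(deque)  # user_id -> timestamps of recent requests

    def allow_request(user_id: str) -> bool:
        now = time.time()
        window = _recent[user_id]
        # Evict timestamps older than an hour, then check the cap.
        while window and now - window[0] > 3600:
            window.popleft()
        if len(window) >= REQUESTS_PER_HOUR:
            return False
        window.append(now)
        return True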
replies(1): >>45057965 #
logicprog | No.45057965
Yeah, I wasn't worried so much about the cost to me as about the sustainability of your pricing — don't want to run into a "we're lowering quotas" situation like CC did :P
replies(1): >>45058150 #
reissbaker | No.45058150
Lol fair! I think we're safe for now; our most popular model (and my personal favorite coding model) is GLM-4.5, which fits on a relatively small node compared to the rumored sizes of Anthropic's models. We can throw a lot of tokens at it before running into issues — it's kind of nice to launch without prompt caching, since it means if we're flying too close to the sun on tokens we still have some pretty large levers left to pull on the infra side before needing to do anything drastic with rate limits.
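For anyone who hasn't run into it: prompt caching just means reusing the prefill (KV) state for a prompt prefix you've already processed, instead of recomputing it on the GPU. A toy sketch of the idea, with the actual prefill stubbed out (real engines cache per block, not per whole prefix):

    # Toy illustration of prompt caching: key on a hash of the prompt prefix and
    # reuse the prefill (KV) state on a hit instead of redoing the GPU work.
    # run_prefill is a stand-in for the real (expensive) prefill step.
    import hashlib

    _kv_cache: dict[str, object] = {}

    def run_prefill(tokens: list[int]) -> object:
        # Stub for the actual GPU prefill; returns an opaque KV state.
        return {"kv_for": tuple(tokens)}

    def prefill_with_cache(tokens: list[int]) -> object:
        key = hashlib.sha256(repr(tokens).encode()).hexdigest()
        if key in _kv_cache:
            return _kv_cache[key]  # cache hit: skip the prefill entirely
        state = run_prefill(tokens)
        _kv_cache[key] = state
        return state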
replies(1): >>45058389 #
logicprog | No.45058389
> I think we're safe for now; our most popular model (and my personal favorite coding model) is GLM-4.5,

That's funny, that's my favorite coding model as well!

> the rumored sizes of Anthropic's models

Yeah. I've long had a hypothesis that their models are average-sized for a SOTA model, but fully dense, like that old Llama 3.1 405B model, and that's why their per-token inference costs are insane compared to the competition.
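Back-of-the-envelope, using the rough rule of ~2 FLOPs per active parameter per decoded token, and GLM-4.5's reported ~32B active params (which I'm taking on faith from the model card):

    # Back-of-the-envelope decode cost: ~2 FLOPs per active parameter per token.
    # GLM-4.5's reported shape is ~355B total / ~32B active (MoE); a dense model
    # like Llama 3.1 405B activates every parameter. Numbers are rough.
    def flops_per_token(active_params_billions: float) -> float:
        return 2 * active_params_billions * 1e9

    moe_active = 32  # GLM-4.5 reported active params, in billions
    dense = 405      # Llama 3.1 405B, fully dense

    ratio = flops_per_token(dense) / flops_per_token(moe_active)
    print(f"dense 405B is ~{ratio:.0f}x more compute per decoded token")  # ~13x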

> it's kind of nice to launch without prompt caching, since it means if we're flying too close to the sun on tokens we still have some pretty large levers left to pull on the infra side before needing to do anything drastic with rate limits.

That makes sense.

I'm poor as dirt, and my job actually forbids AI code in the main codebase, so I can't justify even a $20-per-month subscription right now (especially when, for experimenting with agentic coding, qwen code is currently free (if shitty)), but if and when it becomes financially responsible, you will be at the very top of my list.

replies(1): >>45058648 #
reissbaker | No.45058648
<3 thank you!