268 points by Areibman

Hey HN! Tokencost is a utility library for estimating LLM costs. There are hundreds of different models now, and they all have their own pricing schemes. It’s difficult to keep up with the pricing changes, and it’s even more difficult to estimate how much your prompts and completions will cost until you see the bill.

Tokencost works by counting the number of tokens in prompt and completion messages and multiplying that number by the corresponding model cost. Under the hood, it’s really just a simple cost dictionary and some utility functions for getting the prices right. It also accounts for different tokenizers and float precision errors.
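
A minimal sketch of that idea, assuming Python with tiktoken; the model names and per-token prices below are illustrative, not Tokencost's actual API or current rates:

    import tiktoken

    # Illustrative USD prices per token; real prices change often.
    TOKEN_COSTS = {
        "gpt-4o": {"prompt": 5.00 / 1_000_000, "completion": 15.00 / 1_000_000},
        "gpt-3.5-turbo": {"prompt": 0.50 / 1_000_000, "completion": 1.50 / 1_000_000},
    }

    def estimate_cost(prompt: str, completion: str, model: str) -> float:
        # Count tokens with the model's tokenizer, then multiply by the per-token price.
        encoding = tiktoken.encoding_for_model(model)
        prices = TOKEN_COSTS[model]
        return (len(encoding.encode(prompt)) * prices["prompt"]
                + len(encoding.encode(completion)) * prices["completion"])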

Surprisingly, most model providers don't actually report how much you spend until your bills arrive. We built Tokencost internally at AgentOps to help users track agent spend, and we decided to open source it to help developers avoid nasty bills.

refulgentis: [flagged, dead post]
Areibman:
This is unnecessarily harsh. Not every model has a publicly available tokenizer, and in my experience a fallback like cl100k is usually a decent enough estimator.

Besides, there's a warning message for when you specify a model without a known tokenizer.
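
Roughly that pattern, as a sketch (not the library's exact code):

    import logging

    import tiktoken

    def get_encoding(model: str) -> tiktoken.Encoding:
        # Use the model's own tokenizer when tiktoken knows it;
        # otherwise warn and fall back to cl100k_base.
        try:
            return tiktoken.encoding_for_model(model)
        except KeyError:
            logging.warning("No known tokenizer for %s; falling back to cl100k_base", model)
            return tiktoken.get_encoding("cl100k_base")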

If you're upset with the implementation, you can always raise an issue or fix it yourself.

refulgentis:
> This is unnecessarily harsh.

Which part? All I can tease out from your comment is "the lies are impossible" (agreed!) and "close enough, AFAIK". (It's not: the closest of the big five has a percent error of 24%, see the end of this comment. E.g., GPT-4o's tokenizer has twice the vocabulary, so you'd expect roughly half the token count.)
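
You can check that divergence directly with tiktoken (o200k_base is GPT-4o's encoding; exact counts depend on the sample text):

    import tiktoken

    cl100k = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5
    o200k = tiktoken.get_encoding("o200k_base")    # GPT-4o
    print(cl100k.n_vocab, o200k.n_vocab)           # ~100k vs. ~200k vocabulary entries

    text = "Sample prompt to compare the two encodings."
    print(len(cl100k.encode(text)), len(o200k.encode(text)))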

> Not every model has a publicly available tokenizer,

Right, e.g. the Claude 3s and Geminis. So why are the Claude 3s and Geminis listed as supported models?

> using a fallback

CL100K isn't a fallback; it's the only tokenizer.

> like cl100k is usually a decent enough estimator from my experience.

I'm very surprised to hear this, given the stats below showing a minimum error of 24%.

> If you're upset with the implementation, you can always raise an issue

I'm not "upset with the implementation"; I'm pointing out that the claim of being able to make financial calculations for 400 different LLMs is false.

> or fix it yourself

How?

As you pointed out, it's unfixable for at least some subset of the models they claim to support, e.g. Gemini and the Claude 3s.

Let's pretend it was possible.

Why?

If someone puts out a library making wildly false claims, is the right thing to do to stay quiet and fix the library until its claims become true?

> usually a decent enough estimator

No, certainly not for financial calculations, which are the stated core purpose of the library.

As promised, data: I picked the simplest example from my unit tests because you won't believe the divergence on larger ones.

OpenAI (CL100K): 18 in / 1 out = 19 tokens.

Gemini 1.5: 41 in / 14 out = 55 tokens (65% error).

Claude 3: 21 in / 4 out = 25 tokens (24% error).

Llama 3: 23 in / 5 out = 28 tokens (32% error).

Mistral: 10 in / 3 out = 13 tokens (46% error).
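
(The error figures take the CL100K estimate of 19 against each model's own count; a quick check, with the counts above hard-coded:)

    # Percent error of the CL100K estimate vs. each model's own token count.
    cl100k_estimate = 19
    actual_counts = {"Gemini 1.5": 55, "Claude 3": 25, "Llama 3": 28, "Mistral": 13}
    for model, count in actual_counts.items():
        print(f"{model}: {abs(cl100k_estimate - count) / count:.0%} error")
    # Gemini 1.5: 65%, Claude 3: 24%, Llama 3: 32%, Mistral: 46%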

weird-eye-issue:
I can tell you've never actually built anything worthwhile.
refulgentis:
Lol. A drive-by insult that's A) obviously wrong (and, funnily enough, it's the attention to detail that got me there) and B) in service of defending "24% error in financial calculations is actually fine".
weird-eye-issue:
That's fine; different people have different definitions of worthwhile.
refulgentis:
I hope your day gets better!