Tokencost works by counting the number of tokens in prompt and completion messages and multiplying that number by the corresponding model cost. Under the hood, it’s really just a simple cost dictionary and some utility functions for getting the prices right. It also accounts for different tokenizers and float precision errors.
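The count-times-price idea can be sketched roughly as below. This is a minimal illustration, not Tokencost's actual code: the `PRICES` table, the model name `example-model`, the prices themselves, and the `estimate_cost` helper are all hypothetical, and token counts are passed in directly rather than produced by a tokenizer. `Decimal` is used here as one way to avoid float rounding drift.

```python
from decimal import Decimal

# Hypothetical per-token prices in USD. These are made-up numbers for
# illustration, not real pricing for any provider.
PRICES = {
    "example-model": {
        "prompt": Decimal("0.00001"),
        "completion": Decimal("0.00003"),
    },
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Multiply token counts by the per-token prices for the given model."""
    price = PRICES[model]
    # Decimal arithmetic keeps tiny per-token prices exact when summed.
    return prompt_tokens * price["prompt"] + completion_tokens * price["completion"]


cost = estimate_cost("example-model", prompt_tokens=18, completion_tokens=1)
print(cost)  # 0.00021
```

The result is only as good as the token counts fed in, so the tokenizer used to produce them matters as much as the price table.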
Surprisingly, most model providers don't actually report how much you spend until your bills arrive. We built Tokencost internally at AgentOps to help users track agent spend, and we decided to open source it to help developers avoid nasty bills.
Besides, there's a warning message for when you specify a model without a known tokenizer.
If you're upset with the implementation, you can always raise an issue or fix it yourself.
Which part? All I can tease out from your comment are "the lies are impossible" (agreed!) and "close enough afaik". (It's not: the closest in the Big 5 has a percent error of 32%, see the end of this comment. E.g. GPT-4o has a tokenizer with 2x the vocab, so you'd expect ~1/2 the tokens.)
> Not every model has a publicly available tokenizer,
Right. E.g. the Claude 3s and Geminis. So why are the Claude 3s and Geminis listed as supported models?
> using a fallback
CL100K isn't a fallback, it's the only tokenizer.
> like cl100k is usually a decent enough estimator from my experience.
I'm very surprised to hear this, given stats demonstrating a minimum error of 32%.
> If you're upset with the implementation, you can always raise an issue
I'm not "upset with the implementation", I'm sharing that the claim of being able to make financial calculations for 400 different LLMs is a lie.
> or fix it yourself
How?
As you pointed out, it's unfixable for at least some subset of the ones they're claiming, e.g. Gemini and the Claude 3s.
Let's pretend it was possible.
Why?
If someone puts out a library making wildly false claims, is the right thing to do to stay quiet and fix the library until its claims are true?
> usually a decent enough estimator
No, certainly not for financial calculations, which are the stated core purpose of the library.
As promised, data: I picked the simplest example from my unit tests because you won't believe the divergence on larger ones.
OpenAI (CL100K) - 18 in/1 out = 19.
Gemini 1.5 - 41 in/14 out = 55. (65% error)
Claude 3 - 21 in/4 out = 25. (24% error)
Llama 3 - 23 in/5 out = 28. (32% error)
Mistral - 10 in/3 out = 13. (46% error)
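For anyone who wants to check the arithmetic, here is a small sketch that reproduces the percentages above from the listed counts. It assumes the error is measured as the gap between the CL100K total and each model's own total, divided by the model's own total; the variable and function names are just for illustration.

```python
# (input tokens, output tokens) per tokenizer, copied from the list above.
counts = {
    "OpenAI (CL100K)": (18, 1),
    "Gemini 1.5": (41, 14),
    "Claude 3": (21, 4),
    "Llama 3": (23, 5),
    "Mistral": (10, 3),
}


def percent_error(estimate: int, actual: int) -> float:
    """Error of the estimate relative to the actual count, in percent."""
    return abs(actual - estimate) / actual * 100


# CL100K's total is the estimate applied to every other model.
baseline = sum(counts["OpenAI (CL100K)"])

errors = {
    name: round(percent_error(baseline, sum(in_out)))
    for name, in_out in counts.items()
    if name != "OpenAI (CL100K)"
}

for name, err in errors.items():
    print(f"{name}: {err}% error")
```

Since cost scales linearly with token count, the same percentage carries straight through to the estimated bill.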