Most active commenters
  • refulgentis(7)
  • Areibman(5)
  • pamelafox(4)
  • J_Shelby_J(4)
  • simonw(3)
  • weird-eye-issue(3)
  • hansvm(3)

268 points | Areibman | 78 comments

Hey HN! Tokencost is a utility library for estimating LLM costs. There are hundreds of different models now, and they all have their own pricing schemes. It’s difficult to keep up with the pricing changes, and it’s even more difficult to estimate how much your prompts and completions will cost until you see the bill.

Tokencost works by counting the number of tokens in prompt and completion messages and multiplying that number by the corresponding model cost. Under the hood, it’s really just a simple cost dictionary and some utility functions for getting the prices right. It also accounts for different tokenizers and float precision errors.

Surprisingly, most model providers don't actually report how much you spend until your bills arrive. We built Tokencost internally at AgentOps to help users track agent spend, and we decided to open source it to help developers avoid nasty bills.
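As a sketch of the idea (the prices below are placeholders for illustration, not Tokencost's actual dictionary, which changes as providers reprice):

```python
from decimal import Decimal

# Placeholder per-token prices in USD (illustrative only; real prices
# live in Tokencost's cost dictionary and change frequently).
TOKEN_COSTS = {
    "gpt-4o": {"input": Decimal("0.000005"), "output": Decimal("0.000015")},
    "gpt-3.5-turbo": {"input": Decimal("0.0000005"), "output": Decimal("0.0000015")},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Multiply token counts by per-token prices; Decimal avoids float drift."""
    prices = TOKEN_COSTS[model]
    return prices["input"] * prompt_tokens + prices["output"] * completion_tokens
```

Token counting itself is delegated to the model's tokenizer (tiktoken for OpenAI models), which is where most of the subtlety discussed in the comments lives.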

1. simonw ◴[] No.40710871[source]
I don't understand how the Claude functionality works.

As far as I know Anthropic haven't released the tokenizer for Claude - unlike OpenAI's tiktoken - but your tool lists the Claude 3 models as supported. How are you counting tokens for those?

replies(3): >>40710980 #>>40711374 #>>40718095 #
2. dudeinhawaii ◴[] No.40710980[source]
It's open source so you can take a look (I'm not the author): https://github.com/AgentOps-AI/tokencost/blob/main/tokencost...

It looks like tiktoken is the default for most of the methods.

Disclaimer: I didn't fully trace which are being used in each case/model.

replies(2): >>40711095 #>>40711161 #
3. yelnatz ◴[] No.40711000[source]
Can you do a column and normalize them?

Too many zeroes for my blind ass, making it hard to compare.

replies(1): >>40711040 #
4. ryaneager ◴[] No.40711040[source]
Yeah, a tokens-per-$1 column would vastly help readability.
replies(1): >>40711163 #
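The conversion itself is trivial; a minimal sketch (both prices are made up for illustration):

```python
from decimal import Decimal

def tokens_per_dollar(price_per_token: Decimal) -> int:
    """Normalize a per-token price into a tokens-per-$1 column."""
    return int(Decimal(1) / price_per_token)

# Hypothetical prices purely for illustration:
prices = {"model-a": Decimal("0.000005"), "model-b": Decimal("0.0000005")}
columns = {m: tokens_per_dollar(p) for m, p in prices.items()}
# columns == {"model-a": 200000, "model-b": 2000000}
```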
5. ilaksh ◴[] No.40711050[source]
Nice. Any plans to add calculations for image input for the models that allow that?
replies(2): >>40711455 #>>40718070 #
6. Ilasky ◴[] No.40711063[source]
I dig it! Kind of related, but I made a comparison of LLM API costs vs. their leaderboard performance to gauge which models offer more bang for the buck [0]

[0] https://llmcompare.net

replies(2): >>40711502 #>>40714327 #
7. refibrillator ◴[] No.40711095{3}[source]
> # TODO: Add Claude support

There are no cases for Claude models yet.

I wonder if anyone has run a bunch of messages through Anthropic's API and used the returned token count to approximate the tokenizer?

replies(1): >>40718364 #
8. simonw ◴[] No.40711161{3}[source]
Yeah, I asked here because I dug around in the code and couldn't see how they were doing this, wanted to check I hadn't missed something.
9. qeternity ◴[] No.40711163{3}[source]
$/million tokens is the standard pricing metric.
replies(1): >>40713082 #
10. Lerc ◴[] No.40711341[source]
With all the options, there seems to be an opportunity for a single-point API that takes a series of prompts, a budget, and a quality hint, and distributes batches for the most bang for the buck.

Maybe a small triage AI to decide how effectively models handle certain prompts to preserve spending for the difficult tasks.

Does anything like this exist yet?

replies(3): >>40712921 #>>40715521 #>>40715879 #
11. Areibman ◴[] No.40711374[source]
Anthropic actually has a Claude 3 tokenizer tucked away in one of their repos: https://github.com/anthropics/anthropic-tokenizer-typescript

At this moment, Tokencost uses the OpenAI tokenizer as a default tokenizer, but this would be a welcome PR!

replies(1): >>40711504 #
12. Areibman ◴[] No.40711455[source]
Perhaps at some point! Right now, most of the demand we've been seeing is on the language side of things, as multi-modal image use really hasn't popped off yet
13. SubiculumCode ◴[] No.40711502[source]
Sure makes the case for Gemini Pro, doesn't it.
14. simonw ◴[] No.40711504{3}[source]
"This package can be used to count tokens for Anthropic's older models. As of the Claude 3 models, this algorithm is no longer accurate [...]"

I've been bugging Anthropic about this for a while, they said that releasing a new tokenizer is not on their current roadmap.

replies(1): >>40711734 #
15. throwaway211 ◴[] No.40711734{4}[source]
Imagine a coffee shop refusing to have a price list until after the coffee's been made.
replies(3): >>40712333 #>>40715060 #>>40717521 #
16. Karrot_Kream ◴[] No.40712031[source]
A whole bunch of the costs are listed as zeroes with a long run of decimal places. I noticed y'all used the Decimal library and tried to hold onto precision, so I'm not sure what's going on, but certainly some of the cheaper models just show up as "free".
17. kevindamm ◴[] No.40712333{5}[source]
In many countries a taxi won't tell you how much the ride will cost. The first time I traveled to somewhere that negotiated the cost up front it blew my mind.

Frequently, contracts will have room for additional charges if circumstances change even a little, or products will have a market rate (fish, equity, etc.).

It might seem absurd but variable cost things are not uncommon.

replies(5): >>40712570 #>>40714914 #>>40715972 #>>40720216 #>>40777992 #
18. jacobglowbom ◴[] No.40712467[source]
Nice. Does it add Vision costs too?
replies(1): >>40718073 #
19. MattGaiser ◴[] No.40712570{6}[source]
In Oman, every taxi fare was basically just one of the bills: 5 rials, 10 rials, or 20 rials. So many different ways people price things around the world.
20. zackfield ◴[] No.40712600[source]
Very cool! Is this cost directory you're using the best source for historical cost per 1M tokens? https://github.com/BerriAI/litellm/blob/main/model_prices_an...
replies(1): >>40712982 #
21. curious_cat_163 ◴[] No.40712921[source]
I have yet to find a use case where quality can be traded off.

Would love to hear what you had in mind.

replies(3): >>40713311 #>>40713472 #>>40788282 #
22. Areibman ◴[] No.40712982[source]
Best one I've found out there. If there's another, very open to replacing
23. Tao3300 ◴[] No.40713082{4}[source]
standard ≠ good
replies(2): >>40713795 #>>40717296 #
24. mvdtnz ◴[] No.40713311{3}[source]
Every single use case of LLMs inherently sacrifices quality, whether the developers are willing to admit it or not. I agree with you though that there aren't many use cases where end users would knowingly accept the trade off.
25. Lerc ◴[] No.40713472{3}[source]
It is not so much a drop in quality as that there are tasks that every model above a certain threshold will perform equally well.

Most can do 2+2 = 4.

One test prompt I use on LLMs is asking it to produce a JavaScript function that takes an ImageData object and returns a new ImageData object with an all direction Sobel edge detection. Quite a lot of even quite small models can generate functions like this.

In general, I don't even think this is a question that needs to be answered. A lot of API providers have different quality/price tiers. The fact that people are using the different tiers should be sufficient to show that at least some people are finding cases where cheaper models are good enough.

26. andenacitelli ◴[] No.40713795{5}[source]
Yes, but it’s true in the general case. Defaults are usually the defaults for a reason — someone putting thought into what makes sense for most users.
replies(1): >>40713808 #
27. Tao3300 ◴[] No.40713808{6}[source]
Not necessarily so when you're trying to sell stuff...
28. visarga ◴[] No.40714327[source]
Nice, can you make a triangle with (cost, performance, speed)? That would show the tradeoffs.
29. visarga ◴[] No.40714335[source]
You don't like a repo, you don't use it. Stop shaming people for their open source repos.
replies(1): >>40714397 #
30. oopsallmagic ◴[] No.40714390[source]
Can we get conversions for kg of CO2 emitted, too?
replies(1): >>40714663 #
31. oopsallmagic ◴[] No.40714397{3}[source]
When did the software community get so bad at handling legitimate critique?
replies(1): >>40714847 #
32. Areibman ◴[] No.40714433[source]
This is unnecessarily harsh. Not every model has a publicly available tokenizer, and using a fallback like cl100k is usually a decent enough estimator from my experience.

Besides, there's a warning message for when you specify a model without a known tokenizer.

If you're upset with the implementation, you can always raise an issue or fix it yourself

replies(1): >>40714775 #
33. _flux ◴[] No.40714663[source]
It would be nice, but how do we get this information required for the conversion?
34. jaredliu233 ◴[] No.40714720[source]
wow, this is really useful!! Just the price list alone has given me a lot of inspiration, thank you
35. refulgentis ◴[] No.40714775{3}[source]
> This is unnecessarily harsh.

Which part? All I can tease out from your comment are "the lies are impossible" (agreed!) and "close enough afaik". (it's not, the closest in the Big 5 has percent error of 32%, see end of comment. ex. GPT4o has a tokenizer with 2x the vocab so you'd expect ~1/2 the tokens)

> Not every model has a publicly available tokenizer,

Right. Ex. Claude 3s and Geminis. So why are Claude 3s and Geminis listed as supported models?

> using a fallback

CL100K isn't a fallback, it's the only tokenizer.

> like cl100k is usually a decent enough estimator from my experience.

I'm very surprised to hear this, per stats demonstrating minimum error of 32%.

> If you're upset with the implementation, you can always raise an issue

I'm not "upset with the implementation", I'm sharing that the claim of being able to make financial calculations for 400 different LLMs is a lie.

> or fix it yourself

How?

As you pointed out, it's unfixable for at least some subset of the ones they're claiming, ex. Gemini and Claude 3s.

Let's pretend it was possible.

Why?

If someone puts out a library making wildly false claims, is the right thing to do to stay quiet and fix the library until its claims are true?

> usually a decent enough estimator

No, not for financial things certainly, which is the stated core purpose of the library.

As promised, data: I picked the simplest example from my unit tests because you won't believe the divergence on larger ones.

OpenAI (CL100K) - 18 in/1 out = 19.

Gemini 1.5 - 41 in/14 out = 55. (65% error)

Claude 3 - 21 in/4 out = 25. (24% error)

Llama 3 - 23 in/5 out = 28. (32% error)

Mistral - 10 in/3 out = 13. (46% error)

replies(2): >>40718239 #>>40718874 #
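Those percent errors are each computed against the model's actual count, treating the CL100K total of 19 as the estimate; a quick sketch reproduces them:

```python
# Token counts reported above: CL100K estimate vs. each model's actual total.
CL100K_ESTIMATE = 19
actual_counts = {"Gemini 1.5": 55, "Claude 3": 25, "Llama 3": 28, "Mistral": 13}

percent_errors = {
    model: round(abs(CL100K_ESTIMATE - actual) / actual * 100)
    for model, actual in actual_counts.items()
}
# percent_errors == {"Gemini 1.5": 65, "Claude 3": 24, "Llama 3": 32, "Mistral": 46}
```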
36. fbdab103 ◴[] No.40714847{4}[source]
"You should really be ashamed of yourself for doing this" is an inappropriate response for basically anything but kicking puppies.
replies(1): >>40714946 #
37. _flux ◴[] No.40714914{6}[source]
In this case there's nothing that's variable, though, and the competition is able to pull it off precisely. Indeed, they themselves were able to do it before!
replies(1): >>40727334 #
38. anoncareer0212 ◴[] No.40714946{5}[source]
That sounds somewhat specious; lying about what you support in your cost calculation library is a pretty big oof. It's hard to rank it versus kicking puppies, but I don't think we have to stack-rank bad things to figure out whether it's okay to call out unethical behavior.
39. itake ◴[] No.40715060{5}[source]
Reminds me of the coffee shop incident in Seattle last week with the hammer
40. codewithcheese ◴[] No.40715521[source]
That's OpenRouter; they are listed
41. michaelt ◴[] No.40715879[source]
Generally, if you've got a task big enough that you're worried about pricing, it's probably going to involve thousands of API calls.

In that case you might as well make ~20 API calls to each LLM under consideration, and evaluate the results yourself.

It's far easier to evaluate a model's performance on a given prompt by looking at the output than by looking at the input alone.

42. scoot ◴[] No.40715972{6}[source]
> In many countries a taxi won't tell you how much the ride will cost.

I've only ever seen: fixed price based on destination (typically for fares originating from an airport), negotiated, or metered. A better analogy would be metered pricing, but where the cost per mile is a secret.

replies(1): >>40749171 #
43. yumaueno ◴[] No.40717022[source]
What a nice product! I think the way to count tokens depends on the language, but is this only supported in English?
replies(1): >>40717214 #
44. lgessler ◴[] No.40717214[source]
Most LLMs determine their token inventories by using byte-pair encoding, which algorithmically induces sub-word tokens from a body of text. So even in English you might see a word like "proselytization" tokenized apart into "_pro", "selyt", "iz", "ation", and non-English languages will probably (depending on their proportional representation in the training corpus) also receive token allocations in the BPE vocabulary.

Here's actual output from the GPT-4o tokenizer for English and Hindi:

    >>> [enc.decode([x]) for x in enc.encode("proselytization")]
    ['pros', 'ely', 't', 'ization']
    >>> [enc.decode([x]) for x in enc.encode("पर्यावरणवाद")]
    ['पर', '्य', 'ावरण', 'वाद']
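For the curious, the core BPE training loop fits in a few lines. A toy sketch (not any provider's actual tokenizer) that starts from characters and repeatedly merges the most frequent adjacent pair:

```python
from collections import Counter

def bpe_train(word: str, num_merges: int) -> list[str]:
    """Toy BPE: start from characters and greedily merge the most
    frequent adjacent pair, num_merges times."""
    tokens = list(word)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)  # replace the pair with one merged token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

For example, `bpe_train("banana", 1)` merges the most common pair first, yielding `['b', 'an', 'an', 'a']`. Real tokenizers learn the merge table over a huge corpus and then apply it to new text, which is why vocabulary coverage depends on training-data representation.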
45. ynniv ◴[] No.40717296{5}[source]
"Standard" doesn't imply "good", but that doesn't mean "non-standard" is better. Cost-per-quantity (L/100km) is easier to compare than quantity-per-cost (MPG) because how much you use isn't going to change based on the model. Which is to say, if two local models are both $0.00 per million tokens, they effectively have the same cost. You could argue that you might get better results by throwing out more tokens, but the solution is to add more significant digits to the price per unit.
46. amanda99 ◴[] No.40717521{5}[source]
I mean, you already pay per output token (sure, you can limit it), but it's unpredictable given the prompt?
47. sakex ◴[] No.40717573[source]
An interesting parameter that I don't read about a lot is vocab size. A larger vocab means you will need to generate fewer tokens for the same word on average, and the effective context window will be larger. This means that a model with a large vocab might be more expensive on a per-token basis but generate fewer tokens for the same sentence, making it cheaper overall. This should be taken into consideration when comparing API prices.
replies(3): >>40718221 #>>40720767 #>>40720776 #
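That comparison can be sketched with made-up numbers (neither the prices nor the token counts are real):

```python
# Hypothetical models: model-b charges more per token but, with a larger
# vocab, needs fewer tokens to encode the same text.
price_per_million_tokens = {"model-a": 0.50, "model-b": 0.80}
tokens_for_same_text = {"model-a": 1300, "model-b": 700}

def effective_cost(model: str) -> float:
    """Cost of the same passage of text under each model's tokenizer."""
    return price_per_million_tokens[model] / 1_000_000 * tokens_for_same_text[model]

# Despite the higher per-token price, model-b is cheaper overall:
# effective_cost("model-a") ~ $0.00065, effective_cost("model-b") ~ $0.00056
```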
48. pamelafox ◴[] No.40718055[source]
Are you also accounting for costs of sending images and function calls? I didn't see that when I looked through the code. I developed this package so that I could count those sorts of calls as well: https://github.com/pamelafox/openai-messages-token-helper
49. pamelafox ◴[] No.40718070[source]
I have a package here that includes calculation for images for OpenAI: https://github.com/pamelafox/openai-messages-token-helper
50. pamelafox ◴[] No.40718073[source]
I have a package here that includes calculation for images for OpenAI: https://github.com/pamelafox/openai-messages-token-helper
51. pamelafox ◴[] No.40718093[source]
I grappled with that issue for https://github.com/pamelafox/openai-messages-token-helper as I wanted to be able to use it for a quick token check with SLMs as well, so I ended up adding a parameter "fallback_to_default" for developers to indicate they're okay with assuming gpt-35 BPE encoding.
52. J_Shelby_J ◴[] No.40718095[source]
Here you go https://github.com/javirandor/anthropic-tokenizer
53. J_Shelby_J ◴[] No.40718153[source]
I’m not sure if the python tiktoken library has the cl200k tokenizer for gpt-4o, but I would imagine it does. So this library does support gpt-4o at least.
replies(1): >>40718266 #
54. J_Shelby_J ◴[] No.40718195[source]
Would anybody be interested in this for Rust? I already do everything this library does with the exception of returning the price in my LLM utils crate [1]. I do this just to count tokens to ensure prompts stay within limits. And I also support non-open ai tokenizers. So adding a price calculator function would be trivial.

[1] https://github.com/ShelbyJenkins/llm_utils

55. weird-eye-issue ◴[] No.40718221[source]
Yeah... it obviously uses the appropriate tokenizer
replies(1): >>40718459 #
56. weird-eye-issue ◴[] No.40718239{4}[source]
I can tell you've never actually built anything worthwhile
replies(1): >>40718285 #
57. refulgentis ◴[] No.40718266{3}[source]
Yes it does, and no it doesn't.

It is exactly as bad of a situation as I laid out.

It is a tiktoken wrapper that only does CL100K, doesn't bother with anything beyond that, even the message frame tokens, and claims to calculate cost for 400 LLMs.

replies(1): >>40719389 #
58. refulgentis ◴[] No.40718285{5}[source]
Lol. Drive-by insult that's A) obviously wrong (funnily enough, it's the attention to detail that got me there) and B) in service of caping for "33% error in financial calculations is actually fine"
replies(1): >>40718407 #
59. GranPC ◴[] No.40718364{4}[source]
Sorta: https://github.com/javirandor/anthropic-tokenizer
60. weird-eye-issue ◴[] No.40718407{6}[source]
That's fine, different people have different definitions of worthwhile
replies(1): >>40718421 #
61. refulgentis ◴[] No.40718421{7}[source]
I hope your day gets better!
62. spencerchubb ◴[] No.40718459{3}[source]
Some companies like Anthropic haven't publicly released the tokenizer used in their API, thereby making it impossible for this library to use the appropriate tokenizer in all cases. Be careful about how you use the word 'obviously'
63. hansvm ◴[] No.40718874{4}[source]
This thread looks spicy, so I won't address most of it. On the "2x vocab == 1/2 the tokens" idea though:

It usually doesn't pan out that way. Tokens aren't uniformly likely in normal text. They tend to follow some kind of a power law (like pdf(x) ~ x^0.4), and those 100k extra tokens, even assuming they're all available for use in purely textual inputs/outputs, will only move you from something like 11.4 bits of entropy per token to 12.1 (a 6% improvement).
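That arithmetic can be sanity-checked directly (the bit values are the estimates above, not measured quantities):

```python
# Entropy-per-token estimates from the argument above (bits):
bits_small_vocab, bits_large_vocab = 11.4, 12.1

# For the same text, token count scales inversely with bits per token,
# so doubling the vocab shrinks the count by only a few percent:
token_reduction = 1 - bits_small_vocab / bits_large_vocab  # ~0.06, i.e. ~6%
```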

With that base idea in mind, how do we square that with your observations of large errors? It's a bit hard to know for certain since you didn't tell us which thing you're encoding, but:

1. Using an estimator, even if fairly precise and unbiased across all text, will have high variance for small inputs. If you actually need to estimate costs accurately for _each_ query (not just have roughly accurate costs summed across many queries or an accurate cost for a large query), this project is, as you pointed out, definitely not going to work.

2. Assuming your query distribution matches the tokenizer's training data in some sense, you would expect those errors to balance out over many queries (comparing total predicted costs to total actual costs) or over a single large query. That's still useful for a lot of people (e.g., to estimate the cost of running models across a large internal corpus).

3. Out-of-distribution queries are another interesting use-case where this project falls flat. IIRC somebody here on HN comments frequently about using LLMs for Hebrew text (specifically noting good performance with no tokenization, which is another fun avenue of research), and if any of the models included Hebrew-specific tokenization (I think code-specific tokenization is probably more likely in the big models, not that the specific example matters), you'll likely find that the model in question is much cheaper than the rest for those kinds of queries. There's no free lunch, and you'll necessarily also find pockets of other query types where that model's tokenizer is more expensive than the other tokenizers. This project doesn't have the ability to divine that sort of discrepancy.

4. Even being within 2x on costs (especially when we're talking about 3-4 orders of magnitude of discrepancy in the costs for the different models) is useful. It lets you accomplish things like figuring out roughly the best model you can afford for a kind of task.

Separately:

> This is unnecessarily harsh, I'm not "upset with the implementation", ...

I think the problem people were noting was your tone. Saying you're uncomfortable with the methodology, that you won't use it, highlighting the error bars, pointing out pathological inputs, ..., are all potentially "interesting" to someone. Telling a person they should feel ashamed and are clueless is maybe appropriate somewhere sometimes, but it's strongly frowned upon at HN, usually shouldn't be done publicly in any setting, and is also a bit extreme for a project which is useful even in its current state. Cluelessness is an especially hard claim to justify when the project actually addresses your concern (just via a warning instead of a hard failure or something, which isn't your preferred behavior).

replies(1): >>40719623 #
64. armen99 ◴[] No.40719275[source]
This is great project! I would love to see something that calculates training costs as well.
65. J_Shelby_J ◴[] No.40719389{4}[source]
> tiktoken.encoding_for_model(model)

Calling this where model == 'gpt-4o' will encode with CL200k no?

But yes, I do agree with you. I had a hard time implementing non-tiktoken tokenizers for my project. I ended up manually adding tokenizer.json files into my repo.[1] The other option is downloading from HF, but the official repos where the model's tokenizer.json lives require agreeing to their terms to access. So it requires an HF key and agreeing to the terms, which is not a good experience for a consumer of the package.

> Message frame tokens?

Do you mean the chat template tokens? Oh, that's another good point. Yeah, it counts OpenAI prompt tokens, but you're right that it doesn't count chat template tokens, so that's another source of inaccuracy. I solved this by implementing a Jinja templating engine to create the full prompt. [2] Granted, both llama.cpp and mistral-rs do this on the backend, so it's purely for counting tokens. I guess it would make sense to add a function to convert tokens to dollars.

[1] https://github.com/ShelbyJenkins/llm_utils/tree/main/src/mod... [2] https://github.com/ShelbyJenkins/llm_utils/blob/main/src/pro...

replies(1): >>40723841 #
66. refulgentis ◴[] No.40719623{5}[source]
I love the deep dive!

On the other hand, this situation is bad and the idea we should ignore it is misguided.

Below, we show that given only 1 tokenizer is used, any output collapses to a constant: the per-token cost. This is why it's shameful: I get that people can't believe it's just C100K, but it is, and the writers know that, and they know at that point there is no function, just a constant.

> (especially when we're talking about 3-4 orders of magnitude of discrepancy in the costs for the different models)

Between a large and small model from the same provider, but not inter-provider.

The OOM mention hides the ball, yet shows something very important: no one uses a library to get a rough cost estimate when there's a 3-OOM difference. You would use it if you were comparing models closer in cost... except you can't, because the token count is a constant, because they only use 1 tokenizer.

> It lets you accomplish things like figuring out roughly the best model you can afford for a kind of task.

The library calls Tiktoken to get the C100K count of tokens. Cost = cost per token * tokens. If it is only using C100K, tokens is constant, and the only relevant thing is cost per token, another constant. Now we're outside the realm of even needing a function.

> you're uncomfortable with the methodology, that you won't use it, highlighting the error bars, pointing out pathological inputs, ..., are all potentially "interesting" to someone.

Tangential critiques are preferable? Is the issue pathological inputs? Or is the issue that its stated, documented, and explicit purpose is cost calculation based on token calculation for 400 LLMs, and it only supports 1 tokenizer and isn't even trying to make accurate cost estimates? It's passing your string to Tiktoken for a C100K count, it's not doing the bare minimum of every other tokenizer library I've seen that builds on Tiktoken.

Note there are no error bars that will satisfy because A) it's input dependent and B) it's impossible to get definitive error bars because the tokenizers they claim to support don't have any public documentation, anywhere.

It's shameful to ship code that claims to calculate financial costs, is off by a minimum of 30%+, and doesn't document any of that, anywhere. This is commonly described as fraud. Shameful is a weak claim.

replies(1): >>40720538 #
67. szundi ◴[] No.40720216{6}[source]
These are called the variable variable costs now
68. hansvm ◴[] No.40720538{6}[source]
> Tangential critiques are preferable?

Not at all. Reasoned arguments and adding information (the examples I gave were what I thought your main points were) to the discussion are preferable to (what seemed to be) character attacks. Your comment here, as an example, was mostly great. It provides the same level of usefulness to anyone reading it (highlighting that the computation is just C100K and that people will be misled if they try to use it the wrong way), and you also added reasoned counter-arguments to my OOM idea and several other interesting pieces of information. To the extent that you kept the character attacks against the author, you at least softened the language.

Respectfully attacking ideas instead of people is especially important in online discourse like this. Even if you're right, attacking people tends to spiral a conversation out of control and convince no one (often persuading them of the opposite of whatever you were trying to say).

> just C100K

It's not just C100K though. It is for a few models [0], but even then the author does warn the caller (mind you, I prefer mechanisms like an `allow_approximate_token_count=False` parameter or whatever, but that's not fraud on the author's part; that's a dangerous API design).

Going back to the "tone" thing, calling out those sorts of deficiencies is a great way to warn other people, let them decide if that sort of thing matters for their use case, point out potential flaws in your own reasoning (e.g., it's not totally clear to me if you think the code always uses C100K or always uses it for a subset of models, but if it's the former then you'd probably be interested in knowing that the tokenizer is actually correct for most models) and discuss better API designs. It makes everyone better off for having read your comment and invites more discussions which will hopefully also make everyone better off.

> outside the realm of even needing a function

Maybe! I'd argue that it's useful to have all those prices (especially since not all tokens are created equally) in one place somewhere, but arguing that this is left-pad for LLM pricing is also a reasonable thing to talk about.

> it's impossible to get definitive error bars

That's also true, but that doesn't matter for every application. E.g., suppose you want to run some process on your entire corporate knowledge-base and want a ballpark estimate of costs. The tokenizer error is on average much smaller than the 30%+ you saw for some specific (currently unknown to us here at HN) very small input. Just run your data through this tool, tally up the costs, and you ought to be within 10%. Nobody cares if it's a $900 project or a $1300 project (since nobody is allocating expensive, notoriously unpredictable developers to a project with only 10-30% margins). You just tell the stakeholders it'll cost $2k and a dev-week, and if it takes less then everyone is happily surprised. If they say no at that estimate, they probably wouldn't have been ecstatic with the result if it actually cost $900 and a dev-day anyway.

[0] https://github.com/AgentOps-AI/tokencost/blob/main/tokencost...

replies(1): >>40720937 #
69. ◴[] No.40720767[source]
70. neverokay ◴[] No.40720776[source]
Gemini also only charges for output tokens, not sure if that’s considered.

All in all this is something I was looking for or was roughly going to do to compare costs. Cool stuff.

71. refulgentis ◴[] No.40720937{7}[source]
I really appreciate your engagement here and think it has great value on a personal level, but the length and claims tend to hide two very obvious, straightforward things that are hilariously bad to the point its unbelievable:

1. They only support GPT3.5 and GPT4.0. Note here: [1], and that gpt-4o would get swallowed into gpt-4-0613.

2. This will lead to massive, significant, embarrassingly large error in calculations. Tokenizers are not mostly the same, within 10% error.

# Explicating #1, Responsive to ex. "It's not just C100K though. It is for a few models [0]".

The link is to Tiktoken, OpenAI's tokenization library. There are literally more than GPT3.5 and GPT4.0 there, but they're just OpenAI's models, no one else's, none of the others in the long list in their documentation, and certainly not 400.

Most damning? There are only 2 other tokenizers, long deprecated, used only for deprecated models not served anymore, so you're not calculating costs with them. The only live ones are c100k and o200k. As described above, and shown in [1], their own code kneecaps the o200k and will use c100k

# Explicating #2

Let me know what you'd want to see if you're curious about the 30%+ error thing. I don't want to guess at a test suite that would make you confident you need to revise a prior that there's only +/- 10% difference between arbitrary tokenizers.

For context, I run about 20 unit tests, for each of the big 5 providers, with the same prompts, to capture their input and output token counts to make sure I'm billing accurately.

# Conclusion

Just to save you time, I think the best way I can provide some value is token count experiments demonstrating error. You won't be able to talk me down to "eh, lets just say its +/- 10%, thats good enough for most people!" --- It matters, if it didn't, they'd explicate at least some of this. Instead, its "tokenization for 400 LLMs!"

[1] https://github.com/AgentOps-AI/tokencost/blob/e1d52dbaa3ada2...

replies(1): >>40745274 #
72. refulgentis ◴[] No.40723841{5}[source]
>> tiktoken.encoding_for_model(model)

> Calling this where model == 'gpt-4o' will encode with CL200k no?

No, it will never use O200K. I don't know how to word where it's located without sounding aggro, apologies: read below, i.e. the rest of the method.

They copied demo code for Tiktoken with an allowlist without gpt-4o in it, because the demo code is from before 4o.

The demo code has an allowlist that does string matching, and if it's not one of 5 models, none of which are gpt-4o, it says "eh, if it starts with gpt-4, just use gpt-4-0613" and makes a recursive call.

You can't really blame them, because all they did was copy demo code from OpenAI from before gpt-4o, but I hope you get a giggle out of the extreme clown car this situation is. It's a really bad paper-thin out-of-date tiktoken wrapper that can only do c100k and claims support for 400 LLMs.

Really bonkers.

I know you gotta read the whole method to get it, but people really shouldn't have just been like "my word! it's mean to say they don't get it!" -- it's horrible.

https://github.com/AgentOps-AI/tokencost/blob/e1d52dbaa3ada2...

73. kevindamm ◴[] No.40727334{7}[source]
Perhaps they have a new tokenization method that's non-deterministic? If there are parallel lookaheads, not necessarily an rng, race conditions could make for variable cost. Or an expansion of certain terms into multiple token outputs, but the selection of which terms are expanded is based on a dynamic pool of Named Entities being recognized. Or maybe they just want to hide their process.. there is some secret sauce there even if it's deterministic, and so much depends on getting a good initial embedding, I've seen a tokenizer make or break an otherwise great model.

I am merely hypothesizing; it may well be deterministic, but I'm not going to assume it is.

74. hansvm ◴[] No.40745274{8}[source]
Oh I see (tiktoken). That's my mistake. I naively assumed the only good reason to pull in a 3rd party lib like that is if it actually did a reasonable amount of work.

> curious about the 30%+ error thing

I'm mildly curious. I have no doubt that small strings will often have high relative errors. I'd be surprised though if sum(estimated)/sum(actual) were very far from 1 when you copied in either a large piece of text or many small pieces of text, outside of specialized domains out of the normal scope of that tokenizer (e.g., throwing LaTeX code into something trained just on Wikipedia).

That's more for entropic reasons than anything else. The only way that's true is if (1) some of these tokenizers are much less naive than the normal LLM literature and actually approach entropic bounds, or (2) the baseline implementations are especially bad so that there's a lot of headroom for improvements.

What happens when you throw in something like a medium-sized plain-text wikipedia article (say, the first half as input and the second as output)?

> messaging -- tokenization for 400 LLMs

Alright, I'm sold. I'm still partial to Hanlon's razor for these sort of things, but that ought to be patched.

75. sitkack ◴[] No.40749171{7}[source]
Where the cost per mile is published, but their mile ain’t yer mile.
76. Breza ◴[] No.40777992{6}[source]
This is what Washington DC did when I moved here. They theoretically had zones, but in reality it was arbitrary. Moving to meters was an amazing development.

Similarly, as LLMs become more and more commonplace, the pricing models will need to be more predictable. My LLM expenses are only around $100/month, but it's a bigger impediment to pushing projects to production when I can't tell the boss exactly how it'll be priced.

77. Breza ◴[] No.40788282{3}[source]
I've encountered plenty of tasks where lower quality models work quite well. I prefer using Claude 3 Opus, DBRX, or Llama-3, but that level of quality isn't always needed. Here are a few examples.

Top story picker. Given a bunch of news stories, pick which one should be the lead story.

Data viz color picker. Given a list of categories for a chart, return a color for each one.

Windows Start menu. Given a list of installed programs and a query, select the five most likely programs that the user wants.