Maybe a small triage AI could predict how well each model will handle a given prompt, so that spending is reserved for the difficult tasks.
Does anything like this exist yet?
Tokencost works by counting the number of tokens in prompt and completion messages and multiplying that number by the corresponding model cost. Under the hood, it’s really just a simple cost dictionary and some utility functions for getting the prices right. It also accounts for different tokenizers and float precision errors.
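To make the idea concrete, here is a minimal sketch of the count-and-multiply approach described above. It is not Tokencost's actual code or API: the `PRICES` table, the model name, the per-token prices, and the whitespace `count_tokens` helper are all made up for illustration. A real tool would use each model's own tokenizer and current provider pricing. `Decimal` is used so that tiny per-token prices do not pick up float rounding errors.

```python
from decimal import Decimal

# Hypothetical USD prices per token. Real prices differ by provider and change over time.
PRICES = {
    "example-model": {
        "prompt": Decimal("0.000001"),
        "completion": Decimal("0.000002"),
    },
}


def count_tokens(text: str) -> int:
    """Placeholder tokenizer: splits on whitespace.

    This is only a stand-in; a real implementation would use the
    tokenizer that matches the model being priced.
    """
    return len(text.split())


def estimate_cost(model: str, prompt: str, completion: str) -> Decimal:
    """Return the estimated cost in USD for one prompt/completion pair."""
    price = PRICES[model]
    prompt_cost = count_tokens(prompt) * price["prompt"]
    completion_cost = count_tokens(completion) * price["completion"]
    return prompt_cost + completion_cost


if __name__ == "__main__":
    cost = estimate_cost("example-model", "What is 2+2?", "2+2 = 4.")
    print(f"Estimated cost: ${cost}")
```

Summing `estimate_cost` over every call an agent makes gives a running total long before the bill arrives.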
Surprisingly, most model providers don't actually report how much you spend until your bills arrive. We built Tokencost internally at AgentOps to help users track agent spend, and we decided to open source it to help developers avoid nasty bills.
Would love to hear what you had in mind.
Most models can do 2+2 = 4.
One test prompt I use on LLMs is asking them to produce a JavaScript function that takes an ImageData object and returns a new ImageData object with all-direction Sobel edge detection applied. Quite a lot of models, even quite small ones, can generate functions like this.
In general, I don't even think this is a question that needs to be answered. A lot of API providers have different quality/price tiers. The fact that people are using the different tiers should be sufficient to show that at least some people are finding cases where cheaper models are good enough.
In that case you might as well make ~20 API calls to each LLM under consideration and evaluate the results yourself.
It's far easier to evaluate a model's performance on a given prompt by looking at the output than by looking at the input alone.
Top story picker. Given a bunch of news stories, pick which one should be the lead story.
Data viz color picker. Given a list of categories for a chart, return a color for each one.
Windows Start menu. Given a list of installed programs and a query, select the five most likely programs that the user wants.