
268 points | Areibman | 7 comments

Hey HN! Tokencost is a utility library for estimating LLM costs. There are hundreds of different models now, and they all have their own pricing schemes. It’s difficult to keep up with the pricing changes, and it’s even more difficult to estimate how much your prompts and completions will cost until you see the bill.

Tokencost works by counting the number of tokens in prompt and completion messages and multiplying that number by the corresponding model cost. Under the hood, it’s really just a simple cost dictionary and some utility functions for getting the prices right. It also accounts for different tokenizers and float precision errors.
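
The core mechanism looks roughly like this (a simplified sketch rather than the library's actual API; the prices shown are illustrative):

    import tiktoken

    # Illustrative per-token prices in USD; the real table covers hundreds of models.
    MODEL_PRICES = {
        "gpt-4":         {"prompt": 3e-05, "completion": 6e-05},
        "gpt-3.5-turbo": {"prompt": 5e-07, "completion": 1.5e-06},
    }

    def estimate_cost(prompt: str, completion: str, model: str) -> float:
        """Count tokens with the model's own tokenizer, then multiply by its per-token prices."""
        enc = tiktoken.encoding_for_model(model)
        prices = MODEL_PRICES[model]
        return (len(enc.encode(prompt)) * prices["prompt"]
                + len(enc.encode(completion)) * prices["completion"])

    print(estimate_cost("What is 2+2?", "4", "gpt-3.5-turbo"))  # a tiny fraction of a cent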

Surprisingly, most model providers don't actually report how much you spend until your bills arrive. We built Tokencost internally at AgentOps to help users track agent spend, and we decided to open source it to help developers avoid nasty bills.

1. Lerc ◴[] No.40711341[source]
With all the options, there seems to be an opportunity for a single API that takes a series of prompts, a budget, and a quality hint, and distributes the batch across models for the most bang for the buck.

Maybe a small triage AI to decide how well each model handles a given prompt, so spending is preserved for the difficult tasks.
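
Roughly something like this sketch, where the model names, prices, and difficulty heuristic are all made up for illustration:

    # Hypothetical candidates, cheapest first; price is USD per 1K tokens (made up).
    MODELS = [
        ("small-model", 0.0005, 1),
        ("mid-model",   0.0030, 2),
        ("big-model",   0.0300, 3),
    ]

    def triage(prompt: str) -> int:
        """Toy difficulty score; in practice this could be a small, cheap classifier model."""
        return 3 if "prove" in prompt else 2 if "code" in prompt else 1

    def route(prompts: list[str], budget: float, quality_hint: int = 1):
        """Send each prompt to the cheapest adequate model; fall back when the budget runs out."""
        plan, spent = [], 0.0
        for p in prompts:
            needed = max(triage(p), quality_hint)
            name, price, _ = next(m for m in MODELS if m[2] >= needed)
            est = len(p) / 4 / 1000 * price      # rough ~4 chars-per-token estimate
            if spent + est > budget:             # over budget: drop to the cheapest model
                name, price, _ = MODELS[0]
                est = len(p) / 4 / 1000 * price
            plan.append((p, name))
            spent += est
        return plan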

Does anything like this exist yet?

replies(3): >>40712921 #>>40715521 #>>40715879 #
2. curious_cat_163 ◴[] No.40712921[source]
I have yet to find a use case where quality can be traded off.

Would love to hear what you had in mind.

replies(3): >>40713311 #>>40713472 #>>40788282 #
3. mvdtnz ◴[] No.40713311[source]
Every single use case of LLMs inherently sacrifices quality, whether the developers are willing to admit it or not. I agree with you, though, that there aren't many use cases where end users would knowingly accept the trade-off.
4. Lerc ◴[] No.40713472[source]
It's not so much a drop in quality as the fact that there are tasks that every model above a certain threshold will perform equally well.

Most can do 2+2 = 4.

One test prompt I use on LLMs is asking them to produce a JavaScript function that takes an ImageData object and returns a new ImageData object with all-direction Sobel edge detection applied. Quite a lot of even fairly small models can generate functions like this.

In general, I don't even think this is a question that needs to be answered. A lot of API providers have different quality/price tiers. The fact that people are using the different tiers should be sufficient to show that at least some people are finding cases where cheaper models are good enough.

5. codewithcheese ◴[] No.40715521[source]
That's OpenRouter; they're listed.
6. michaelt ◴[] No.40715879[source]
Generally, if you've got a task big enough that you're worried about pricing, it's probably going to involve thousands of API calls.

In that case you might as well make ~20 API calls to each LLM under consideration, and evaluate the results yourself.

It's far easier to evaluate a model's performance on a given prompt by looking at the output than by looking at the input alone.
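
Something along these lines is usually enough (assuming OpenAI-compatible endpoints; the model names are just placeholders for whatever you're comparing):

    from openai import OpenAI

    client = OpenAI()
    candidate_models = ["gpt-4o-mini", "gpt-3.5-turbo"]   # placeholders
    sample_prompts = ["...", "..."]   # ~20 prompts representative of the real workload

    for model in candidate_models:
        for prompt in sample_prompts:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            print(model, "->", resp.choices[0].message.content[:200])  # eyeball the outputs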

7. Breza ◴[] No.40788282[source]
I've encountered plenty of tasks where lower-quality models work quite well. I prefer using Claude 3 Opus, DBRX, or Llama-3, but that level of quality isn't always needed. Here are a few examples.

Top story picker. Given a bunch of news stories, pick which one should be the lead story.

Data viz color picker. Given a list of categories for a chart, return a color for each one.

Windows Start menu. Given a list of installed programs and a query, select the five most likely programs that the user wants.
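
For example, the story picker is little more than a prompt like this (toy headlines; send it to whatever inexpensive model is handy):

    # Toy version of the "top story picker"; a cheap model handles this fine.
    headlines = ["Fed holds rates steady", "New LLM tops benchmarks", "Local team wins title"]
    prompt = (
        "Pick the single most newsworthy lead story from this list and reply with its number only:\n"
        + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(headlines))
    )
    # Send `prompt` to the model of your choice; the answer costs one output token.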