205 points tdchaitanya | 19 comments
    1. pbd ◴[] No.45094941[source]
    GPT-4 at $24.7 per million tokens vs Mixtral at $0.24 - that's a 100x cost difference! Even if routing gets it wrong 20% of the time, the economics still work. But the real question is how you measure 'performance' - user satisfaction doesn't always correlate with technical metrics.
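A quick back-of-the-envelope in Python (the assumption that a wrong route means paying for a GPT-4 retry on that request is mine, just to illustrate the point):

    # Blended price per million tokens when a router sends most traffic
    # to the cheap model and falls back to GPT-4 on the requests it gets wrong.
    gpt4_price = 24.70      # $/M tokens, from above
    mixtral_price = 0.24    # $/M tokens, from above
    error_rate = 0.20       # assumed fraction of requests that end up on GPT-4

    blended = (1 - error_rate) * mixtral_price + error_rate * gpt4_price
    print(blended)                 # ~5.13 $/M tokens
    print(gpt4_price / blended)    # still ~4.8x cheaper than GPT-4-only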
    replies(5): >>45095081 #>>45095225 #>>45095267 #>>45095811 #>>45095813 #
    2. Keyframe ◴[] No.45095081[source]
    number of complaints / million tokens?
    3. pqtyw ◴[] No.45095225[source]
    > GPT-4 at $24.7 per million tokens

    While technically true, why would you want to use it when OpenAI itself provides a bunch of models that are many times cheaper and better?

    replies(1): >>45095604 #
    4. FINDarkside ◴[] No.45095267[source]
    It's trivial to get a better score than GPT-4 at 1% of the cost by using my proprietary routing algorithm that routes all requests to Gemini 2.5 Flash. It's called GASP (Gemini Always, Save Pennies)
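    For what it's worth, the whole "algorithm" fits in three lines (a Python sketch, name and all hypothetical, obviously):

        def gasp_route(request):
            # GASP: Gemini Always, Save Pennies. Ignore the request entirely.
            return "gemini-2.5-flash"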
    replies(1): >>45095832 #
    5. KTibow ◴[] No.45095604[source]
    RouterBench is from March 2024.
    6. simpaticoder ◴[] No.45095811[source]
    PPT (price-per-token) is insufficient to compute cost. You will also need to know an average tokens-per-interaction (TPI). They multiply to give you a cost estimate. A 0.01x PPT advantage is wiped out by a 100x TPI.
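    A two-line illustration in Python (the 0.01x/100x figures are the hypothetical from above, not measurements):

        # cost per interaction = price-per-token (PPT) * tokens-per-interaction (TPI)
        cheap_model  = 0.01 * 100   # 0.01x the price, but 100x the tokens
        pricey_model = 1.00 * 1     # baseline
        assert cheap_model == pricey_model  # the PPT advantage is wiped out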
    replies(1): >>45096397 #
    7. mkoubaa ◴[] No.45095813[source]
    > How you measure 'performance'

    I heard the best way is through valuations

    8. nutjob2 ◴[] No.45095832[source]
    Does anyone working in an individual capacity actually end up paying for Gemini (Flash or Pro)? Or does Google boil you like a frog and you end up subscribing?
    replies(4): >>45095961 #>>45096427 #>>45099635 #>>45100282 #
    9. aspect8445 ◴[] No.45095961{3}[source]
    I've used Gemini in a lot of personal projects. At this point I've probably made tens of thousands of requests, sometimes exceeding 1k per week. So far, I haven't had to pay a dime!
    replies(1): >>45097258 #
    10. monsieurbanana ◴[] No.45096397[source]
    Are you saying that some models will take 100x more tokens than others (models in the same ballpark) for the same task? Is the 100x a real measured figure or just a number to illustrate the point?
    replies(2): >>45096631 #>>45102666 #
    11. dcre ◴[] No.45096427{3}[source]
    I've paid a few dollars a month for my API usage for about 6 months.
    12. simpaticoder ◴[] No.45096631{3}[source]
    With thinking models, yes, 100x is not just possible but probable. You get charged for the intermediate thinking tokens even if you don't see them (which is the case for Grok, for example). And even if you do see them, they won't necessarily add value.
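    A rough sketch of what that does to the bill (numbers invented for illustration, not measured):

        # You pay for output tokens whether they're the visible answer or the
        # hidden "thinking" tokens emitted before it.
        price_per_token = 1e-6                       # hypothetical flat output price
        plain   = 200 * price_per_token              # 200 answer tokens
        thinker = (50 + 20_000) * price_per_token    # 50 visible + 20k reasoning tokens
        print(thinker / plain)                       # ~100x the cost for the same answer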
    replies(1): >>45113503 #
    13. worm00111 ◴[] No.45097258{4}[source]
    How come you don't need to pay? Do you get it for free somehow?
    replies(1): >>45097410 #
    14. KETHERCORTEX ◴[] No.45097410{5}[source]
    There's a free tier for the API.
    replies(1): >>45098785 #
    15. drittich ◴[] No.45098785{6}[source]
    "When you use Unpaid Services, including, for example, Google AI Studio and the unpaid quota on Gemini API, Google uses the content you submit to the Services and any generated responses to provide, improve, and develop Google products and services and machine learning technologies, including Google's enterprise features, products, and services, consistent with our Privacy Policy.

    To help with quality and improve our products, human reviewers may read, annotate, and process your API input and output. Google takes steps to protect your privacy as part of this process. This includes disconnecting this data from your Google Account, API key, and Cloud project before reviewers see or annotate it. Do not submit sensitive, confidential, or personal information to the Unpaid Services."

    Reference: https://ai.google.dev/gemini-api/terms

    16. baq ◴[] No.45099635{3}[source]
    If I actually had time to work on my hobby projects, Gemini Pro would be the first thing I'd spend money on. As it is, it's amazing how much progress you can squeeze out of those 5 chats every 24h; I can get a couple hours of before-times hacking done in 15 minutes, which is incidentally when free usage gets throttled and my free time runs out.
    17. ivape ◴[] No.45100282{3}[source]
    You get 1500 prompts on AI Studio across a few Gemini Flash models. I think I saw 250 or 500 for 2.5. It's basically free and beats the consumer rate limits of the big apps (Claude, ChatGPT, Gemini, Meta). I wonder when they'll cut this off.
    18. datadrivenangel ◴[] No.45102666{3}[source]
    The GPT-5 models use ~10x more tokens depending on the reasoning settings.
    19. monsieurbanana ◴[] No.45113503{4}[source]
    > With thinking models, yes 100x is not just possible, but probable

    So the answer is no then, because I don't put reasoning and non-reasoning models in the same ballpark when it comes to token usage. You can just turn off reasoning.
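    For example, with Gemini 2.5 Flash you can zero out the thinking budget (a sketch using the google-genai Python SDK; field names are from memory, so check the current docs):

        from google import genai
        from google.genai import types

        client = genai.Client()  # picks up the API key from the environment
        resp = client.models.generate_content(
            model="gemini-2.5-flash",
            contents="Summarize this ticket in one sentence.",
            config=types.GenerateContentConfig(
                # a budget of 0 disables reasoning tokens entirely
                thinking_config=types.ThinkingConfig(thinking_budget=0),
            ),
        )
        print(resp.text)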