Edit: I never actually expected AGI from LLMs. That was snark. I just think it's notable that the fundamental gains in LLM performance seem to have dried up.
But why does this paper impact your thinking on it? It is about budget and recognizing that different LLMs have different cost structures. It's not really an attempt to improve LLM performance in absolute terms.
Rather than the much more obvious: Preference-prior Informed LinUCB For Adaptive Routing (PILFAR)
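For anyone who hasn't run into LinUCB before, the rough idea behind this kind of budget-aware routing is a contextual bandit: each candidate model is an arm, the query features are the context, and the reward trades answer quality against per-call cost. The sketch below is just plain LinUCB with a made-up quality-minus-cost reward, not the paper's actual method; the feature vector, quality score, and cost_weight are placeholders for illustration.

    import numpy as np

    class LinUCBRouter:
        """Toy LinUCB bandit that routes queries to one of several LLM backends.
        Each arm is a model; reward = quality - cost penalty (illustrative only)."""

        def __init__(self, n_models, dim, alpha=1.0):
            self.alpha = alpha
            # Per-arm ridge-regression stats: A = I + sum(x x^T), b = sum(r * x)
            self.A = [np.eye(dim) for _ in range(n_models)]
            self.b = [np.zeros(dim) for _ in range(n_models)]

        def select(self, x):
            """Pick the model with the highest upper confidence bound for context x."""
            scores = []
            for A, b in zip(self.A, self.b):
                A_inv = np.linalg.inv(A)
                theta = A_inv @ b  # estimated reward weights for this arm
                ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
                scores.append(ucb)
            return int(np.argmax(scores))

        def update(self, arm, x, quality, cost, cost_weight=0.5):
            """Fold the observed quality and cost into the chosen arm's statistics."""
            r = quality - cost_weight * cost
            self.A[arm] += np.outer(x, x)
            self.b[arm] += r * x

    # Hypothetical usage: featurize the query, pick a model, then report back.
    router = LinUCBRouter(n_models=3, dim=8)
    features = np.random.rand(8)
    arm = router.select(features)
    router.update(arm, features, quality=0.9, cost=0.02)

The "preference-prior" part of the paper presumably changes how the arm statistics are initialized or regularized, but the exploration/exploitation loop above is the generic LinUCB core.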
arXiv is essentially a blog in an academic format, popular amongst Asian and South Asian academic communities
currently you can launder reputation with it, just like "white papers" in the crypto world let people raise capital for a while
this ability will diminish as more people catch on
And most would have accepted the recommendation, because the model sold it as a less common tactic while sounding very logical.
Once you've started to argue with an LLM you're already barking up the wrong tree. Maybe you're right, maybe not, but there's no point in arguing it out with an LLM.
It's mostly hand waving, hype and credulity, and unproven claims of scalability right now.
You can't move the goal posts because they don't exist.
So many people just want to believe, instead of accepting the reality that LLMs are quite unreliable.
Personally, it's usually fairly obvious to me when LLMs are bullshitting, probably because I have lots of experience detecting it in humans.
It'll be a while until the ability to move the goalposts of "actual intelligence" is exhausted entirely.
In this case I just happened to be a domain expert and knew it was wrong. It would have required significant effort for a less experienced person to verify everything.
So far, my experience has been that it's just too early for most people / applications to worry about cost - at most, I've seen AI account for 10% of cloud costs. But I'm very curious whether others have had different experiences.
"To help with quality and improve our products, human reviewers may read, annotate, and process your API input and output. Google takes steps to protect your privacy as part of this process. This includes disconnecting this data from your Google Account, API key, and Cloud project before reviewers see or annotate it. Do not submit sensitive, confidential, or personal information to the Unpaid Services."
Reference: https://ai.google.dev/gemini-api/terms
And the kind of automation brought by LLMs is decidedly different from automation in the past, which almost always created new (usually better) jobs. LLMs won't do this (at least not to an extent where it would matter), I think. Most people in ten years will have worse jobs (more physically straining, longer hours, less pay) unless there is a political intervention.
Obviously we don't use the super expensive ones like GPT-4.5. But we don't really bother with mini models either, because GPT-4.1 etc. are cheap enough.
Stuff like speech-to-text is still way more expensive, and yes, there we do focus on cost optimization. We have no large-scale image generation use cases (yet).
Doesn't mean there aren't practical definitions depending on the context.
In essence, teaching an AI using resources meant for humans, and nothing more, would be considered AGI. That could be a practical definition, without needing much more rigour.
There is indeed no evidence we'll get there. But there is also no evidence LLMs should work as well as they do.
So the answer is no then, because I don't put reasoning and non-reasoning models in the same ballpark when it comes to token usage. You can just turn off reasoning.