The End of Moore's Law for AI? Gemini Flash Offers a Warning

(sutro.sh)

113 points sethkim | 1 comments | 03 Jul 25 17:34 UTC

simonw ◴[03 Jul 25 18:21 UTC] No.44457827[source]▶

"In a move that at first went unnoticed, Google significantly increased the price of its popular Gemini 2.5 Flash model"

It's not quite that simple. Gemini 2.5 Flash previously had two prices, depending on whether you enabled "thinking" mode. The new 2.5 Flash has just a single price, which is a lot more if you were using the non-thinking mode and may be slightly less for thinking mode.

Another way to think about this is that they retired their Gemini 2.5 Flash non-thinking model entirely, and changed the price of their Gemini 2.5 Flash thinking model from $0.15/M input and $3.50/M output to $0.30/M input (more expensive) and $2.50/M output (less expensive), where /M means per million tokens.
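To make "may be slightly less for thinking mode" concrete, here is a minimal sketch that compares the old thinking price with the new single price. The four prices come from the comment above; the helper names and the two example workloads are made up for illustration, and the old non-thinking price is left out because it isn't quoted here.

```python
# Prices in USD per million tokens, as quoted in the comment above.
OLD_THINKING = {"input": 0.15, "output": 3.50}
NEW_FLASH = {"input": 0.30, "output": 2.50}


def request_cost(prices, input_tokens, output_tokens):
    """Cost in USD of one request under linear per-token pricing."""
    total = prices["input"] * input_tokens + prices["output"] * output_tokens
    return total / 1_000_000


def breakeven_input_per_output(old, new):
    """Input tokens per output token at which both price lists cost the same."""
    return (old["output"] - new["output"]) / (new["input"] - old["input"])


# Hypothetical workloads; the token counts are illustrative, not measured.
workloads = [
    ("short prompt, long answer", 2_000, 4_000),
    ("long prompt, short answer", 100_000, 1_000),
]

for label, input_tokens, output_tokens in workloads:
    old = request_cost(OLD_THINKING, input_tokens, output_tokens)
    new = request_cost(NEW_FLASH, input_tokens, output_tokens)
    print(f"{label}: old ${old:.4f}, new ${new:.4f}")

ratio = breakeven_input_per_output(OLD_THINKING, NEW_FLASH)
print(f"new price is cheaper below about {ratio:.2f} input tokens per output token")
```

Under these numbers, whether the change is an increase or a decrease for thinking-mode users depends on the ratio of input to output tokens (thinking tokens count as output), so output-heavy workloads come out ahead and prompt-heavy ones do not.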

Another minor nit-pick:

> For LLM providers, API calls cost them quadratically in throughput as sequence length increases. However, API providers price their services linearly, meaning that there is a fixed cost to the end consumer for every unit of input or output token they use.

That's mostly true, but not entirely: Gemini 2.5 Pro (but oddly not Gemini 2.5 Flash) charges a higher rate for inputs over 200,000 tokens. Gemini 1.5 also had a higher rate for >128,000 tokens. As a result I treat those as separate models on my pricing table on https://www.llm-prices.com
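A rough sketch of the contrast being discussed: a price with a long-context tier next to the number of query-key pairs in full causal attention, which grows quadratically. The 200,000-token threshold is from the comment above; the two rates are placeholders rather than any provider's published prices, and applying the higher rate to the whole prompt once it crosses the threshold is an assumption.

```python
def tiered_input_cost(input_tokens, base_rate, long_rate, threshold=200_000):
    """USD cost of a prompt when long prompts are billed at a higher rate.

    Assumption: once the prompt exceeds `threshold` tokens, the higher rate
    applies to every token in it, not only to the tokens past the threshold.
    Rates are in USD per million tokens.
    """
    rate = long_rate if input_tokens > threshold else base_rate
    return rate * input_tokens / 1_000_000


def causal_attention_pairs(sequence_length):
    """Query-key pairs scored by full causal self-attention: n * (n + 1) / 2."""
    return sequence_length * (sequence_length + 1) // 2


# Placeholder rates for illustration only.
BASE_RATE = 1.00
LONG_RATE = 2.00

for n in (100_000, 200_000, 400_000):
    cost = tiered_input_cost(n, BASE_RATE, LONG_RATE)
    pairs = causal_attention_pairs(n)
    print(f"{n:>7} tokens: price ${cost:.2f}, attention pairs {pairs:.3e}")
```

Even with the tier, the price is piecewise linear in tokens, while the attention pair count roughly quadruples each time the prompt doubles. Real serving cost also depends on things this ignores, such as the non-attention layers and caching.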

One last one:

> o3 is a completely different class of model. It is at the frontier of intelligence, whereas Flash is meant to be a workhorse. Consequently, there is more room for optimization that isn’t available in Flash’s case, such as more room for pruning, distillation, etc.

OpenAI are on the record that the o3 optimizations were not through model changes such as pruning or distillation. This is backed up by independent benchmarks that find the performance of the new o3 matches the previous one: https://twitter.com/arcprize/status/1932836756791177316

replies(2): >>44457905 #>>44458685 #

mathiaspoint ◴[03 Jul 25 20:02 UTC] No.44458685[source]▶

>>44457827 #

I really hate the thinking. I do my best to disable it but don't always remember. So often it just gets into a loop second-guessing itself until it hits the token limit. It's also rare that it figures anything out while it's thinking, but maybe that's because I'm better at writing prompts.

replies(3): >>44458973 #>>44460494 #>>44461274 #

thomashop ◴[03 Jul 25 20:38 UTC] No.44458973[source]▶

>>44458685 #

I have the impression that the thinking helps even if the actual content of the thinking output is nonsense. It gives the model more cycles to think about the problem.

replies(1): >>44459007 #

wat10000 ◴[03 Jul 25 20:42 UTC] No.44459007[source]▶

>>44458973 #

That would be strange. There's no hidden memory or data channel; the "thinking" output is all the model receives afterwards. If it's all nonsense, then nonsense is all it gets. I wouldn't be completely surprised if a context with a bunch of apparent nonsense still helps somehow (LLMs are weird), but it would be odd.

replies(4): >>44459087 #>>44459102 #>>44459217 #>>44459436 #

yorwba ◴[03 Jul 25 20:55 UTC] No.44459102[source]▶

>>44459007 #

Attention operates entirely on hidden memory, in the sense that this per-token state usually isn't exposed to the end user. An attention head on one thinking token can attend to one thing and the same attention head on the next thinking token can attend to something entirely different, and the next layer can combine the two values, maybe on the second thinking token, maybe much later. So even nonsense filler can create space for intermediate computation to happen.
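A toy illustration of that mechanism, not a claim about how any production model behaves: two stacked causal self-attention layers with identity projections, no positional encoding, and no MLP. The embeddings and the `run` helper are invented for the example. It only shows that identical filler tokens end up with different hidden states, and that a later layer at the final position reads those states.

```python
import math


def causal_self_attention(xs):
    """Single-head causal attention using identity Q/K/V projections (toy simplification)."""
    outputs = []
    for i, query in enumerate(xs):
        visible = xs[: i + 1]  # position i may only attend to positions 0..i
        scores = [sum(q * k for q, k in zip(query, key)) for key in visible]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed = [
            sum(w * value[d] for w, value in zip(weights, visible)) / total
            for d in range(len(query))
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical 2-d embeddings: two distinct "problem" tokens, then identical filler.
PROBLEM = [[2.0, 0.0], [0.0, 1.0]]
FILLER = [0.5, 0.5]


def run(num_filler):
    """Return per-position hidden states after layer 1 and layer 2."""
    tokens = PROBLEM + [FILLER] * num_filler
    layer1 = causal_self_attention(tokens)  # hidden state, never emitted as text
    layer2 = causal_self_attention(layer1)  # can combine values from earlier positions
    return layer1, layer2


layer1, layer2 = run(num_filler=2)
print("layer-1 state at first filler: ", layer1[2])
print("layer-1 state at second filler:", layer1[3])
print("layer-2 state at final position:", layer2[-1])
```

The two filler positions carry the same visible token but different layer-1 states, because each one mixes a different slice of the preceding context. Whether a trained model makes useful computation out of that space is a separate, empirical question.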
