
196 points zmccormick7 | 1 comment
asdev ◴[] No.45387765[source]
I don't think intelligence is increasing. Arbitrary benchmarks don't reflect real-world usage. Even with all the context they could possibly have, these models still miss or hallucinate things. That doesn't make them useless, but saying context is the bottleneck is incorrect.
replies(3): >>45388096 #>>45388362 #>>45398947 #
reclusive-sky ◴[] No.45388096[source]
I agree; I often see Opus 4.1 and GPT-5 (Thinking) make astoundingly stupid decisions with full confidence, even on trivial tasks requiring minimal context. Assuming they would make better decisions "if only they had more context" is a fallacy.
replies(1): >>45388358 #
alchemist1e9 ◴[] No.45388358[source]
Is there a good example you could provide of that? I just haven't seen it personally, so I'd be interested in any examples from these current models. I'm sure we all remember the early days, when lots of examples of stupidity were posted, and it was interesting. It'd be great if people kept doing that, so we could get a better sense of which types of problems the models fail on with astounding levels of stupidity.
replies(2): >>45391205 #>>45395031 #
scoopdiwhoop ◴[] No.45391205[source]
One example I ran into recently is asking Gemini CLI to do something that isn't possible: use multiple argument tokens in a Gemini CLI custom command (https://github.com/google-gemini/gemini-cli/blob/main/docs/c...). It pretended this was possible and produced a nonsense .toml that defined multiple arguments in a syntax it had invented, so the file couldn't be loaded, even after multiple rounds of "that doesn't work, Gemini can't load this." A sketch of what a valid command file actually looks like follows below.
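
For context, per the linked docs as I understand them, a working custom command is a TOML file with a description and a single prompt, and the only supported placeholder is {{args}} (the file name and prompt text here are hypothetical examples):

    # ~/.gemini/commands/review.toml (hypothetical example)
    description = "Review a file for bugs"
    # {{args}} is replaced with whatever the user types after the command
    prompt = "Review the code in {{args}} and list potential bugs."

There's no documented syntax for declaring multiple named arguments, which is exactly the part it invented.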

So in any situation where something can't actually be done, my assumption now is that it will hallucinate a solution rather than say so.

It's been good for busywork that I know how to do but want to save time on. When I'm directing it, it works well. When I ask it to direct me, it's gonna lead me off a cliff if I let it.