
GPT-5.2 (openai.com)
1019 points by atgctg | 1 comment
dinobones No.46235868
It's becoming challenging to really evaluate models.

The amount of intelligence you can probe within a single prompt is limited: the riddles, the puzzles, they've all been solved or are mostly trivial for reasoners.

Now you have to drive a model for a few days to get a decent understanding of how good it really is. In my experience, while Sonnet/Opus may not have always led on benchmarks, they have always *felt* the best to me. It's hard to put into words exactly why, but I can just feel it.

It's the way you can just feel whether someone you're having a conversation with is deeply understanding you, somewhat understanding you, or not understanding you at all. But you don't have a quantifiable metric for this.
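
The closest thing anyone has to a metric for that feeling is pairwise preference: show two anonymous answers, ask which one felt like it understood you, and fit Elo-style ratings over many votes, which is roughly what arena-style leaderboards do. A rough sketch in Python; the model names and votes here are made up:

    from collections import defaultdict

    # Elo-style ratings from pairwise "which answer felt better" votes.
    # Model names and the vote list are invented for illustration.
    K = 32  # update step size
    ratings = defaultdict(lambda: 1000.0)

    def expected(r_a, r_b):
        # Probability that A beats B under the Elo model.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(winner, loser):
        e = expected(ratings[winner], ratings[loser])
        ratings[winner] += K * (1 - e)
        ratings[loser] -= K * (1 - e)

    votes = [("opus", "gpt"), ("opus", "gemini"), ("gpt", "gemini"),
             ("gemini", "opus"), ("opus", "gpt")]
    for winner, loser in votes:
        update(winner, loser)

    for model, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
        print(f"{model}: {r:.0f}")

It only measures which model people prefer, not why, so it captures the "feel" without explaining it.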

This is strange territory, and I don't know the path forward. We know we're definitely not at AGI.

And we know that if you use these models for long-horizon tasks, they fail at some point and just go off the rails.

I've tried using Codex with max reasoning for PRs and gotten laughable results too many times, even though Codex with max reasoning is apparently near-SOTA on code. And to be fair, Claude Code/Opus is sometimes just as bad at these "implement an idea in a big codebase, change many files, still pass the tests" tasks.

Is the solution that we start to evaluate LLMs on more long-horizon tasks? I think to some degree this was the spirit of SWE-bench Verified, right? But even that is getting saturated now.
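
Very roughly, a long-horizon eval would grade the end state after many agent steps instead of grading one response. A hand-wavy sketch; agent_step and run_tests are hypothetical placeholders I made up, not any real API:

    # Sketch of a long-horizon eval: let the model act (edit files, run
    # commands) until it says it's done or the step budget runs out, then
    # score the final state of the repo.
    MAX_STEPS = 50

    def run_long_horizon_task(agent_step, run_tests, task):
        history = []
        for _ in range(MAX_STEPS):
            action = agent_step(task, history)  # model picks next action
            history.append(action)
            if action.get("done"):
                break
        # e.g. run_tests runs the repo's test suite on the final state
        return {"steps": len(history), "tests_passed": run_tests()}

The scoring lives entirely in the end state, which is what makes these evals slow and expensive to run, and probably why they saturate slower.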

replies(2): >>46236090 >>46239295
1. Libidinalecon No.46239295
Totally agree. I just got a free trial month, I guess to win me back to ChatGPT, but I don't really know what to ask it to see whether it's on par with Gemini.

I really have a sinking feeling right now about what an absolutely giant waste of capital all this is.

I am glad all the venture capital behind this is subsidizing my intellectual noodling on a supercomputer, but my god, what have we done?

This is so much fun, but after about 100 hours with Gemini it doesn't feel like we're getting closer to "AGI". Maybe on the first day, but not now, once you see how off it can still be all the time.