
555 points | maheshrijal | 4 comments
_fat_santa ◴[] No.43708027[source]
So at this point OpenAI has 6 reasoning models, 4 flagship chat models, and 7 cost-optimized models. That's 17 models in total, and that's not even counting their older models and more specialized ones. Compare this with Anthropic, which has 7 models in total and 2 main ones that they promote.

This is getting to be a bit much; it seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things, and released it as an entirely new model rather than updating the existing ones. In fact, based on some of the other comments here, it sounds like these are just updates to their existing models, but they release them as new models to create more media buzz.

replies(22): >>43708044 #>>43708100 #>>43708150 #>>43708219 #>>43708340 #>>43708462 #>>43708605 #>>43708626 #>>43708645 #>>43708647 #>>43708800 #>>43708970 #>>43709059 #>>43709249 #>>43709317 #>>43709652 #>>43709926 #>>43710038 #>>43710114 #>>43710609 #>>43710652 #>>43713438 #
shmatt ◴[] No.43708462[source]
I'm old enough to remember the mystery and hype before o*/o1/strawberry, which was supposed to be essentially AGI. We had serious news outlets writing about senior people at OpenAI quitting because o1 was Skynet.

Now we're up to o4, and AGI is still not even in sight (depending on your definition, I know). And OpenAI is up to about 5,000 employees. I'd think that even before AGI, a new model would be able to cover for at least 4,500 of those employees being fired. Is that not the case?

replies(8): >>43708694 #>>43708755 #>>43708824 #>>43709411 #>>43709774 #>>43710199 #>>43710213 #>>43710748 #
actsasbuffoon ◴[] No.43709411[source]
Meanwhile, even the highest-ranked models can't do simple logic tasks. GothamChess on YouTube did some tests where he played against a bunch of the best models, and every single one of them failed spectacularly.

They’d happily lose a queen to take a pawn. They failed to understand how pieces are even allowed to move, hallucinated the existence of new pieces, repeatedly declared checkmate when it wasn’t, etc.

I tried it last night with Gemini 2.5 Pro, and it made it 6 turns before it started making illegal moves, and 8 turns before it got so confused about the state of the board that it refused to play with me any longer.

I was in the chess club in 3rd grade. One of the top ranked LLMs in the world is vastly dumber than I was in 3rd grade. But we’re going to pour hundreds of billions into this in the hope that it can end my career? Good luck with that, guys.

replies(4): >>43709556 #>>43710189 #>>43710252 #>>43716131 #
JFingleton ◴[] No.43710252[source]
I'm not sure why people are expecting a language model to be great at chess. Remember, they are trained on text, which is not the best medium for representing things like a chess board. They are also "general models", with only limited training on pretty much everything apart from human language.

An AlphaStar-type model would wipe the floor with them at chess.

replies(2): >>43710659 #>>43710715 #
actsasbuffoon ◴[] No.43710659{3}[source]
This misses the point. LLMs will do things like move a knight by a single square as if it were a pawn. Chess is an extremely well understood game, and the rules about how pieces move are almost certainly well represented in the training data.

These models cannot even make legal chess moves. That's incredibly basic logic, and it shows how LLMs are still completely incapable of reasoning or understanding. Many kinds of tasks are never going to be possible for LLMs unless that changes. Programming is one of those tasks.
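
For what it's worth, checking legality is purely mechanical, so it's easy to test this yourself. A rough sketch using the python-chess library (the specific moves below are just illustrative examples):

    import chess

    board = chess.Board()  # standard starting position

    def is_legal_san(board, san):
        # parse_san() raises ValueError subclasses for moves that are
        # malformed, ambiguous, or illegal in the current position.
        try:
            board.parse_san(san)
            return True
        except ValueError:
            return False

    print(is_legal_san(board, "Nf3"))  # True: a legal knight move
    print(is_legal_san(board, "Nf4"))  # False: no knight can reach f4 here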

replies(2): >>43710807 #>>43710906 #
1. simonw ◴[] No.43710807{4}[source]
Saying programming is a task that is "never going to be possible" for an LLM is a big claim, given how many people have derived huge value from having LLMs write code for them over the past two years.

(Unless you're arguing against the idea that LLMs are making programmers obsolete, in which case I fully agree with you.)

replies(1): >>43716355 #
2. sceptic123 ◴[] No.43716355[source]
I think "useful as an assistant for coding" and "being able to program" are two different things.

When I was trying to understand what is happening with hallucination, GPT gave me this:

> It's called hallucinating when LLMs get things wrong because the model generates content that sounds plausible but is factually incorrect or made-up—similar to how a person might "see" or "experience" things that aren't real during a hallucination.

From that we can see that they fundamentally don't know what is correct. While they can get better at predicting correct answers, no-one has explained how they are expected to cross the boundary from "sounding plausible" to "knowing they are factually correct". All the attempts so far seem to be about reducing the likelihood of hallucination, not fixing the problem that they fundamentally don't understand what they are saying.

Until/unless they are able to understand the output well enough to verify the truth, there's a knowledge gap that seems dangerous given how much code we are allowing "AI" to write.

replies(1): >>43717015 #
3. simonw ◴[] No.43717015[source]
Code is one of the few applications of LLMs where they DO have a mechanism for verifying if what they produced is correct: they can write code, run that code, look at the output and iterate in a loop until it does what it's supposed to do.
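
A rough sketch of that loop (generate_code below is a hypothetical stand-in for whatever model call you're using; the rest is ordinary subprocess plumbing):

    import os
    import subprocess
    import tempfile

    def generate_code(prompt):
        # Placeholder: in practice this would call whatever LLM API you use.
        raise NotImplementedError

    def run_candidate(source):
        # Write the candidate code to a temp file, run it, and capture output.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        try:
            result = subprocess.run(["python", path], capture_output=True, text=True)
            return result.returncode == 0, result.stdout + result.stderr
        finally:
            os.unlink(path)

    def solve(prompt, max_attempts=5):
        feedback = ""
        for _ in range(max_attempts):
            source = generate_code(prompt + feedback)  # hypothetical LLM call
            ok, output = run_candidate(source)
            if ok:
                return source  # ran cleanly; accept it (or hand to a human)
            feedback = "\n\nThe previous attempt failed with:\n" + output
        return None  # give up after max_attempts

In practice you'd run a test suite rather than just executing the script, but the shape of the loop is the same.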
replies(1): >>43761392 #
4. sceptic123 ◴[] No.43761392{3}[source]
But that requires code that is runnable and testable in isolation, otherwise there are all sorts of issues with that approach (aside from the obvious one of scalability).

It also assumes they "understand" enough to be able to extract the correct output to test against.