265 points ctoth | source
sejje ◴[] No.43744995[source]
In the last example (the riddle), I generally assume the AI isn't misreading; rather, it assumes you didn't transcribe the riddle correctly, because it has already seen the original.

I would do the same thing, I think. It's too well-known.

The variation doesn't read like a riddle at all, so it's confusing even to me as a human. I can't find the riddle part. Maybe the AI is confused, too. I think it makes an okay assumption.

I guess it would be nice if the AI asked a follow-up question like "are you sure you wrote down the riddle correctly?" I think it could if instructed to, but right now they don't generally do that on their own.
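
To make "if instructed to" concrete, here's a minimal sketch of a system prompt that nudges the model to sanity-check near-miss riddles before answering. I'm assuming the OpenAI chat completions API here; the model name, prompt wording, and user message are just illustrations, not anything canonical:

    // Minimal sketch: steer the model toward clarifying questions via the
    // system prompt. Assumes the OpenAI chat completions API and Node 18+
    // (global fetch, top-level await in an ESM module).
    const response = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      },
      body: JSON.stringify({
        model: "gpt-4o", // placeholder model name
        messages: [
          {
            role: "system",
            content:
              "If the user's input resembles a well-known riddle or puzzle " +
              "but differs from the canonical wording, do not assume a typo. " +
              "Ask a follow-up question such as 'Are you sure you wrote the " +
              "riddle down correctly?' before committing to an answer.",
          },
          { role: "user", content: "A variation on a classic riddle goes here." },
        ],
      }),
    });
    const data = await response.json();
    console.log(data.choices[0].message.content);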

replies(5): >>43745113 #>>43746264 #>>43747336 #>>43747621 #>>43751793 #
moffkalast ◴[] No.43746264[source]
Yeah, you need specific instruction tuning for that sort of thing. Claude Opus is one of the rare models that does that kind of sanity check fairly often and even admits when it doesn't know something.

These days it's all about confidently bullshitting on benchmarks and overfitting on common riddles to make pointless numbers go up. The more impressive models get on paper, the more rubbish they are in practice.

replies(2): >>43746913 #>>43750499 #
pants2 ◴[] No.43746913[source]
Gemini 2.5 is actually pretty good at this. It's the only model that has ever told me "no" to a request in Cursor.

I asked it to add WebSocket support to my app, and it responded with something like, "Looks like you're using long polling now. That's actually better and simpler. Let's leave it how it is."

I was genuinely amazed.
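
For anyone wondering why the model might call long polling "better and simpler": the whole pattern is one plain HTTP loop, with no handshake or reconnect state to maintain. A minimal sketch of the client side, with a hypothetical /api/updates endpoint:

    // Minimal long-polling client sketch (the /api/updates endpoint and its
    // response shape are hypothetical). The server holds each request open
    // until it has news or a timeout elapses; the client just loops.
    async function pollForUpdates(lastSeenId: number): Promise<void> {
      while (true) {
        try {
          const res = await fetch(`/api/updates?after=${lastSeenId}`);
          if (res.status === 200) {
            const events: { id: number; payload: string }[] = await res.json();
            for (const event of events) {
              console.log("update:", event.payload);
              lastSeenId = Math.max(lastSeenId, event.id);
            }
          }
          // A 204 here would mean the server's wait timed out; just re-poll.
        } catch {
          // Network hiccup: back off briefly before retrying.
          await new Promise((resolve) => setTimeout(resolve, 1000));
        }
      }
    }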