
265 points ctoth | 2 comments
low_tech_love ◴[] No.43746326[source]
The first thing I want AGI to do is to be able to tell me when it doesn’t know something, or when it’s not certain, so it at least gives me a heads-up to set expectations correctly. I ran my own personal “benchmark” on Gemini 2.5 and it failed just like all the others. I told it that I was playing an old point-and-click adventure game from the mid-90s and was stuck on a certain part, and asked for spoiler-light hints on what to do next. Not only can these models not give me hints, they hallucinate the game completely and invent weird, nonsensical solutions. Every single model does this. Even if I tell them to give up and just give me the solution, they come up with a non-existent one.

I wonder how hard it is to objectively use information that has been available online for 30 years. But the worst part is how it lies and pretends it knows what it’s talking about, and when you point that out, it simply turns in another direction and starts lying again. Maybe this use case is not the main focus of modern AI; maybe modern AI is about generating slop that does not require verification, because it’s “new” content. But to me it just sounds like believable slop, not AGI.

replies(1): >>43746351 #
irthomasthomas ◴[] No.43746351[source]
Here's an example of how my agent handles this:

Gathering context for user request...

Context gathering - Attempting to answer question via LLM: Are there existing Conversation classes in the ecosystem this should extend?
Context gathering - LLM provided answer: "No"

Context gathering - Attempting to answer question via LLM: How should model selection work when continuing a previous conversation?
Context gathering - LLM answer was UNKNOWN, asking user.
Asking user: How should model selection work when continuing a previous conversation?

Context gathering - received user response to question: "How should model selection work when continuing a previous conversation?"

Context gathering - finished processing all user questions
Context gathering - processing command executions...
Context gathering - executing command: sqlite3 $(find . -name llm_conversations.db) .tables

Context gathering - command execution completed

Context gathering - executing command: grep -r Conversation tests/

Context gathering - command execution completed

Context gathering - executing command: grep -h conversation_id *py
Context gathering - command execution completed
Context gathering - finished processing all commands
Analyzing task complexity and requirements...

DEBUG: reasoning_model: openrouter/google/gemini-2.5-pro-preview-03-25
Task classified as coding (confidence: 1.0)
Task difficulty score: 98.01339999999999/100
Selected primary reasoning model: claude-3.7-sonnet
get_reasoning_assistance:[:214: integer expression expected: 98.01339999999999
Reasoning assistance completed in 39 seconds
Calling LLM with model: claude-3.7-sonnet
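
Roughly, the routing step at the end of that log looks like this. The function name and thresholds below are simplified stand-ins for illustration, not the exact code; the score is kept as a float, which is what the shell's "integer expression expected" error above trips over.

    def select_primary_model(category: str, confidence: float, difficulty: float) -> str:
        """Pick a reasoning model from the task classification and difficulty score."""
        # Keep the score as a float; feeding "98.0133..." to a shell [ -gt ] test
        # is what triggers the "integer expression expected" error in the log.
        if category == "coding" and confidence >= 0.9 and difficulty >= 90.0:
            return "claude-3.7-sonnet"
        return "openrouter/google/gemini-2.5-pro-preview-03-25"

    # The run in the log: coding task, confidence 1.0, difficulty ~98/100.
    print(select_primary_model("coding", 1.0, 98.0134))  # -> claude-3.7-sonnet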

replies(1): >>43749716 #
1. low_tech_love ◴[] No.43749716[source]
I’m sorry, but I have no idea what you mean or what you’ve written…!
replies(1): >>43750326 #
2. irthomasthomas ◴[] No.43750326[source]
> The first thing I want AGI to do is to be able to tell me when it doesn’t know something,

In my demo, the LLM agent asks follow-up questions to understand the user's problem. It first attempts to answer those questions itself, using the gathered context and function calling. When a question cannot be answered that way, it is forwarded to the user. In other words, it tells you when it doesn't know something.
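
The flow is roughly this sketch; ask_llm here is a hypothetical stand-in for the real model call, which prompts a model with the gathered context and function calling enabled, not the actual agent source.

    def ask_llm(question: str, context: str) -> str:
        # Stand-in for the real call; the model returns "UNKNOWN" when it
        # cannot answer the question from the context it has.
        return "UNKNOWN"

    def gather_answers(questions: list[str], context: str) -> dict[str, str]:
        """Try to answer each follow-up question; forward the rest to the user."""
        answers = {}
        for question in questions:
            answer = ask_llm(question, context)
            if answer.strip().upper() == "UNKNOWN":
                # The model admits it doesn't know, so the question goes to the user.
                answer = input(f"Asking user: {question}\n> ")
            answers[question] = answer
        return answers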