
579 points paulpauper | 10 comments
1. gundmc ◴[] No.43603886[source]
This was published the day before Gemini 2.5 was released. I'd be interested to know whether they see any difference with that model. Anecdotally, that's the first model that really made me go wow and made a big difference for my productivity.
replies(4): >>43603928 #>>43603961 #>>43604159 #>>43610218 #
2. jonahx ◴[] No.43603928[source]
I doubt it. It still flails miserably like the other models on anything remotely hard, even with plenty of human coaxing. For example, try to get it to solve: https://www.janestreet.com/puzzles/hall-of-mirrors-3-index/
replies(2): >>43604005 #>>43604027 #
3. georgemcbay ◴[] No.43603961[source]
As someone who was wildly disappointed by the hype around Claude 3.7, I think Gemini 2.5 is easily the best programmer-assistant LLM available.

But it still feels more like a small incremental improvement rather than a radical change, and I still feel its limitations constantly.

Like... it gives me the sort of decent but uninspired solution I'd expect, without first taking a bunch of obvious wrong turns that I have to correct one by one, the way earlier models did.

And that's certainly not nothing and makes the experience of using it much nicer, but I'm still going to roll my eyes anytime someone suggests that LLMs are the clear path to imminently available AGI.

replies(1): >>43610584 #
4. Xenoamorphous ◴[] No.43604005[source]
I’d say the average person wouldn’t understand that problem, let alone solve it.
5. flutas ◴[] No.43604027[source]
FWIW, 2.5-exp was the only model that got a problem I posed right; Claude 3.7, o1, and the other free models in Cursor all failed.

It was reverse engineering ~550MB of Hermes bytecode from a React Native app, with each function split into a separate file for grep-ability and LLM compatibility.

The others would all start off right, then quickly fall back to just grepping randomly for whatever they expected to find, which failed fast. 2.5 traced the function all the way back to the networking call and provided the expected response payload.

All the others hallucinated the networking response I was trying to figure out. 2.5 provided it exactly, in enough detail for me to intercept the request and use the response it predicted to get what I wanted to show up.

replies(1): >>43604169 #
6. usaar333 ◴[] No.43604159[source]
Ya, I find it hard to imagine this aging well. Gemini 2.5 solved (or at least handled much better than other models) multiple real-world systems questions I've had in the past. Its visual reasoning also jumped significantly on charts (e.g. planning around train schedules).

Even Sonnet 3.7 was able to do refactoring work on my codebase that Sonnet 3.6 could not.

Really not seeing the "LLMs not improving" story.

7. arkmm ◴[] No.43604169{3}[source]
How did you fit 550MB of bytecode into the context window? Was this using 2.5 in an agentic framework? (i.e. repeated model calls and tool usage)
replies(1): >>43606152 #
8. flutas ◴[] No.43606152{4}[source]
I manually pre-parsed the bytecode file with awk into a bazillion individual files, each containing just one function, and hinted that it should grep through them. This was all done in Cursor.

    awk '/^=> \[Function #/ {
        # new function header, e.g. "=> [Function #123 ...]": start a new output file
        if (out) close(out);
        fn = $0; sub(/^.*#/, "", fn); sub(/ .*/, "", fn);  # isolate the function number
        out = "function_" fn ".txt"
    }
    { if (out) print > out }' bundle.hasm  # every line goes to the current function file
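
If you want to sanity-check the split before handing it to the model, something like this works (the grep patterns here are just illustrative guesses, not from the actual app):

    # how many per-function files the awk pass produced
    ls function_*.txt | wc -l
    # shortlist likely networking functions to point the model at first
    grep -lE 'fetch|XMLHttpRequest|http' function_*.txt | head
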
Quick example of the output it gave and its process:

https://i.imgur.com/Cmg4KK1.png

https://i.imgur.com/ApNxUkB.png

9. ponorin ◴[] No.43610218[source]
There's somehow this belief that "newer models will disprove <insert LLM criticism here>", despite the "newer" models being... just a scaled-up version of a previous model, or some ancillary features tacked on. An LLM is an LLM is an LLM: I'll believe it when I see otherwise.
10. dimitri-vs ◴[] No.43610584[source]
This is exactly my sentiment. Sonnet-3.5-latest was the perfect code companion: it wrote just the right amount of okay-quality code, and its real strength was that it genuinely tried to adhere to your instructions. Sonnet-3.7 was the exact opposite: it wrote waaay too much code and overengineered things like crazy, while having very poor instruction adherence. Gemini 2.5 Pro is basically what I hoped Sonnet-3.7 would be: follows instructions well but is still softly opinionated, with a massive (usable) context window, fast responses, a bias towards the latest best practices, and an up-to-date knowledge cutoff.

I'm wondering how much of Gemini 2.5 being "amazing" comes from Sonnet-3.7 being such a disappointment.