Recent AI model progress feels mostly like bullshit

1. joelthelion ◴[06 Apr 25 19:17 UTC] No.43604124[source]▶

I used Gemini 2.5 this weekend with aider and it was frighteningly good.

It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.

replies(3): >>43604165 #>>43604223 #>>43609819 #

2. jchw ◴[06 Apr 25 19:23 UTC] No.43604165[source]▶

>>43604124 (TP) #

In my experience, overall quality with Gemini 2.5 is not much better than with Gemini 2. Gemini 2 was already really good, but just like Claude 3.7, Gemini 2.5 takes some steps forward and some steps backwards. It sometimes generates really verbose code even when you tell it to be succinct. I am pretty confident that if you evaluate 2.5 for a bit longer, you'll eventually come to the same conclusion.

3. mountainriver ◴[06 Apr 25 19:32 UTC] No.43604223[source]▶

>>43604124 (TP) #

Yep, and what they are doing in Cursor with the agentic stuff is really game changing.

People who can't recognize this intentionally have their heads in the sand.

replies(2): >>43604648 #>>43610461 #

4. InkCanon ◴[06 Apr 25 20:21 UTC] No.43604648[source]▶

>>43604223 #

People are fundamentally asking two different questions when they talk about AI "importance": AI's utility and AI's "intelligence". The line between the two has to be drawn carefully.

1) AI undoubtedly has utility. In many agentic uses, it has very significant utility. There's absolute utility and perceived utility, which is more about user experience. In absolute utility, it is likely git is the single most game changing piece of software there is. It is likely git has saved some ten, maybe eleven digit number in engineer hours times salary through how it enables massive teams to work together seamlessly. In user experience, AI is amazing because it can generate so much so quickly. But it is very far from an engineer. For example, recently I tried to use Cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious. A lot of it had to do with NextJS 15 and the breaking changes it introduces in cookies and auth. It's quite clear that if the training data contains masses of NextJS code, disproportionately from older versions and none of it labeled well with versions, it messes up the LLM. Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations. They work well on problems that have massive data (like front end) and that don't require much principled understanding (like understanding how NextJS 15 would break so and so's auth). Another example of this is when I tried to use it to generate flags for a V8 build: it failed horribly and would simply hallucinate flags all the time. Despite the existence of a list of V8 flags online, this seemed very likely to be because many flags had very close representations in vector embeddings, and because there was close to zero data or detailed examples on their use.

2) On the more theoretical side, the performance of LLMs on benchmarks (claiming to be elite IMO solvers, competitive programming solvers) has become incredibly suspicious. When the new USAMO 2025 was released, the highest score was 5%, despite claims a year ago that SOTA was at least at IMO silver medal level. This is against the backdrop of exponentially growing compute and data being fed in. Combined with apparently diminishing returns, this suggests that the gains from that scaling are running really thin.

5. heresie-dabord ◴[07 Apr 25 10:40 UTC] No.43609819[source]▶

>>43604124 (TP) #

> It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.

Even approximations must be right to be meaningful. If information is wrong, it's rubbish.

Presorting/labelling various data has value. Humans have done the real work there.

What is "leading" us at present are the exaggerated valuations of corporations. You/we are in a bubble, working to justify the bubble.

Until a tool is reliable, it is not installed where people can get hurt. Unless we have revised our concern for people.

6. dimitri-vs ◴[07 Apr 25 12:20 UTC] No.43610461[source]▶

>>43604223 #

I guess you haven't been on /r/cursor or forum.cursor.com lately?

"Game changing" isn't exactly the sentiment there over the last couple of months.