
579 points paulpauper | 3 comments
joelthelion ◴[] No.43604124[source]
I've used gemini 2.5 this weekend with aider and it was frighteningly good.

It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.

replies(3): >>43604165 #>>43604223 #>>43609819 #
1. mountainriver ◴[] No.43604223[source]
Yep, and what they're doing in Cursor with the agentic stuff is really game changing.

People who can't recognize this are intentionally keeping their heads in the sand

replies(2): >>43604648 #>>43610461 #
2. InkCanon ◴[] No.43604648[source]
People are really asking two fundamentally different questions when they talk about AI's "importance": one about AI's utility and one about AI's "intelligence". There's a careful line to draw between the two.

1) AI undoubtedly has utility, and in many agentic uses it has very significant utility. But there's a distinction between absolute utility and perceived utility, which is more a matter of user experience. In absolute utility, git is likely the single most game-changing piece of software there is: it has probably saved a ten-, maybe eleven-digit figure in engineer-hours times salary through how it lets massive teams work together almost seamlessly. In user experience, AI feels amazing because it can generate so much so quickly. But it is very far from an engineer.

For example, I recently tried to use Cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious: a lot of it had to do with NextJS 15 and the breaking changes it introduces around cookies and auth. If the training data contains masses of NextJS code, disproportionately from older versions and rarely labeled with a version, that confuses the LLM (a concrete sketch of this is at the end of the comment). Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations: they work well on problems with massive amounts of data (like front end) that don't require much principled understanding (like understanding how NextJS 15 would break so-and-so's auth). Another example: when I tried to use one to generate flags for a V8 build, it failed horribly and simply hallucinated flags all the time. Very likely this is because (despite a list of V8 flags existing online) many flags have very similar representations in vector embeddings, and there is close to zero data or detailed examples of their use.

2) On the more theoretical side, the performance of LLMs on benchmarks (the claims of elite IMO-level and competitive-programming-level solving) has become incredibly suspicious. When the new USAMO 2025 problems were released, the highest score was 5%, despite claims a year ago that SOTA was at least at IMO silver level. This is against a backdrop of exponentially more compute and data being fed in; combined with apparently diminishing returns, it suggests the gains from scaling are running really thin.
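
To make the NextJS point in 1) concrete, here is a minimal sketch (a hypothetical route handler, not my actual project code) of the kind of NextJS 15 breaking change that kept tripping the model up: cookies() from next/headers became async in 15, while most of the NextJS code in the training data still shows the older synchronous call.

    // Hypothetical Next.js 15 route handler (e.g. app/api/session/route.ts).
    // A sketch only: in Next.js 15, cookies() returns a Promise.
    import { cookies } from "next/headers";

    export async function GET() {
      // Next.js 14 pattern that models trained on older code keep emitting;
      // in 15 this is deprecated and a type error:
      //   const session = cookies().get("session");

      // Next.js 15 pattern: await the cookie store first.
      const cookieStore = await cookies();
      const session = cookieStore.get("session");

      return Response.json({ authenticated: Boolean(session) });
    }

A model that mixes these two forms in one codebase produces exactly the kind of error loop described above.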

3. dimitri-vs ◴[] No.43610461[source]
I guess you haven't been on /r/cursor or forum.cursor.com lately?

"game changing" isn't exactly the sentiment there the last couple months.