579 points paulpauper | 25 comments
1. lukev ◴[] No.43604244[source]
This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven.

I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.

But some of us are going to end up right and some of us are going to end up wrong, and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.

replies(10): >>43604396 #>>43604472 #>>43604738 #>>43604923 #>>43605009 #>>43605865 #>>43606458 #>>43608665 #>>43609144 #>>43612137 #
2. lherron ◴[] No.43604396[source]
Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal evidence for some time to come.

I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.

I am hopeful the coming waves of vertical integration/guardrails/grounding applications will move us away from having to hop between models every few weeks.

replies(1): >>43604540 #
3. FiniteIntegral ◴[] No.43604472[source]
It's not surprising that responses are anecdotal. The easiest way to communicate a general sentiment is usually to be brief.

A majority of what makes a "better AI" can be condensed to how effective the gradient-descent algorithms are at reaching the optima we want them to reach. Until a generative model shows actual progress at "making decisions", it will forever be seen as a glorified linear-algebra solver. Generative machine learning is all about giving a pleasing answer to the end user, not about creating something on the level of human decision making.
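To be concrete about the "reaching optima" part, here is a toy illustration (a made-up one-dimensional loss, nothing like a real training loop) of gradient descent settling at an optimum:

  # Toy sketch only: gradient descent on f(x) = (x - 3)^2.
  # The loss, starting point, and learning rate are invented for illustration.
  def grad(x):
      return 2 * (x - 3)          # derivative of (x - 3)^2

  x, lr = 0.0, 0.1
  for _ in range(100):
      x -= lr * grad(x)           # step against the gradient ("downhill")

  print(round(x, 4))              # ~3.0, the optimum the updates converge to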

replies(1): >>43608159 #
4. InkCanon ◴[] No.43604540[source]
Frankly, the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claims of silver/gold-level performance on the IMO. And for ARC-AGI, one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC-AGI questions and train on those.
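To make that kind of gaming concrete, here is a rough, purely illustrative sketch of the approach: hand-write a few grid-transformation rules and mass-produce ARC-style input/output pairs from them. The rules and format below are invented for illustration; real attempts use far richer rule sets.

  # Illustrative sketch: synthesizing ARC-style training pairs from
  # hand-written transformation rules (rules and format invented here).
  import random

  def random_grid(h, w):
      return [[random.randrange(10) for _ in range(w)] for _ in range(h)]

  RULES = {
      "flip_horizontal": lambda g: [row[::-1] for row in g],
      "transpose":       lambda g: [list(col) for col in zip(*g)],
      "recolor_1_to_2":  lambda g: [[2 if c == 1 else c for c in row] for row in g],
  }

  def synthetic_example():
      name, rule = random.choice(list(RULES.items()))
      grid = random_grid(random.randint(3, 6), random.randint(3, 6))
      return {"rule": name, "input": grid, "output": rule(grid)}

  dataset = [synthetic_example() for _ in range(10_000)]  # then fine-tune on this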
5. ◴[] No.43604738[source]
6. nialv7 ◴[] No.43604923[source]
Good observation but also somewhat trivial. We are not omniscient gods, ultimately all our opinions and decisions will have to be based on our own limited experiences.
7. freehorse ◴[] No.43605009[source]
There is nothing wrong with sharing anecdotal experiences. Reading through the anecdotal experiences here can help you understand how relatable (or not) your own experience is. Moreover, if I have experience X, it could help to know whether that is because I am doing something wrong that others have figured out.

Furthermore, as we are talking about the actual impact of LLMs, which is the point of the article, a bunch of anecdotal experiences may be more valuable than a bunch of benchmarks for figuring that out. Also, apart from the right/wrong dichotomy, people use LLMs with different goals and in different contexts. It does not necessarily mean that some people are doing something wrong if they do not see the same impact as others. Every time a web developer says they do not understand how others can be so skeptical of LLMs, concludes with certainty that the skeptics must be doing something wrong, and moves on to explain how to actually use LLMs properly, I chuckle.

replies(1): >>43605998 #
8. aunty_helen ◴[] No.43605865[source]
That’s a good point, the comments section is very anecdotal. Do you have any data to say if this is a common occurrence or specific to this topic?
9. otterley ◴[] No.43605998[source]
Indeed, there’s nothing at all wrong with sharing anecdotes. The problem is when people make broad assumptions and conclusions based solely on personal experience, which unfortunately happens all too often. Doing so is wired into our brains, though, and we have to work very consciously to intercept our survival instincts.
replies(2): >>43607663 #>>43610691 #
10. throwanem ◴[] No.43606458[source]
> I'm really curious what features signal an ability to make "better choices" w/r/t AI

So am I. If you promise you'll tell me after you time travel to the future and find out, I'll promise you the same in return.

11. droopyEyelids ◴[] No.43607663{3}[source]
I think you might be caught up in a bit of the rationalist delusion.

People -only!- draw conclusions based on personal experience. At best you have personal experience with truly objective evidence gathered in a statistically valid manner.

But that only happens in a few vanishingly rare circumstances here on earth. And wherever it happens, people are driven to subvert the evidence gathering process.

Often “working against your instincts” to be more rational only means more time spent choosing which unreliable evidence to concoct a belief from.

replies(1): >>43608908 #
12. code_biologist ◴[] No.43608159[source]
At the risk of being annoying: answers that feel like high-quality human decision-making are extremely pleasing and desirable. In the same way, image generators aren't producing six-fingered hands because they think that's more pleasing; they're doing it because they're trying to please and aren't good enough yet.

I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember a run with early GPT-4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models were terrible and hallucinated aggressively (4, 4o, 4.5, o3-mini, o3-mini-high), with the exception of o1. DeepSeek R1 hallucinated only mildly but gave bad answers. Gemini 2.5 was the only flagship model that did not hallucinate, and it gave some decent answers.

I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about should have been in their training datasets.

13. KolibriFly ◴[] No.43608665[source]
Totally agree... this space is still so new and unpredictable that everyone is operating off vibes, gut instinct, and whatever personal anecdotes they've collected. We're all sort of fumbling around in the dark, trying to reverse-engineer the flashlight.
14. otterley ◴[] No.43608908{4}[source]
I'm not sure where you got all this from. Do you have any useful citations?
15. dsign ◴[] No.43609144[source]
You want to block subjectivity? Write some formulas.

There are three questions to consider:

a) Have we, without any reasonable doubt, hit a wall for AI development? Emphasis on "reasonable doubt". There is no reasonable doubt that the Earth is roughly spherical. That level of certainty.

b) Depending on your answer for (a), the next question to consider is if we the humans have motivations to continue developing AI.

c) And then the last question: will AI continue improving?

If taken as boolean values, (a), (b) and (c) have a truth table with eight rows, the most interesting row being false, true, true: "(not a) and b => c". Note the implication sign, "=>". Give some values to (a) and (b), and you get a value for (c).

There are more variables you can add to your formula, but I'll abstain from giving any silly examples. I, however, think that the row (false, true, false) implied by many commentators is just fear and denial. Fear is justified, but denial doesn't help.
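For anyone who wants to check that mechanically, a tiny sketch enumerating the eight rows and evaluating the implication, with p => q written as (not p) or q:

  # Enumerate all eight (a, b, c) rows and evaluate ((not a) and b) => c.
  from itertools import product

  for a, b, c in product([False, True], repeat=3):
      premise = (not a) and b
      holds = (not premise) or c          # p => q  ==  (not p) or q
      print(a, b, c, holds)

  # (False, True, True) satisfies the formula; (False, True, False) is the
  # only row with a true premise that falsifies it.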

replies(3): >>43613592 #>>43620474 #>>43630306 #
16. freehorse ◴[] No.43610691{3}[source]
People "make conclusions" because they have to take decisions day to day. We cannot wait for the perfect bulletproof evidence before that. Data is useful to take into account, but if I try to use X llm that has some perfect objective benchmark backing it, while I cannot make it be useful to me while Y llm has better results, it would be stupid not to base my decision on my anecdotal experience. Or vice versa, if I have a great workflow with llms, it may be not make sense to drop it because some others may think that llms don't work.

In the absence of actually good evidence, anecdotal data may be the best we can get right now. The point, imo, is to try to understand why some anecdotes contradict each other, which is mostly due to contextual factors that may not be very clear, and to be flexible enough to change priors/conclusions when something in the current situation changes.

replies(1): >>43614015 #
17. ramesh31 ◴[] No.43612137[source]
>"This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven."

People having vastly different opinions on AI simply comes down to token usage. If you are using millions of tokens on a regular basis, you completely understand the revolutionary point we are at. If you are just chatting back and forth a bit with something here and there, you'll never see it.

replies(2): >>43612388 #>>43616322 #
18. antonvs ◴[] No.43612388[source]
It's a tool and like all tools, it's sensitive to how you use it, and it's better for some purposes than others.

Someone who lacks experience, skill, training, or even the ability to evaluate results may try to use a tool and blame the tool when it doesn't give good results.

That said, the hype around LLMs certainly overstates their capabilities.

19. lukev ◴[] No.43613592[source]
Invalid expression: value of type "probability distribution" cannot be cast to type "boolean".
20. otterley ◴[] No.43614015{4}[source]
Agreed 100%. When insufficient data exists, you have to fall back on other sources like analogies, personal observations, secondhand knowledge, etc. However, I’ve seen too many instances of people claiming their own limited experience is the truth when overwhelming, easily attainable evidence exists that proves it to be false.
21. lukev ◴[] No.43616322[source]
So this is interesting because it's anecdotal (I presume you're a high-token user who believes it's revolutionary), but it's actually a measurable, falsifiable hypothesis in principle.

I'd love to see a survey from a major LLM API provider that correlated LLM spend (and/or tokens) with optimism for future transformativity. Correlation with a view of "current utility" would be a tautology, obviously.

I actually have the opposite intuition from you: I suspect the people using the most tokens are using it for very well-defined tasks that it's good at _now_ (entity extraction, classification, etc) and have an uncorrelated position on future potential. Full disclosure, I'm in that camp.
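To show what testing it would actually take, here is a sketch with entirely made-up numbers of the correlation such a survey would boil down to:

  # Hypothetical survey data, invented purely to illustrate the test:
  # monthly token usage vs. optimism about future impact (1-5 scale).
  from statistics import correlation      # Python 3.10+

  tokens_per_month = [5e4, 2e6, 8e6, 1e5, 4e7, 3e5]
  optimism_score   = [2,   4,   3,   2,   5,   3]

  print(correlation(tokens_per_month, optimism_score))   # Pearson r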

replies(1): >>43617460 #
22. ramesh31 ◴[] No.43617460{3}[source]
By token usage I mean via agentic processes. Essentially every gripe about LLMs over the last few years (hallucinations, lack of real-time data, etc.) was a result of single-shot prompting directly against models. No one is seriously doing that for anything anymore. Yes, you spend ten times more on a task, and it takes much longer. But your results are meaningful and useful at the end, and you can actually begin to engineer systems on top of that now.
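By way of illustration, a bare-bones sketch of that kind of loop; call_llm and run_tool are toy stand-ins, not any particular framework's API:

  # Bare-bones agentic loop: instead of one prompt -> one answer, the model
  # proposes tool calls, sees their results, and iterates until it is done.
  def call_llm(messages):                  # toy stand-in for a real model call
      if not any(m["role"] == "tool" for m in messages):
          return {"role": "assistant", "tool_call": {"name": "search", "args": {"q": "example"}}}
      return {"role": "assistant", "content": "final answer based on tool results"}

  def run_tool(name, args):                # toy stand-in for search, code exec, etc.
      return f"result of {name}({args})"

  def agent(task, max_steps=10):
      messages = [{"role": "user", "content": task}]
      for _ in range(max_steps):
          reply = call_llm(messages)       # model decides the next action
          messages.append(reply)
          if "tool_call" in reply:
              call = reply["tool_call"]
              messages.append({"role": "tool", "content": run_tool(call["name"], call["args"])})
          else:
              return reply["content"]      # no more tool calls: done

  print(agent("compare characters between the game and the show"))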
23. pdimitar ◴[] No.43620474[source]
A lot of people judge by the lack of their desired outcome. Calling that fear and denial is disingenuous and unfair.
replies(1): >>43624123 #
24. dsign ◴[] No.43624123{3}[source]
That's actually a valid point. I stand corrected.
25. namaria ◴[] No.43630306[source]
If you're gonna formulate this conversation as a satisfiability problem, you should be aware that satisfiability is NP-complete (and working on that problem is actually the source of the insight that there is such a thing as NP-completeness).