Recent AI model progress feels mostly like bullshit

(www.lesswrong.com)

579 points paulpauper | 1 comments | 06 Apr 25 18:01 UTC

iambateman ◴[06 Apr 25 19:34 UTC] No.43604241[source]▶

The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no”, or at least not as good as a programmer would hope.

When you ask it a question, it tends to say yes.

So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.

The real challenge is that LLMs fundamentally want to seem agreeable, and that’s not improving. So even if the model gets an extra 5/100 math problems right, it feels about the same in a series of prompts which are more complicated than just a ChatGPT scenario.

I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.

I totally agree that the core benchmarks that matter should be ones which evaluate a model in an agentic scenario, not just on the basis of individual responses.

replies(5): >>43605173 #>>43607461 #>>43608679 #>>43612148 #>>43612608 #

boesboes ◴[07 Apr 25 15:27 UTC] No.43612608[source]▶

>>43604241 #

This rings true. What I notice is that the longer I let Claude work on some code, for instance, the more bullshit it invents. I can usually delete about 50-60% of the code & tests it came up with.

And when you ask it to 'just write a test', 50/50 it will try to run it, fail on some trivial issues, delete 90% of your test code and start to loop deeper and deeper into the rabbit hole of its own hallucinations.

Or maybe I just suck at prompting hehe

replies(1): >>43630164 #

1. namaria ◴[09 Apr 25 08:49 UTC] No.43630164[source]▶

>>43612608 #

> Or maybe I just suck at prompting hehe

Every time someone argues for the utility of LLMs in software development by saying you need to be better at prompting, or add more rules for the LLM on the repository, they are making an argument against using NLP in software development.

The whole point of code is that it is a way to be very specific and exact and to exercise control over the computer's behavior. The entire value proposition of using an LLM is that it is easier because you don't need to be so specific and exact. If you then say you need to be more specific and exact with the prompting, you are slowly getting at the fact that using NLP for coding is a bad idea.
