
579 points paulpauper | 3 comments
iambateman ◴[] No.43604241[source]
The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It's not very good at saying "no", or at least not as good at it as a programmer would hope.

When you ask it a question, it tends to say yes.

So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.

The real challenge is that LLMs fundamentally want to seem agreeable, and that's not improving. So even if a model gets an extra 5/100 math problems right, it feels about the same in any series of prompts more complicated than a single ChatGPT-style question.

I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.

I totally agree that the core benchmarks that matter should be ones that evaluate a model in agentic scenarios, not just on the basis of individual responses.
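
For a sense of what that could look like, here is a minimal sketch of a multi-turn evaluation loop in Python. Everything in it is hypothetical (`ask_model` and the `task` object stand in for a real harness); the point is that the score covers the whole trajectory rather than individual replies:

    def evaluate_agentic(ask_model, task):
        # Hypothetical harness: grade the whole exchange, not each reply.
        transcript = []
        prompt = task.initial_prompt
        for _ in range(task.max_steps):
            reply = ask_model(history=transcript, prompt=prompt)
            transcript.append((prompt, reply))
            if task.is_done(reply):
                break
            prompt = task.next_prompt(reply)  # follow-ups depend on earlier answers
        # One score for the whole trajectory, so an agreeable wrong turn
        # early on costs every later step.
        return task.grade(transcript)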

replies(5): >>43605173 #>>43607461 #>>43608679 #>>43612148 #>>43612608 #
bluefirebrand ◴[] No.43605173[source]
> The real challenge is that LLMs fundamentally want to seem agreeable, and that's not improving

LLMs fundamentally do not want to seem anything

But the companies that are training them and making models available for professional use sure want them to seem agreeable

replies(4): >>43606451 #>>43608591 #>>43610470 #>>43616935 #
1. mrweasel ◴[] No.43610470[source]
That sounds reasonable to me, but those companies forget that there are different types of agreeable. There's the LLM approach, like the coworker who will answer all your questions about .NET but won't stop you from coding yourself into a corner, and then there's the coworker who says "Let's sit down and review what it actually is that you're doing, because you're asking a fairly large number of disjoint questions right now".

I've stopped trying to use LLMs for anything, partly out of political conviction and partly because I don't find them particularly useful for my line of work. Where I have tried various models in the past is software development, and the common failure I see is that they can't pick up on mistakes in my line of thinking, or won't point them out. Most of my problems come down to design errors or to thinking about a problem the wrong way, yet not once has an LLM told me that what I'm trying to do is a sign of a wrong or bad design. There are ways to be agreeable and still point out problems with previously made decisions.
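
One thing that can help is asking for the pushback explicitly in the system prompt. A minimal sketch, assuming the OpenAI Python client; the model name and prompt wording are illustrative, not a recommendation:

    from openai import OpenAI

    client = OpenAI()

    SYSTEM = (
        "Before answering any coding question, say whether the approach "
        "itself looks like a design mistake, and why. Disagree plainly "
        "when warranted."
    )

    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any chat model works here
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": "How do I share one mutable global "
                                        "dict between all my worker threads?"},
        ],
    )
    print(resp.choices[0].message.content)

Even then, this only raises the odds of pushback; it doesn't guarantee it.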

replies(1): >>43611495 #
2. squiggleblaz ◴[] No.43611495[source]
I think it's your responsibility to control the LLM. Sometimes I worry that I'm beginning to code myself into a corner, so I ask whether this is the dumbest idea it's ever heard, and it says there might be a better way to do it. Sometimes I'm totally sceptical and ask that question first thing. (Usually it hallucinates when I'm being really obtuse, though, and in a bad case that's the first time I notice it.)
replies(1): >>43630143 #
3. namaria ◴[] No.43630143[source]
> I think it's your responsibility to control the LLM.

Yes. The issue here is control, and natural language is a poor interface for exercising control over a computer. Code, on the other hand, is a very good one. That is the whole point of the skepticism around LLMs in software development.
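
To make that concrete, here is a minimal sketch of control-through-code with an LLM in the loop. The schema is made up purely for illustration; the point is that the constraints live in deterministic code rather than in the prompt:

    import json

    def safe_apply(model_output: str) -> dict:
        # Validate whatever the model returned before acting on it.
        data = json.loads(model_output)        # a hard failure beats polite agreement
        assert set(data) == {"name", "price"}  # schema pinned in code, not prose
        assert isinstance(data["price"], (int, float)) and data["price"] >= 0
        return data

    # e.g. safe_apply('{"name": "widget", "price": 9.5}')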