
579 points paulpauper | 1 comments
iambateman ◴[] No.43604241[source]
The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no”, or at least not as good as a programmer would hope.

When you ask it a question, it tends to say yes.

So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.

The real challenge is that LLMs fundamentally want to seem agreeable, and that’s not improving. So even if the model gets an extra 5/100 math problems right, it feels about the same in a series of prompts that is more complicated than a simple ChatGPT-style exchange.

I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.

I totally agree that the core benchmarks that matter should be ones that evaluate a model in an agentic scenario, not just on the basis of individual responses.

replies(5): >>43605173 #>>43607461 #>>43608679 #>>43612148 #>>43612608 #
bluefirebrand ◴[] No.43605173[source]
> The real challenge is that LLMs fundamentally want to seem agreeable, and that’s not improving

LLMs fundamentally do not want to seem anything

But the companies that are training them and making models available for professional use sure want them to seem agreeable

replies(4): >>43606451 #>>43608591 #>>43610470 #>>43616935 #
1. ◴[] No.43616935[source]