
579 points paulpauper | 5 comments
1. HarHarVeryFunny No.43604258
The disconnect between improved benchmark results and the lack of improvement on real-world tasks doesn't have to imply cheating. It's just a reflection of the nature of LLMs, which at the end of the day are prediction systems - language models, not cognitive architectures built for generality.

Of course, if you train an LLM heavily on narrow benchmark domains, its prediction performance will improve on those domains. But why would you expect that to improve performance in unrelated areas?

If you trained yourself extensively on advanced math, would you expect that to improve your programming ability? If not, then why would you expect it to improve the programming ability of a far less sophisticated "intelligence" (a prediction engine) such as a language model? And if you trained yourself on LeetCode problems, would you expect that to help you harden corporate production systems?
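
To make that concrete, here's a toy sketch (my own illustration, nothing to do with any production model - made-up corpora, and a character-level bigram model standing in for an LLM). Prediction improves where you trained, and nowhere else:

    # Toy illustration: "train" a character bigram model on one domain,
    # then measure how well it predicts that domain vs. a different one.
    from collections import Counter
    import math

    def bigram_model(text):
        # Add-one-smoothed bigram probabilities estimated from text.
        pairs = Counter(zip(text, text[1:]))
        unigrams = Counter(text[:-1])
        vocab_size = len(set(text))
        def prob(a, b):
            return (pairs[(a, b)] + 1) / (unigrams[a] + vocab_size)
        return prob

    def perplexity(prob, text):
        # Per-character perplexity: lower means better prediction.
        logp = sum(math.log(prob(a, b)) for a, b in zip(text, text[1:]))
        return math.exp(-logp / (len(text) - 1))

    math_corpus = "the integral of x squared dx is x cubed over three " * 40
    code_corpus = "def handle(conn): return conn.recv(1024) " * 40

    prob = bigram_model(math_corpus)
    print("in-domain perplexity:    ", perplexity(prob, math_corpus))
    print("out-of-domain perplexity:", perplexity(prob, code_corpus))
    # Training on the "math" text makes the model far better at predicting
    # math text; it barely helps at all on the "code" text.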

replies(3): >>43604700 >>43608735 >>43610107
2. InkCanon No.43604700
That's fair. But look up the recent experiment running SOTA models on the then-just-released USAMO 2025 questions. The highest score was 5%; supposedly, SOTA last year was at IMO silver level. There could be some methodological differences - e.g., the USAMO evaluation required correct proofs, not just numerical answers - but it strongly suggests that even within limited domains, there's cheating going on. I'd wager a significant amount that if you tested SOTA models on a fresh set of ICPC questions, actual performance would be far, far worse than the reported benchmarks.
replies(1): >>43605159
3. usaar333 No.43605159
> The highest score was 5%; supposedly, SOTA last year was at IMO silver level.

No LLM got silver last year. DeepMind earned that with AlphaProof and AlphaGeometry 2 - highly specialized systems, not general-purpose LLMs.

4. KolibriFly No.43608735
Your analogy is perfect. Training an LLM on math olympiad problems and then expecting it to secure enterprise software is like teaching someone chess and handing them a wrench.
5. throwawayffffas No.43610107
In my view as well, it's not really cheating - it's just overfitting.

If a model doesn't do well on the benchmarks, it will either be retrained until it does, or you won't hear about it.
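
A toy sketch of that selection effect (everything here is made up, and the labels are random, so there is genuinely nothing to learn - any gain on the benchmark is pure overfitting):

    import random
    random.seed(0)

    D, N = 30, 100  # feature dimension, examples per set

    def make_set(n):
        # Random features, labels independent of features.
        return [([random.gauss(0, 1) for _ in range(D)], random.randint(0, 1))
                for _ in range(n)]

    bench, fresh = make_set(N), make_set(N)

    def accuracy(w, data):
        return sum((sum(wi * xi for wi, xi in zip(w, x)) > 0) == bool(y)
                   for x, y in data) / len(data)

    # "Retrain until it does well": keep random tweaks to a linear model
    # only when they raise the score on the fixed benchmark.
    w = [0.0] * D
    best = accuracy(w, bench)
    for _ in range(5000):
        cand = [wi + random.gauss(0, 0.1) for wi in w]
        s = accuracy(cand, bench)
        if s > best:
            w, best = cand, s

    print("benchmark accuracy:", accuracy(w, bench))  # climbs well above 0.5
    print("fresh-set accuracy:", accuracy(w, fresh))  # stays near chance

No new information about the task ever enters the loop, yet the benchmark number keeps going up. The published score tells you about the benchmark, not the model.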