188 points gkamradt | 1 comment | source
danpalmer ◴[] No.43466879[source]
> and was the only benchmark to pinpoint the exact moment in late 2024 when AI moved beyond pure memorization

This is self-referential: the benchmark pinpointed the moment when AI went from memorization to problem solving because the benchmark requires problem solving to complete. How do we know it requires problem-solving skills? Because memorization-only LLMs can't do it but humans can.

I think ARC are producing some great benchmarks, and they probably are pushing forward the state of the art. However, I don't think they identified anything in particular with o3; at least, they don't seem to have proven a step change.

replies(1): >>43466922 #
fchollet ◴[] No.43466922[source]
The reason these tasks require fluid intelligence is because they were designed this way -- with task uniqueness/novelty as the primary goal.

ARC 1 was released long before in-context learning was identified in LLMs (and designed before Transformer-based LLMs existed), so the fact that LLMs can't do ARC was never a design consideration. It just turned out this way, which confirmed our initial assumption.

replies(2): >>43467080 #>>43467479 #
danpalmer ◴[] No.43467479[source]
Is there any confirmation of the assumptions other than the LLM behaviour? Because otherwise that still feels like circular reasoning.

I think a similar claim could be levelled against other benchmarks or LLM evaluation tasks. One could say that the Turing test was designed to assess human intelligence, and LLMs pass it, therefore LLMs have human intelligence. This is generally considered false now: we can plainly see that LLMs do not have intelligence in the same way humans do (yet? debatable, not the point), so instead we concluded that the Turing test was not the right benchmark. That's not to diminish its importance; it was hugely influential as a part of AI education, and possibly even AI development, for decades.

ARC does seem to be pushing the boundaries, I'm just not convinced that it's testing a provable step change.

replies(1): >>43469115 #
JFingleton ◴[] No.43469115{3}[source]
I'm not sure that's quite correct about the Turing test. From Wikipedia:

"Turing did not explicitly state that the Turing test could be used as a measure of "intelligence", or any other human quality. He wanted to provide a clear and understandable alternative to the word "think", which he could then use to reply to criticisms of the possibility of "thinking machines" and to suggest ways that research might move forward."