Most active commenters
  • gkamradt(6)
  • fchollet(6)
  • YeGoblynQueenne(5)
  • Chathamization(4)
  • zamadatix(3)
  • Davidzheng(3)
  • jononor(3)
  • fastball(3)

188 points by gkamradt | 103 comments
1. gkamradt ◴[] No.43465162[source]
Hey HN, Greg from ARC Prize Foundation here.

Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.

In Dec ’24, ARC-AGI-1 (first released in 2019) pinpointed the moment AI moved beyond pure memorization, as seen with OpenAI's o3.

ARC-AGI-2 targets test-time reasoning.

My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.

Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) are <4%.

Every ARC-AGI-2 task (100%), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.

Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.

Change log from ARC-AGI-1 to ARC-AGI-2:

* The two main evaluation sets (semi-private, private eval) have increased to 120 tasks
* Solving tasks requires more reasoning vs pure intuition
* Each task has been confirmed to have been solved by at least 2 people (many more) out of an average of 7 test takers in 2 attempts or less
* Non-training task sets are now difficulty-calibrated

The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) drew 1.5K participating teams and produced 40+ published research papers.

The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition

We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.

Happy to answer questions.

replies(13): >>43465254 #>>43466394 #>>43466647 #>>43467579 #>>43467810 #>>43468015 #>>43468067 #>>43468081 #>>43468268 #>>43468318 #>>43468455 #>>43468706 #>>43468931 #
2. artninja1988 ◴[] No.43465254[source]
What are you doing to prevent the test set being leaked? Will you still be offering API access to the semi private test set to the big model providers who presumably train on their API?
replies(1): >>43465360 #
3. gkamradt ◴[] No.43465360{3}[source]
We have a few sets:

1. Public Train - 1,000 tasks that are public
2. Public Eval - 120 tasks that are public

So for those two we don't have protections.

3. Semi Private Eval - 120 tasks that are exposed to 3rd parties. We sign data agreements where we can, but we understand this is exposed and not 100% secure. It's a risk we are open to in order to keep testing velocity. In theory it is very difficult to secure this 100%. The cost to create a new semi-private test set is lower than the effort needed to secure it 100%.

4. Private Eval - Only on Kaggle, not exposed to any 3rd parties at all. Very few people have access to this. Our trust vectors are with Kaggle and the internal team only.
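
For readers who haven't looked at the data: each ARC task ships as a small JSON file with "train" demonstration pairs and "test" pairs, where every grid is a list of rows of color indices 0-9. A minimal Python loading sketch (the file path is hypothetical):

    import json

    def load_arc_task(path):
        # Assumed ARC JSON layout:
        # {"train": [{"input": grid, "output": grid}, ...], "test": [...]},
        # where each grid is a list of rows of color indices 0-9.
        with open(path) as f:
            task = json.load(f)
        return task["train"], task["test"]

    # hypothetical path into a local copy of the public tasks
    train_pairs, test_pairs = load_arc_task("data/evaluation/0a1b2c3d.json")
    for pair in train_pairs:
        r, c = len(pair["input"]), len(pair["input"][0])
        print(f"{r}x{c} input -> "
              f"{len(pair['output'])}x{len(pair['output'][0])} output")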

replies(1): >>43466602 #
4. artificialprint ◴[] No.43465860[source]
Oh boy! Some of these tasks are not hard, but they require full attention and a lot of counting just to get things right! Will ARC 3 go 3D, perhaps? JK

Congrats on the launch, let's see how long it'll take to get saturated

replies(2): >>43465929 #>>43466945 #
5. fchollet ◴[] No.43465929[source]
ARC 3 is still spatially 2D, but it adds a time dimension, and it's interactive.
replies(3): >>43466406 #>>43466916 #>>43466966 #
6. gmkhf ◴[] No.43466394[source]
I think a lot of people got discouraged, seeing how openai solved arc agi 1 by what seems like brute forcing and throwing money at it. Do you believe arc was solved in the "spirit" of the challenge? Also all the open sourced solutions seem super specific to solving arc. Is this really leading us to human level AI at open ended tasks?
replies(1): >>43466786 #
7. artninja1988 ◴[] No.43466406{3}[source]
I think a lot of people got discouraged, seeing how openai solved arc agi 1 by what seems like brute forcing and throwing money at it. Do you believe arc was solved in the "spirit" of the challenge? Also all the open sourced solutions seem super specific to solving arc. Is this really leading us to human level AI at open ended tasks?
replies(2): >>43466887 #>>43467745 #
8. FergusArgyll ◴[] No.43466415[source]
I'd love to hear from the ARC guys:

These benchmarks, and specifically the constraints placed on solving them (compute etc) seem to me to incentivize the opposite of "general intelligence"

Have any of the technical contributions used to win the past competition been used to advance general AI in any way?

We have transformer based systems constantly gaining capabilities. On the other hand have any of the Kaggle submissions actually advanced the field in any way outside of the ARC Challenge?

To me (a complete outsider, admittedly) the ARC prize seems like an operationalization of the bitter lesson

replies(2): >>43466619 #>>43469318 #
9. Nesco ◴[] No.43466570[source]
At the very first glance, it's like ARC 1 with some structures serving as contextual data, and more complicated symmetries / topological transformations.

Now, I wonder what surprises are to be found in the full dataset.

The focus on solving cost efficiently discrete tasks might actually lead us towards deep learning systems that could be used reliably in production, and not just give a whoa effect or need to be constantly supervised

10. zamadatix ◴[] No.43466602{4}[source]
What prevents everything in 4 from becoming part of 3 the first time the test set is run on a proprietary model? Do you require competitors like OpenAI to provide models Kaggle can self-host for the test?
replies(1): >>43466626 #
11. gkamradt ◴[] No.43466619[source]
Good question! This was one of the main motivations for our "Paper Prize" track. We wanted to reward conceptual progress vs leaderboard chasing. In fact, when we increased the prizes mid-year, we awarded more money to the paper track vs the top score.

We had 40 papers submitted last year and 8 were awarded prizes. [1]

One of the main teams, MindsAI, just published their paper on their novel test-time fine-tuning approach. [2]

Jan/Daniel (1st place winners last year) talk all about their progress and the journey of building out their solution here [3]. Stories like theirs help push the field forward.

[1] https://arcprize.org/blog/arc-prize-2024-winners-technical-r...

[2] https://github.com/MohamedOsman1998/deep-learning-for-arc/bl...

[3] https://www.youtube.com/watch?v=mTX_sAq--zY

12. gkamradt ◴[] No.43466626{5}[source]
#4 (private test set) doesn't get used for any public model testing. It is only used on the Kaggle leaderboard where no internet access is allowed.
replies(1): >>43466669 #
13. lawrenceyan ◴[] No.43466633[source]
Concrete benchmarks like these are very valuable.

Defining the reward function, which is basically what ARC is doing, is 50% of the problem solving process.

14. synapsomorphy ◴[] No.43466647[source]
Thanks for your awesome work Greg!

The success of o3 directly contradicts us being in an "idea-constrained environment". What makes you believe that?

replies(2): >>43468917 #>>43469112 #
15. zamadatix ◴[] No.43466669{6}[source]
Sorry, I probably phrased the question poorly. My question is more along the lines of "when you already scored e.g. OpenAI's o3 on ARC AGI 2 how did you guarantee OpenAI can't just look at its server logs to see question set 4"?
replies(1): >>43466703 #
16. ipunchghosts ◴[] No.43466702[source]
The computer vision community needs a dataset like this for evaluation: train in one domain and test on another. The best we have now are the ImageNet-R and ImageNet-C datasets. Humans have no issues with domain adaptation in vision, but computer vision models still struggle in many ways, including with out-of-domain images.
17. gkamradt ◴[] No.43466703{7}[source]
Ah yes, two things

1. We had a no-data-retention agreement with them. We were assured by the highest level of their company + security division that the box our test was run on would be wiped after testing.

2. We only tested o3 against the semi-private set. We didn't test it with the private eval.

replies(3): >>43466704 #>>43467095 #>>43467449 #
18. zamadatix ◴[] No.43466704{8}[source]
Makes sense, particularly part 2 until "the final results" are needed. Thanks for taking the time to answer my question!
19. jmtulloss ◴[] No.43466786{3}[source]
Why is this the same comment as https://news.ycombinator.com/item?id=43466406?
20. momojo ◴[] No.43466798[source]
Have you had any neurologists utilize your dataset? My own reaction after solving a few of the puzzles was "Why is this so intuitive for me, but not for an LLM?".

Our human-ability to abstract things is underrated.

replies(1): >>43466902 #
21. danpalmer ◴[] No.43466879[source]
> and was the only benchmark to pinpoint the exact moment in late 2024 when AI moved beyond pure memorization

This is self-referential: the benchmark pinpointed the time when AI went from memorization to problem solving, because the benchmark requires problem solving to complete. How do we know it requires problem solving skills? Because memorization-only LLMs can't do it but humans can.

I think ARC are producing some great benchmarks, and I think they probably are pushing forward the state of the art, however I don't think they identified anything particular with o3, at least they don't seem to have proven a step change.

replies(1): >>43466922 #
22. fchollet ◴[] No.43466887{4}[source]
It's useful to know what current AI systems can achieve with unlimited test-time compute resources. Ultimately though, the "spirit of the challenge" is efficiency, which is why we're specifically looking for solutions that are at least within 1-2 orders of magnitude of cost from being competitive with humans. The Kaggle leaderboard is very resource-constrained, and on the public leaderboard you need to use less than $10,000 in compute to solve 120 tasks.
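
To put that budget in perspective as simple arithmetic (the human cost figure below is an assumed placeholder, not an ARC Prize number):

    budget_usd, tasks = 10_000, 120
    compute_per_task = budget_usd / tasks      # ~$83 of compute per task
    human_cost_per_task = 5                    # assumed placeholder, $/task
    print(compute_per_task / human_cost_per_task)
    # ~17x human cost, i.e. within the stated 1-2 orders of magnitude
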
replies(1): >>43468035 #
23. fchollet ◴[] No.43466902[source]
There have been some human studies on ARC 1 previously, I expect there will be more in the future. See this paper from 2021, which was one of the earliest works in this direction: https://arxiv.org/abs/2103.05823
24. iandanforth ◴[] No.43466912[source]
I'd very much like to see VLAs get in the game with ARC. When I solve these puzzles I'm imagining myself moving blocks around. Much of the time I'm treating these as physics simulations with custom physics per puzzle. VLAs are particularly well suited to the kind of training and planning which might unlock solutions here.
25. Vecr ◴[] No.43466916{3}[source]
If you aren't joking, that will filter most humans.
replies(1): >>43466995 #
26. fchollet ◴[] No.43466922[source]
The reason these tasks require fluid intelligence is because they were designed this way -- with task uniqueness/novelty as the primary goal.

ARC 1 was released long before in-context learning was identified in LLMs (and designed before Transformer-based LLMs existed), so the fact that LLMs can't do ARC was never a design consideration. It just turned out this way, which confirmed our initial assumption.

replies(2): >>43467080 #>>43467479 #
27. daemonologist ◴[] No.43466945[source]
The "select" tool gives some help with tasks that require counting or copying. You can select areas of the input, which will show their dimensions, and copy-paste them into the output (ctrl+c/ctrl+v).
28. christianqchung ◴[] No.43466966{3}[source]
Are you in the process of creating tasks that behave as an acid test for AGI? If not, do you think such a task is feasible? I read somewhere in the ARC blog that they define AGI as the point when creating tasks that are hard for AI but easy for humans becomes virtually impossible.
29. neom ◴[] No.43466980[source]
Maybe this is a really stupid question but I've been curious... are LLMs based on... "Neuronormativity"? Like, what neurology is an LLM based on? Would we get any benefit from looking at neurodiverse processing styles?
replies(1): >>43467250 #
30. wmf ◴[] No.43466995{4}[source]
They said at least two people out of 400 solved each problem, so they're pretty hard.
replies(1): >>43468990 #
31. jwpapi ◴[] No.43467046[source]
Did we run out of textual tasks that are easy for humans but hard for AI, or why are the examples all graphics?
replies(2): >>43467083 #>>43468943 #
32. YeGoblynQueenne ◴[] No.43467080{3}[source]
>> The reason these tasks require fluid intelligence is because they were designed this way -- with task uniqueness/novelty as the primary goal.

That's in no way different from claiming that LLMs understand language, or reason, etc., because they were designed that way.

Neural nets of all sorts have been beating benchmarks since forever, e.g. there's a ton of language understanding benchmarks pretty much all saturated by now (GLUE, SUPERGLUE ULTRASUPERAWESOMEGLUE ... OK I made that last one up) but passing them means nothing about the ability of neural net-based systems to understand language, regardless of how much their authors designed them to test language understanding.

Failing a benchmark also doesn't mean anything. A few years ago, at the first Kaggle competition, the entries were ad-hoc and amateurish. The first time a well-resourced team tried ARC (OpenAI) they ran roughshod over it and now you have to make a new one.

At some point you have to face the music: ARC is just another benchmark, destined to be beat in good time whenever anyone makes a concentrated effort at it and still prove nothing about intelligence, natural or artificial.

replies(2): >>43467150 #>>43467170 #
33. fchollet ◴[] No.43467083[source]
You can easily convert these tasks to token strings. The reason why ARC does not use language as part of its format is that it seeks to minimize the amount of prior knowledge needed to approach the tasks, so as to focus on fluid intelligence as opposed to acquired knowledge.

All ARC tasks are built entirely on top of "Core Knowledge" priors, the kind of elementary knowledge that a small child has already mastered and that is possessed universally by all humans.
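
As a concrete illustration of "convert these tasks to token strings", here is one possible serialization; a sketch of the idea only, not the encoding any particular lab uses:

    def grid_to_tokens(grid):
        # one digit per cell, rows separated by newlines
        return "\n".join("".join(str(c) for c in row) for row in grid)

    def task_to_prompt(train_pairs, test_input):
        # demonstration pairs followed by the test input, as plain text
        parts = []
        for i, pair in enumerate(train_pairs):
            parts.append(f"Example {i+1} input:\n{grid_to_tokens(pair['input'])}")
            parts.append(f"Example {i+1} output:\n{grid_to_tokens(pair['output'])}")
        parts.append(f"Test input:\n{grid_to_tokens(test_input)}")
        parts.append("Test output:")
        return "\n\n".join(parts)

    print(task_to_prompt(
        [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
        [[1, 1], [0, 0]],
    ))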

replies(1): >>43489514 #
34. YeGoblynQueenne ◴[] No.43467095{8}[source]
>> We were assured by the highest level of their company + security division that the box our test was run on would be wiped after testing

Yuri Geller assured us he was bending the spoons with his mind. Somehow it was only when the Amazing Randi was present that Yuri Geller couldn't bend the spoons with his mind.

replies(1): >>43467979 #
35. fchollet ◴[] No.43467150{4}[source]
The first time a top lab spent millions trying to beat ARC was actually in 2021, and the effort failed.

By the time OpenAI attempted ARC in 2024, a colossal amount of resources had already been expended trying to beat the benchmark. The OpenAI run itself costs several millions in inference compute alone.

ARC was the only benchmark that highlighted o3 as having qualitatively different abilities compared to all models that came before. o3 is a case of a good approach meeting an appropriate benchmark, rather than an effort to beat ARC specifically.

replies(1): >>43475924 #
36. szvsw ◴[] No.43467170{4}[source]
I mostly agree with what you are saying but…

> passing them means nothing about the ability of neural net-based systems to understand language, regardless of how much their authors designed them to test language understanding.

Does this implicitly suggest that it is impossible to quantitatively assess a system’s ability to understand language? (Using the term “system” in the broadest possible sense)

Not agreeing or disagreeing or asking with skepticism. Genuinely asking what your position is here, since it seems like your comment eventually leads to the conclusion that it is unknowable whether a system external to yourself understands language, or, if it is possible, then only in a purely qualitative way, or perhaps purely in a Stewart-style-pornographic-threshold-test - you’ll know it when you see it.

I don’t have any problem if that’s your position- it might even be mine! I’m more or less of the mindset that debating whether artificial systems can have certain labels attached to them revolving around words like “understanding,” “cognition,” “sentience” etc is generally unhelpful, and it’s much more interesting to just talk about what the actual practical capabilities and functionalities of such systems are on the one hand in a very concrete, observable, hopefully quantitative sense, and how it feels to interact with them in a purely qualitative sense on the other hand. Benchmarks can be useful in the former but not the latter.

Just curious where you fall. How would you recommend we approach the desire to understand whether such systems can “understand language” or “solve problems” etc etc… or are these questions useless in your view? Or only useful in as much as they (the benchmarks/tests etc) drive the development of new methodologies/innovations/measurable capabilities, but not in assigning qualitative properties to said systems?

replies(1): >>43475882 #
37. dcre ◴[] No.43467250[source]
It’s kind of a silly question in that the neural architecture of neural nets is really only loosely inspired by neurology, and that basic vague neurology is shared by neurotypical people and neurodivergent people and animals and even bugs.
replies(1): >>43469330 #
38. falcor84 ◴[] No.43467298[source]
I spent half an hour playing with these now at https://arcprize.org/play and it's fun, but I must say that they are not "easy". So far I eventually solved all of the ones I've gone through, but several took me significantly more than the 2 tries allotted.

I wonder if this can be shown to be a valid IQ test, and if so, what IQ a person would need to solve e.g. 90% of them in 1 or 2 tries.

replies(2): >>43468658 #>>43469139 #
39. Davidzheng ◴[] No.43467363[source]
Probably OpenAI will be >60% in three months, if not immediately, with this $1000/question level of compute (which is the way, tbh: we should throw compute at problems whenever possible; that's the main advantage of silicon intelligence)
replies(1): >>43467372 #
40. Davidzheng ◴[] No.43467372[source]
Their own admission that intelligence is a meaningless metric without a bound on compute is one of the main reasons AI will overpower human intelligence soon. Simple scaling is very effective.
41. QuadmasterXLII ◴[] No.43467449{8}[source]
Are you aware that OpenAI brazenly lied and went back on its word about its corporate structure, board governance, and for-profit status, and of the opinion that your data sharing agreement is different and less likely to be ignored? Or are you at step zero where you aren’t considering malfeasance as a possibility at all?
42. danpalmer ◴[] No.43467479{3}[source]
Is there any other confirmation of the assumptions, other than the LLM behaviour, because that still feels like circular reasoning.

I think a similar claim could be levelled against other benchmarks or LLM evaluation tasks. One could say that the Turing test was designed to assess human intelligence, and LLMs pass it, therefore LLMs have human intelligence. This is generally considered to be false now, because we can plainly see that LLMs do not have intelligence in the same way as humans (yet? debatable, not the point), and instead we concluded that the Turing test was not the right benchmark. That's not to diminish its importance, it was hugely important as a part of AI education and possibly even AI development for decades.

ARC does seem to be pushing the boundaries, I'm just not convinced that it's testing a provable step change.

replies(1): >>43469115 #
43. ttol ◴[] No.43467561[source]
Had to give https://reasoner.com a try on ARC-AGI-2.

Reasoner passed on first try.

“Correct!”

(See screenshot that shows one rated “hard” -- https://www.linkedin.com/posts/waynechang_tried-reasoner-on-...)

44. vessenes ◴[] No.43467579[source]
Just want to say I really love these new problems - feels like some general intelligence went into conceiving of and creating these puzzles: we just did a few over dinner as a family.

You have my wheels turning on how to get computers better at these. Looking forward to seeing the first computer tech that can get 30-50% on these!

45. mrshadowgoose ◴[] No.43467745{4}[source]
Strong emphasis on "seems".

I'd encourage you to review the definition of "brute force", and then consider the absolutely immense combinatoric space represented by the grids these puzzles use.

"Brute force" simply cannot touch these puzzles. An amount of understanding and pattern recognition is strictly required, even with the large quantities of test-time compute that were used against arc-agi-1.

replies(1): >>43469429 #
46. az226 ◴[] No.43467810[source]
Did any single individual solve all problems? How many such individuals were there?
47. levocardia ◴[] No.43467979{9}[source]
Ironically "I have a magic AI test but nobody is allowed to use it" is a lot closer to the Yuri Geller situation. Tests are meant to be taken, that should be clear. And...maybe this does not apply in the academic domain, but to some extent if you cheat on an AI test "you're only cheating yourself."
replies(1): >>43468143 #
48. levocardia ◴[] No.43468015[source]
I'm really pleased to see this! The original ARC-AGI-1 paper still informs how I think about "what is intelligence" today. I was thrilled to see AI models make real progress on that test precisely when we had the next big idea (reasoning). Here's to hoping round 2 falls with a similarly big breakthrough!
49. Legend2440 ◴[] No.43468035{5}[source]
Efficiency sounds like a hardware problem as much as a software problem.

$10,000 in compute is a moving target; today's GPUs are much, much better than those of 10 years ago.

replies(1): >>43468977 #
50. tananaev ◴[] No.43468067[source]
Did I read this right that only 2 humans out of 400 solved the problems?
replies(2): >>43468176 #>>43469493 #
51. doctorpangloss ◴[] No.43468081[source]
Why doesn’t every blogpost contain an example of a question you ask?
52. Jensson ◴[] No.43468143{10}[source]
> but to some extent if you cheat on an AI test "you're only cheating yourself."

You cheat investors.

replies(1): >>43468660 #
53. trott ◴[] No.43468176{3}[source]
They started with N >= 120x3 tasks, and gave each task to 4-9 humans. Then they kept only those 120x3 tasks that at least 2 humans had solved.
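
In code, the described filter is something like this sketch (invented data; not the foundation's actual pipeline):

    def keep_calibrated(results, min_solvers=2):
        # results: task_id -> list of per-tester pass/fail booleans
        # (each task was shown to roughly 4-9 testers)
        return [t for t, r in results.items() if sum(r) >= min_solvers]

    results = {"t1": [True, True, False, False], "t2": [False, True, False]}
    print(keep_calibrated(results))  # ['t1']
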
replies(1): >>43468272 #
54. Chathamization ◴[] No.43468268[source]
> Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI.

I don’t think that follows. Just because people fail to create ARC-AGI problems that are difficult for an AI to solve, doesn’t mean that said AI can just be plugged into a humanoid robot and it will now reliably cook dinner, order a pizza and drive to pick it up, take a bus to downtown to busk on the street and take the money back home, etc.

ARC-AGI is an interesting benchmark, but it’s extremely presumptive to think that these types of tests are going to demonstrate AGI.

replies(5): >>43468308 #>>43468359 #>>43468692 #>>43468875 #>>43471812 #
55. tananaev ◴[] No.43468272{4}[source]
That's a very small sample size per task. I wonder what the result would be if they gave the whole data set to an average human. I tried some simple tasks and they were doable, but I couldn't figure out the hard ones.
56. Palmik ◴[] No.43468308{3}[source]
In your example you already indicated two tasks that you think might be hard for AI but easy for humans.

Who said that cooking dinner couldn't be part of ARC-AGI-<N>?

replies(1): >>43468338 #
57. Centigonal ◴[] No.43468318[source]
Thank you for including cost (or really any proxy for efficiency) as a dimension to this prize!
58. Chathamization ◴[] No.43468338{4}[source]
That’s precisely what I meant in my comment by “these types of tests.” People are eventually going to have some sort of standard for what they consider AGI. But that doesn’t mean the current benchmarks are useful for this task at all, and saying that the benchmarks could be completely different in the future only underscores this.
replies(1): >>43468361 #
59. yorwba ◴[] No.43468359{3}[source]
The point isn't demonstrating AGI, but rather demonstrating that AGI definitely hasn't been reached yet.
60. pillefitz ◴[] No.43468361{5}[source]
They are useful to reach Arc-N+1
replies(1): >>43468411 #
61. nneonneo ◴[] No.43468395[source]
Nitpick: “Public” is misspelled as “pubic” in several of the captions on that page.
replies(2): >>43468675 #>>43468707 #
62. Chathamization ◴[] No.43468411{6}[source]
How are any of these a useful path to asking an AI to cook dinner?

We already know many tasks that most humans can do relatively easily, yet most people don’t expect AI to be able to do them for years to come (for instance, L5 self-driving). ARC-AGI appears to be going in the opposite direction - can these models pass tests that are difficult for the average person to pass.

These benchmarks are interesting in that they show increasing capabilities of the models. But they seem to be far less useful at determining AGI than the simple benchmarks we’ve had all along (can these models do everyday tasks that a human can do?).

replies(2): >>43469091 #>>43476270 #
63. az226 ◴[] No.43468455[source]
Which puzzles had the lowest solve rate? I did the first 10 and they all felt easy (mentally solved in 10-20 seconds for the easier ones and 30-60 seconds for the harder ones). I’d like to try the most difficult ones.
64. colordrops ◴[] No.43468658[source]
Yes, I looked at these and thought about what percentage of humans could even solve them. It seems that, unless average humans are not considered generally intelligent, a test for general intelligence should be passable by most humans.
replies(1): >>43486200 #
65. anshumankmr ◴[] No.43468660{11}[source]
And end users and developers and the general public too...

But here is the thing: even if it's rote memorization, I wonder why GPT-4o couldn't perform just as well on ARC-AGI-1, or whether the "reasoning" helped in some way.

66. carra ◴[] No.43468675[source]
Maybe realizing those things is the actual test?
67. ◴[] No.43468692{3}[source]
68. Nuzzerino ◴[] No.43468706[source]
Why wasn’t the ICOM framework (D. Kelley) allowed to make a scoring submission after they claimed to have beaten the scores? Are you concerned that this may appear to contradict your mission statement and alienate the AGI community?
69. anshumankmr ◴[] No.43468707[source]
Oof, it's still there... but yeah, typos happen lol
70. jononor ◴[] No.43468875{3}[source]
The tasks you mention require intelligence but also a robot body with a lot of physical dexterity suited to a designed-for-humanoids world. That seems like an additional requirement on top of intelligence? Maybe we do not want an AGI definition to include that?

There are humans who cannot perform these tasks, at least without assistive/adapted systems such as a wheelchair and accessible bus.

replies(2): >>43469358 #>>43469425 #
71. littlestymaar ◴[] No.43468917{3}[source]
What makes you think so?

From ChatGPT 3.5 to o1, all LLM progress came from investment in training: either by using much more data, or by using higher quality data thanks to artificial data.

o1 (and then o3) broke this paradigm by applying a novel idea (RL + search on CoT), and it's because of this that it was able to make progress on ARC-AGI.

So IMO the success of o3 goes in favor of the argument of how we are in an idea-constrained environment.

replies(1): >>43469981 #
72. ustad ◴[] No.43468931[source]
Using AGI in the titles of your tests might not be accurate or appropriate. May I suggest NAI - Narrow AI?
replies(1): >>43469023 #
73. timonofathens ◴[] No.43468943[source]
ARC tasks are language-independent
74. NitpickLawyer ◴[] No.43468977{6}[source]
> $10000 in compute is a moving target

And it's also irrelevant in some fields. If you solve a "protein folding" problem that was a blocker for a pharma company, that $10k is peanuts now.

Same for coding. If you can spend $100/hr on a "mid-level" SWE agent, but you can literally spawn 100 today and 0 tomorrow and reach your clients faster, again the cost is irrelevant.

75. NitpickLawyer ◴[] No.43468990{5}[source]
I don't think that's correct. They had 400 people receive some questions, and only kept the questions that were solved by at least 2 people. The 400 people didn't all receive 120 questions (they'd have probably got bored).

If you go through the example problems you'll notice that most are testing the "aha" moment. Once you do a couple, you know what to expect, but with larger grids you have to stay focused and keep track of a few things to get it right.

76. JFingleton ◴[] No.43469023{3}[source]
My prediction: we'll be arguing about what AGI actually is... Forever.
replies(1): >>43469094 #
77. fastball ◴[] No.43469091{7}[source]
The "everyday tasks" you specifically mention involve motor skills that are not useful for measuring intelligence.
78. throwuxiytayq ◴[] No.43469094{4}[source]
Or depending on your outlook, for a couple of years, and then we will no longer be participating in these or any other cognitive exercises.
79. jononor ◴[] No.43469112{3}[source]
Not Greg/team, so unrelated opinion. The o3 solution for ARC v1 was incredibly expensive. Some good ideas are needed, at the least, to take that cost down by a factor of 100-10000x.
replies(1): >>43469964 #
80. JFingleton ◴[] No.43469115{4}[source]
I'm not sure that's quite correct about the Turing test. From Wikipedia:

"Turing did not explicitly state that the Turing test could be used as a measure of "intelligence", or any other human quality. He wanted to provide a clear and understandable alternative to the word "think", which he could then use to reply to criticisms of the possibility of "thinking machines" and to suggest ways that research might move forward."

81. fastball ◴[] No.43469139[source]
I did the first 10 from ARC-AGI-2 (hard) set. 9 were in one try, 1 was in two.

To be fair I've spent a lot of time thinking about cellular automata and Conway's game of life, which definitely seems to be influencing the design of these puzzles.

82. fastball ◴[] No.43469183[source]
I don't know if this was a design goal, but I just did the first 10 Arc-AGI-2 public eval (hard) puzzles, and found them much more enjoyable (as a human) than any of the Arc-AGI-1 puzzles. That said the grid/puzzle editor is still a little clunky – would be nice to be able to drag-to-paint and have an adjustable brush size.
83. jononor ◴[] No.43469318[source]
Not the team, I just follow ARC on-and-off as an ML engineer. I think it will take a few years (at least) to see the impact of ARC, especially of the more conceptual works. Those are closer to basic research than applied - it will take time before the lessons are transferred to applications (which also requires considerable R&D).

But more importantly, current LLM-based systems and in-the-spirit-of-ARC systems have quite different goals. The ARC challenge is intended to measure and build systems which can learn efficiently - that is, solve a novel task with very little new data. See F. Chollet's paper "On the Measure of Intelligence". Current LLMs do not care about learning efficiency at all - actually the strategy is the complete opposite - they aim to utilize as much data and compute as possible to make the most capable system (at least on tasks that are somehow spanned by the training data). This works well, but is certainly quite costly, and it might also limit applications to those that do not require a lot of learning at runtime (we still do not know how far we can take in-context learning).

ARC brings in a fresh perspective, but I expect it to take several years for the approaches to really start cross-pollinating.
84. ZeroTalent ◴[] No.43469330{3}[source]
Also we barely understand how cognition works, AFAIK.
85. aziaziazi ◴[] No.43469358{4}[source]
I read that as “humans can perform these tasks, at least with…”

Put the computer in a wheelchair of its choice and let it try to catch the bus. How else would you compare program and human reasoning abilities while disregarding the human's ability to interact with the outside world?

Edit: ARC-AGI itself is only approachable by humans with functional vision and hands; others need assistive devices.

86. Chathamization ◴[] No.43469425{4}[source]
> at least without assistive/adapted systems such as a wheelchair and accessible bus.

Which is precisely what the robotic body I mentioned would be.

You're talking about humans who have the mental capacity to do these things, but who don't control a body capable of doing them. That's the exact opposite of an AI that controls a body capable of doing these things, but lacks the mental capacity to do them.

87. Davidzheng ◴[] No.43469429{5}[source]
Also, there's no clear way to verify the solution. There could easily be multiple rules that work on the same examples.
88. mapmeld ◴[] No.43469493{3}[source]
No, they're saying that the problems have been reviewed/play-tested by ≥2 humans, so they are not considered unfair or too ambiguous to solve in two attempts (a critique of some ARC-AGI-1 puzzles that o3 missed). They had a lot of puzzles, so the puzzles were divided among some number of testers, but I don't think every tester had to try every problem.
89. torginus ◴[] No.43469964{4}[source]
Yeah, my analogy for that solution is that it's like claiming to have solved array sorting by using enormous compute to try all possible orderings of an array of length 100.

It's not a real solution because:

- It's way too expensive

- It doesn't scale the way a real solution does
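
The analogy in numbers, for anyone who wants to see the gap:

    import math

    n = 100
    print(math.factorial(n))        # ~9.3e157 orderings to try by brute force
    print(round(n * math.log2(n)))  # ~664 comparisons for a real sorting algorithm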

90. torginus ◴[] No.43469981{4}[source]
This isn't a novel idea - some people tried the exact same thing the day GPT-4 came out.

And going back even further, there's Goal Oriented Action Planning - an old timey video game AI technique, that's basically searching through solution space to construct a plan:

https://medium.com/@vedantchaudhari/goal-oriented-action-pla...

(besides the fact that almost all old timey AI is state-space solution search)
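
For anyone unfamiliar with GOAP, a toy sketch of the idea (invented domain; real implementations typically add action costs and heuristic search rather than plain BFS):

    from collections import deque

    # each action: name -> (preconditions, effects), both sets of facts
    ACTIONS = {
        "get_axe":   ({"axe_nearby"}, {"has_axe"}),
        "chop_tree": ({"has_axe"}, {"has_wood"}),
        "make_fire": ({"has_wood"}, {"warm"}),
    }

    def plan(start, goal):
        # breadth-first search through world states
        queue = deque([(frozenset(start), [])])
        seen = {frozenset(start)}
        while queue:
            state, path = queue.popleft()
            if goal <= state:
                return path
            for name, (pre, eff) in ACTIONS.items():
                if pre <= state:
                    nxt = frozenset(state | eff)
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, path + [name]))
        return None

    print(plan({"axe_nearby"}, {"warm"}))
    # ['get_axe', 'chop_tree', 'make_fire']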

replies(1): >>43480778 #
91. CooCooCaCha ◴[] No.43471812{3}[source]
The statement you quoted is a general statement, not specific to ARC-AGI.

The scenarios you listed are examples of what they’re talking about. Those are tasks that humans can easily do but robots have a hard time with.

92. YeGoblynQueenne ◴[] No.43475882{5}[source]
>> Does this implicitly suggest that it is impossible to quantitatively assess a system’s ability to understand language? (Using the term “system” in the broadest possible sense)

I don't know and I don't have an opinion. I know that tests that claimed to measure language understanding, historically, haven't. There's some literature on the subject if you're curious (sounds like you are). I'd start here:

Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data

Emily M. Bender, Alexander Koller

https://aclanthology.org/2020.acl-main.463/

Quoting the passage that I tend to remember:

>> While large neural LMs may well end up being important components of an eventual full-scale solution to human-analogous NLU, they are not nearly-there solutions to this grand challenge. We argue in this paper that genuine progress in our field — climbing the right hill, not just the hill on whose slope we currently sit —depends on maintaining clarity around big picture notions such as meaning and understanding in task design and reporting of experimental results.

93. YeGoblynQueenne ◴[] No.43475924{5}[source]
>> The first time a top lab spent millions trying to beat ARC was actually in 2021, and the effort failed.

Which top lab was that? What did they try?

>> ARC was the only benchmark that highlighted o3 as having qualitatively different abilities compared to all models that came before.

Unfortunately observations support a simpler hypothesis: o3 was trained on sufficient data about ARC-1 that it could solve it well. There is currently insufficient data on ARC-II to solve it therefore o3 can't solve it. No super magickal and mysterious qualitatively different abilities to all models that came before required whatsoever.

Indeed, that is a common pattern in machine learning research: newer models perform better on benchmarks than earlier models not because their capabilities increase with respect to earlier models but because they're bigger models, trained on more data and more compute. They're just bigger, slower, more expensive- and just as dumb as their predecessors.

That's 90% of deep learning research in a nutshell.

replies(1): >>43479221 #
94. mchusma ◴[] No.43476270{7}[source]
Genuine question: do you feel Waymo is not L5 self-driving? I think Waymo has L5, but it's not truly economical yet.
95. bubblyworld ◴[] No.43479221{6}[source]
I'm sorry, but what observations support that hypothesis? There were scores of teams trying exactly that - training LLMs directly on Arc-AGI data - and by and large they achieved mediocre results. It just isn't an approach that works for this problem set.

To be honest your argument sounds like an attempt to motivate a predetermined conclusion.

replies(1): >>43498398 #
96. littlestymaar ◴[] No.43480778{5}[source]
What's new is applying that to LLMs.

> This isn't a novel idea - some people tried the exact same thing the day GPT4 came out.

What do you mean? Since GPT4's weights aren't available, you can't run RL on it by yourself. Only OpenAI can.

97. cubefox ◴[] No.43486200{3}[source]
I would argue that small children and even most animals also count as "general" intelligences. Animals are much less intelligent than grown humans, but that doesn't mean they are less general. Just like, say, AlphaGo 2 is more intelligent but not more general than AlphaGo 1. Or Qwen 32B vs Qwen 7B. Model or brain size alone doesn't determine generality. Generality is more a question of architecture.
replies(1): >>43489447 #
98. colordrops ◴[] No.43489447{4}[source]
Is there a formal or at least clear consensus definition of "general" intelligence? I assume it involves some level of autonomy and ability to manage novel situations.
replies(1): >>43490976 #
99. jwpapi ◴[] No.43489514{3}[source]
Can you explain this to me? Would the token strings be as easy for humans to solve as well?

Or let me ask differently. Can we still design text questions that are easy for humans and tough for AI?

100. cubefox ◴[] No.43490976{5}[source]
There is no consensus on this.

> I assume it involves some level of autonomy and ability to manage novel situations.

Yeah. Also operating in real-time (robotics) and being able to process sensory data only, instead of relying on preprocessed data like text tokens.

101. YeGoblynQueenne ◴[] No.43498398{7}[source]
In which case what is the point of your comment? I mean what do you expect me to do after reading it, reach a different predetermined conclusion?
replies(1): >>43501898 #
102. bubblyworld ◴[] No.43501898{8}[source]
Provide some evidence for your claims? This empty rhetoric stuff in every AI thread on HN wears me out a bit. I apologise for being a little aggressive in my previous comment.