
Getting 50% (SoTA) on Arc-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points by tomduncalf | 59 comments
1. mikeknoop ◴[] No.40712282[source]
(ARC Prize co-founder here).

Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

Roughly, he's implemented an outer loop and is using 4o to sample reasoning traces/programs from training data and then test them. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.
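
In rough pseudocode, that outer loop looks something like the sketch below (the helper names are hypothetical placeholders, not Ryan's actual code):

    # Minimal sketch of the sample-and-select outer loop described above.
    # llm_generate_program and run_program are hypothetical placeholders.

    def llm_generate_program(train_pairs) -> str:
        """Placeholder: prompt GPT-4o with the demonstration pairs, return Python source."""
        raise NotImplementedError

    def run_program(program_src: str, grid):
        """Placeholder: execute a candidate program (sandboxed) on one input grid."""
        raise NotImplementedError

    def solve_task(train_pairs, test_inputs, n_samples=8000):
        """Sample candidate programs and keep one that fits every demonstration."""
        for _ in range(n_samples):
            program_src = llm_generate_program(train_pairs)
            if all(run_program(program_src, inp) == out for inp, out in train_pairs):
                return [run_program(program_src, t) for t in test_inputs]
        return None  # no sampled program reproduced all the examples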

A couple important notes:

1. this result is on the public eval set vs private set (ARC Prize $).

2. the current private set SOTA ~35% solution also performed ~50% on the public set. so this new result might be SOTA but hasn't been validated or scrutinized yet.

All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard

EDIT: also, congrats and kudos to Ryan for achieving this and putting the effort in to document and share his approach. We hope to inspire more frontier AI research sharing like this.

replies(11): >>40712673 #>>40712907 #>>40713440 #>>40714116 #>>40714245 #>>40714428 #>>40715353 #>>40715468 #>>40715482 #>>40716604 #>>40718028 #
2. refibrillator ◴[] No.40712673[source]
Do you have any perspectives to share on Ryan's observation of a potential scaling law for these tasks and his comment that "ARC-AGI will be one benchmark among many that just gets solved by scale"?
replies(1): >>40714361 #
3. Nimitz14 ◴[] No.40712907[source]
Ah, that's an important detail about public v private. Makes it a nice result but not nearly as impressive as initially stated.
4. hackerlight ◴[] No.40713440[source]
Reminds me of the AlphaCode approach.

Why do you say it's sampling programs from "training data"? With that choice of words, you're rhetorically assuming the conclusion.

If he only sampled 20 programs, instead of 8000, will we still say the programs came from "training data", or will we say it's genuine OOD generalization? At what point do we attribute the intelligence to the LLM itself instead of the outer loop?

This isn't meant to be facetious. Because clearly, if the N programs sampled is very large, it's easy to get the right solution with little intelligence by relying on luck. But as N gets small the LLM has to be intelligent and capable of OOD generalization, assuming the benchmark is good.

5. refreshingdrink ◴[] No.40714116[source]
Also worth noting that Ryan mentions

> In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set

and

> it's unfortunate that these sets aren’t IID: it makes iteration harder and more confusing

It’s not unfortunate: generalizing beyond the training distribution is a crucial part of intelligence that ARC is trying to measure! Among other reasons, developing with test-set data is a bad practice in ML because it hides the difficulty of this challenge. Even worse, writing about a bunch of tricks that help results on this subset is extending the test-set leakage to the blog post's readers. This is why I'm glad the ARC Prize has a truly hidden test set.

replies(1): >>40715655 #
6. ec109685 ◴[] No.40714245[source]
There are similarities to the approach in this paper (though they trained a model from scratch): https://arxiv.org/pdf/2309.07062

How well would an LLM trained with a huge number of examples do on this test? Essentially with enough attention, Goodhart's law will take over.

7. mikeknoop ◴[] No.40714361[source]
ARC isn't perfect and I hope ARC is not the last AGI benchmark. I've spoken with a few other benchmark creators looking to emulate ARC's novelty in other domains, so I think we'll see more. AGI benchmarks likely need to evolve alongside the tech -- humans have to design these tasks today to ensure novelty but should expect that to shift.

One core idea we've been advocating with ARC is that pure LLM scaling (parameters...) is insufficient to achieve AGI. Something new is needed. And OPs approach using a novel outer loop is one cool demonstration of this.

8. sriku ◴[] No.40714428[source]
Part of the challenge, as I understood it, is to learn priors from the training set that can then be applied to an extended private test set. This approach doesn't seem to do any such "learning" on the go. So, supposing it accomplished 85% on the private test set, would it be construed to have won the prize, with "we have AGI" being trumpeted out?
9. jd115 ◴[] No.40715353[source]
Reminds me a bit of Genetic Programming as proposed by John Holland, John Koza, etc. Ever since GPT came out, I've been thinking of ways to combine that original idea with LLMs in some way that would accelerate the process with a more "intelligent" selection.
replies(1): >>40718090 #
10. lelanthran ◴[] No.40715468[source]
Maybe I am missing something, but to me this looks like "Let's brute-force on the training data".

I mean, generating tens of thousands of possible solutions to find one that works does not, to me, signify AGI.

After all, the human solving these problems doesn't make 10k attempts before getting a solution, do they?

The approach here, due to brute force, can't really scale: if a random solution to a very simple problem has a 1/10k chance of being right, you can't scale this up to non-trivial problems without exponentially increasing the computational power used. Hence, I feel this is brute-force.
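
Taking the 1-in-10k figure above purely as an illustration, the scaling worry can be made concrete: with per-sample success probability p, the chance that at least one of N independent samples works is 1 - (1-p)^N, so N has to grow roughly like 1/p as tasks get harder.

    # Illustrative arithmetic only, using the 1-in-10k figure from this comment.
    def p_at_least_one_hit(p_single, n_samples):
        """Probability that at least one of n independent samples is correct."""
        return 1.0 - (1.0 - p_single) ** n_samples

    print(p_at_least_one_hit(1e-4, 10_000))  # ~0.63 for the "simple" task
    print(p_at_least_one_hit(1e-8, 10_000))  # ~0.0001 if the task is 10,000x harder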

replies(1): >>40716577 #
11. YeGoblynQueenne ◴[] No.40715482[source]
Ah, give it a rest. That's not "frontier AI research", neither is it any kind of reasoning. It's the dumbest of the dumb possible generate-and-test approach that spams a fire hose of Python programs until it hits one that works. And still it gets only 50% on the public eval.

How many thousands of Python programs does a human need to solve a single ARC task? That's what you get with reasoning: you don't need oodles of compute and boodles of sampling.

And I'm sorry to be so mean, but ARC is a farce. It's supposed to be a test for AGI but its only defense from a big data approach (what Francois calls "memorisation") is that there are few examples provided. That doesn't make the tasks hard to solve with memorisation; it just makes it hard for a human researcher to find enough examples to solve with memorisation. Like almost every other AI-IQ test before it, ARC is testing for the wrong thing, with the wrong assumptions. See the Winograd Schema Challenge (but not yet the Bongard problems).

replies(3): >>40717360 #>>40719608 #>>40720800 #
12. rfoo ◴[] No.40715655[source]
... and we know that if we really want to nail it we'd better just pay someone else to create 1,000,000 more, harder problems for training (without looking at any in the test set, of course). i.e. make the training set distribution similar to the test set again.

Because the thing we have now is data-hungry. Your brain is pre-trained on other similar challenges as well. What's the point of requiring it to "generalize beyond the training distribution" with so few samples?

Really, I thought LLMs ended this "can we pretrain on in-house prepared private data for ILSVRC" flame war already.

replies(2): >>40715788 #>>40715850 #
13. YeGoblynQueenne ◴[] No.40715788{3}[source]
The problem with that is we know approaches that can generalise very well from very few examples, even one example, without any kind of pretraining. That requires a good background theory of the target domain (a "world model" in more modern parlance), and we don't know how to automatically generate that kind of theory; only human minds can do it, for now. But given such a theory the number of examples needed can be as few as 1. Clearly, if you can learn from one example, but find yourself using thousands, you've taken a wrong turn somewhere.

The concern with the data-hungry approach to machine learning, that at least some of us have, is that it has given up on the effort to figure out how to learn good background theories and turned instead to getting the best performance possible in the dumbest possible way, relying on the largest available amount of examples and compute. That's a trend against everything else in computer science (and even animal intelligence) where the effort is to make everything smaller, cheaper, faster, smarter: it's putting all the eggs in the basket of making it big, slow and dumb, and hoping that this will somehow solve... intelligence. A very obvious contradiction.

Suppose we lived in a world that didn't have a theory of computational complexity and didn't know that some programs are more expensive to run than others. Would it be the case in that world, that computer scientists competed in solving ever larger instances of the Traveling Salesperson Problem, using ever larger computers, without even trying to find good heuristics exploiting the structure of the problem and simply trying to out-brute-force each other? That world would look a lot like where we are now with statistical machine learning: a pell-mell approach to throwing all resources at a problem that we just don't know how to solve, and don't even know if we can solve.

replies(2): >>40715903 #>>40716388 #
14. advael ◴[] No.40715850{3}[source]
You seem to misunderstand why generalization is important for making claims about intelligent systems. To illustrate this, we could really easily design a system that encodes all the test set questions and their answers, puts them in an enormous hash table, and looks up the correct answer to each challenge when presented with it. This could probably score 100% on ARC if given the entire test set. Would you call this AGI? What if I put it through a transformer as a hashing function?
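
A minimal sketch of that thought experiment (purely illustrative, hypothetical class name):

    # The "memorize the test set" non-solution from the paragraph above:
    # it would score perfectly while demonstrating no intelligence at all.
    class LookupTableSolver:
        def __init__(self, leaked_test_set):
            # leaked_test_set: iterable of (task, answer) pairs known in advance
            self._answers = {repr(task): answer for task, answer in leaked_test_set}

        def solve(self, task):
            return self._answers[repr(task)]  # pure recall, zero generalization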

The mainstream attention LLMs have garnered has added a bunch of noise to the way we talk about machine learning systems, and unfortunately the companies releasing them are partially to blame for this. That doesn't mean we should change the definition of success for various benchmarks to better suit lay misunderstandings of how this all works

replies(1): >>40717004 #
15. advael ◴[] No.40715903{4}[source]
The formalism that data-driven machine learning leans on is empirical tuning of stochastic search to drive approximation of functions, and despite what Silicon Valley would have you believe, most of the significant advances have been in creating useful meta-structures for modeling certain kinds of problems (e.g. convolution for efficiently processing transformations that care about local structure across dimensions of data, or qkv attention for keeping throughlines of non-local correspondences intact through a long sequence). Neural networks as a flavor of empirical function approximation happened to scale well, and then a bunch of people who saw how much this scale improved the models' capabilities but couldn't be bothered to understand the structural component concluded that scale somehow magically gets you to every unsolved problem being solved. It's also convenient for business types that if you buy this premise, any unicorn they want to promise is just a matter of throwing obscene amounts of resources at the problem (through their company of course)
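
For anyone who hasn't seen the second example spelled out, a bare-bones numpy version of scaled dot-product (QKV) attention looks roughly like this (a sketch of the standard formulation, not any particular model's code):

    import numpy as np

    def qkv_attention(Q, K, V):
        """Each query position mixes the value vectors, weighted by how
        strongly its query matches every key (softmax over keys)."""
        d_k = Q.shape[-1]
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, q_len, k_len)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                                  # (batch, q_len, d_v)

    x = np.random.randn(1, 4, 8)    # one toy sequence of 4 tokens, dim 8
    out = qkv_attention(x, x, x)    # self-attention over the sequence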

I think probably the general idea of dynamic structures that are versatile in their ability to approximate functional models is at least a solid hypothesis for how some biological intelligence works at some level (I think maybe the "fluid/crystallized" intelligence distinction some psychology uses is informative here - a strong world model probably informs a lot of quick acquisition of relationships, but most intelligent systems clearly possess strong feedback mechanisms for capturing new models), though I definitely agree that a focus on how best to throw a ton of scale at these models doesn't seem like a fruitful path for actionably learning how to build or analyze intelligent systems in the way we usually think about, nor is it, well, sustainable. Moore's law appeals to business people because buying more computronium feels more like a predictable input-output relationship to put capital into, but even if we're just talking about raw computation speed, advances in algorithms tend to dwarf advances in computing power in the long run. I think the same will hold true in AGI.

replies(1): >>40720145 #
16. yccs27 ◴[] No.40716388{4}[source]
Sadly, right now the "throw lots of compute at it in the dumbest possible way" models work, and the "learn good background theories" approaches have gone nowhere. It's Rich Sutton's Bitter Lesson and a lot of us aren't ready to accept it.

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

replies(1): >>40719660 #
17. killerstorm ◴[] No.40716577[source]
10000 samples are nothing compared to 2^100 possible outputs. It is absolutely, definitely not a "brute search". Testing a small fraction of possibilities (e.g. 0.000001%) is called heuristics, and that's what people use too.
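
A quick back-of-the-envelope check of that gap (the numbers are from this comment; the arithmetic is only illustrative):

    # Fraction of a 2^100-sized output space covered by 10,000 samples.
    samples = 10_000
    space = 2 ** 100
    print(samples / space)   # ~7.9e-27, an astronomically small fraction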

Please learn a bit of combinatorics.

> After all, the human solving these problem doesn't make 10k attempts before getting a solution, do they?

No. People have much better "early rejection"; also, the human brain has massive parallel compute capacity.

It's ridiculous to demand that GPT-4 perform as well as a human. Obviously its vision is much worse and it doesn't have the 'video' and physics priors people have, so it has to guess more times.

replies(1): >>40716765 #
18. machiaweliczny ◴[] No.40716604[source]
Do you accept such solutions as legit? It's obviously easier to generate a program than to craft a prompt that will solve it.
19. lelanthran ◴[] No.40716765{3}[source]
> 10000 samples are nothing compared to 2^100 possible outputs. It is absolutely, definitely not a "brute search". Testing a small fraction of possibilities (e.g. 0.000001%) is called heuristics, and that's what people use too.

Brute searching literally means generating solutions until one works. Which is exactly what is being done here.

> Please learn a bit of combinatorics.

Don't be condescending - I understand the problem space just fine. Fine enough to realise that the problem was constructed specifically to ensure that "solutions" such as this just won't work.

Which is why this "solution" is straight-up broken (doesn't meet the target, exceeds the computational bounds, etc.).

> It's ridiculous to demand GPT-4 performs as good as a human.

Wasn't the whole point of this prize to spur interest in a new approach to learning? What does GPT-[1234] have to do with the contest rules? Especially since this solution broke those rules anyway?

> Obviously its vision is much worse and it doesn't have 'video' and physics priors people have, so it has to guess more times.

That's precisely my point - it has to guess. Humans aren't guessing for those types of problems (not for the few that I saw anyway).

replies(2): >>40717295 #>>40718880 #
20. rfoo ◴[] No.40717004{4}[source]
First, LLMs are not AGI. Never will be. Can we talk now?

> if given the entire test set.

I don't want the entire test set. Or any single one in the test set.

The problem here is that the ARC challenge deliberately gives a training set with a different distribution than both the public and the private test sets. It's like having only 1+1=2, 3+5=8, 9+9=18 in the training set and then 1+9=10, 5*5=25, 16/2=8, (0!+0!+0!+0!)!=24 in the test set.

I can see the argument of "giving the easy problems as demonstration of rules and then with 'intelligence' [1] you should be able to get harder ones (i.e. a different distribution)", but I don't believe it's a good way to benchmark current methods, mainly because there are shortcuts. Like I can teach my kids how factorial works and that ! means factorial, instead of teaching them only how addition works and making them figure out how multiplication, division and factorial work and what the notation is.

[1] Whatever that means.

replies(3): >>40719825 #>>40720526 #>>40720676 #
21. ealexhudson ◴[] No.40717295{4}[source]
I think to be clear, brute force generally means an iterative search of a solution space. I don't think that's what this system is doing, and it's not like it's following some search path and returning as early as possible.

It's similar in that a lot of wrong answers are being thrown up, but I think this is more like a probabilistic system which is being pruned than a walk of the solution space. It's much smarter, but not as smart as we would like.

replies(1): >>40717802 #
22. ◴[] No.40717360[source]
23. lelanthran ◴[] No.40717802{5}[source]
> I think to be clear, brute force generally means an iterative search of a solution space.

Sure, but not an exhaustive one - you stop when you get a solution[1]. Brute force does not require an exhaustive search in order to be called brute-force.

GP was using the argument that because it is not exhaustive, it cannot be brute-force. That's the wrong argument. Brute-force doesn't have to be exhaustive to be brute-force.

[1] Or a good enough solution.

replies(1): >>40717996 #
24. naasking ◴[] No.40717996{6}[source]
A brute force search can be expected to find a solution after a more thorough search of the space of possibilities. If it really is only searching 0.000001% of that space before finding solutions, then some structure of the problem is guiding the search and it's no longer brute force.
25. data_maan ◴[] No.40718028[source]
It's not that novel. Others have implemented this approach, in the context of mathematics.

Already the 2021 paper by Drori (and many papers since) did similar things.

It's a common idea in this space...

26. lachlan_gray ◴[] No.40718090[source]
I’d love to hear more about this!
27. lelanthran ◴[] No.40719551{5}[source]
> I studied algorithms for years.

Who hasn't?

> You're 100% WRONG on everything you wrote.

Maybe you should update the wikipedia page, then all the other textbooks, that use a definition of brute-force that matches my understanding of it.

From https://en.wikipedia.org/wiki/Brute-force_search

> Therefore, brute-force search is typically used when the problem size is limited, or when there are problem-specific heuristics that can be used to reduce the set of candidate solutions to a manageable size.

Further, in the same page https://en.wikipedia.org/wiki/Brute-force_search#Speeding_up...

> One way to speed up a brute-force algorithm is to reduce the search space, that is, the set of candidate solutions, by using heuristics specific to the problem class.

I mean, the approach under discussion is literally exactly this.

Now, Mr "ACM ICPC, studied algorithms for years", where's your reference that reducing the solution space using heuristics results in a non-brute-force algorithm?

replies(2): >>40720086 #>>40721550 #
28. jononor ◴[] No.40719608[source]
Do you have any suggestions for a better approach of testing artificial intelligence? I mean, in a way that allows comparing different approaches and being a reasonable metric of progress.
replies(1): >>40720015 #
29. lesuorac ◴[] No.40719660{5}[source]
> that "brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

Mostly tangential to the article but I never really like this argument. Like you're playing a game a specific way and somebody else comes in with a new approach and mops the floor with you and you're going to tell me "they played wrong"? Like no, you were playing wrong the whole time.

replies(2): >>40722227 #>>40723528 #
30. advael ◴[] No.40719825{5}[source]
Okay I admit I'm confused and think I probably missed a crucial thing here. You're saying the publicly available problem set isn't indicative of the distribution of the test set? If so, I can see why you object to that. Still, it's potentially possible that the test's intention is to demonstrate something like progressive integration of compositionality given an abstract model. A lot of machine learning systems can do well as long as they've seen an example similar to the problem they've been presented, but can't do things like respond to a situation that presents them with a novel composition of two abstractions they seem to have already learned in the way a human can trivially.

Like only having [1+1=2, 4+5=9, 2+10=12] in the training set and [2*5=10, 3/4=.75, 2^8=256] in the test set would be bad, but something like [1+1=2, 3+4*2=11, 5*3=15, 2*7=14, 1+3/5=1.8, 3^3=27] vs [2+4*3=14, 3+3^2+4=16, 2*3/4+2^3/2^4=2] might not be, depending on what they're trying to test

Compositionality of information, especially of abstractions (like rules or models of a phenomenon), is a key criterion in a lot of people's attempts to operationally define "intelligence" (which I agree is overall a nebulous and overloaded concept, but if we're going to make claims about it we need at least a working definition for any particular test we're doing) I could see that meaning that the test set problems need to be "harder" in the sense that presenting compositions of rules in training doesn't preclude memorizing the combinations. But this is just a guess, I'm not involved in ARC and don't know, obviously*

replies(1): >>40720838 #
31. YeGoblynQueenne ◴[] No.40720015{3}[source]
I don't. I'm guessing -and it's nothing but a guess- that for every problem that can be solved with intelligence there exists a solution that does not require intelligence. I'm guessing in other words that intelligence is the ability to come up with solutions to arbitrary problems. If that's true then there's no way to test for intelligence by looking at the performance of a system at any particular task, or any finite set of tasks, and so there's no way to create a "test for intelligence".

My guess is supported by the experience that, in AI research, every time someone came up with a plausible test for intelligence, an AI system eventually passed the test only to make it clear that the test was not really testing intelligence after all (edit: I don't just mean formal tests; e.g. see how chess used to "require intelligence" right up until Deep Blue vs Kasparov).

Some people see that as "moving the goalposts" and it's certainly frustrating but the point is that we don't know what intelligence is, exactly, so it's very hard to test for its existence or not, or to measure it.

My preference would be for everyone in AI research to either stop what they're doing and try to understand what the hell intelligence is in the first place, to create a theory of intelligence so that AI can be a scientific subject again, or to at least admit they're not interested in creating artificial intelligence. I, for example, am not, but all my background is in subjects that are traditionally labelled "AI" so I have to suck it up, I guess.

replies(1): >>40720977 #
32. baobabKoodaa ◴[] No.40720086{6}[source]
This was extremely cringe worthy to read. You are confidently wrong about this. You are trying to redefine well established terminology. I don't care about whatever random wiki page you might find to "support your claims". Anybody who has worked a lot on algorithms (including myself) knows what brute force means and this is not it.

Also: lol at your "who hasn't" comment. Because you clearly haven't.

replies(1): >>40720217 #
33. YeGoblynQueenne ◴[] No.40720145{5}[source]
Yeah, very good points. To be fair there are people who have argued the big data side who have clearly solid knowledge of AI and are not just SV suits, for example I remember Yann LeCun in a debate with Christopher Manning, where Manning was arguing for the importance of "structure" and LeCun was arguing against it. Or see the "Bitter Lesson", mentioned in a parent comment. That may have become a total shibboleth of the Silicon bros but Rich Sutton, who wrote the eponymous article, is the guy who wrote the book on Reinforcement Learning (literally). And then Rodney Brooks replied with his "Better Lesson" (https://rodneybrooks.com/a-better-lesson/). So there's a lot of debate in this and I don't reckon we'll have a consensus soon. It should be clear which side I'm on- I work with firmly model-based AI ("planning is the model-based approach to autonomous behaviour" has become my shibboleth - see Bonet and Geffner's book on planning: https://link.springer.com/book/10.1007/978-3-031-01564-9) so maybe it's a déformation professionnelle. And even LeCun's recent plans for JEPA are very consciously model-based, except he wants to learn his models from data; which is not a bad idea I suppose.
replies(2): >>40720643 #>>40721179 #
34. lelanthran ◴[] No.40720217{7}[source]
> You are trying to redefine well established terminology.

Reference? Link, even?

> don't care about whatever random wiki page you might find to "support your claims".

That isn't some "random wiki" page; that's the wikipedia page for this specific term.

I'm not claiming to have defined this term, I'm literally saying I only agree with the sources for this term.

> Also: lol at your "who hasn't" comment. Because you clearly haven't.

Talk about cringe-worthy.

replies(1): >>40720602 #
35. blobbers ◴[] No.40720526{5}[source]
The problem is there is no way to infer the right answer to 0! given the training. You need more context to learn it. Humans need more context. If you put that at the end of every grade 1 math test no student would get it right unless they had some context.

Do grade 1 kids have AGI? (Haha)

But seriously, all professions need to train in context to solve complex problems. You can train in adjacent realms and reason about problems but to truly perform, you need more training.

A general surgeon might be better than an electrician as a vet, but I'd rather have a veterinary surgeon operate on my dog.

So some things are “AGI” able and other things need specific training.

replies(1): >>40720905 #
36. baobabKoodaa ◴[] No.40720602{8}[source]
> Reference? Link, even?

Sure, here's definition for "brute force" from university textbook material written by pllk, who has taught algorithms for 20 years and holds a 2400 rating on Codeforces:

https://tira.mooc.fi/kevat-2024/osa9/

"Yleispätevä tapa ratkaista hakuongelmia on toteuttaa raakaan voimaan (brute force) perustuva haku, joka käy läpi kaikki ratkaisut yksi kerrallaan."

edit:

Here's an English language book written by the same author, though the English source does not precisely define the term:

https://cses.fi/book/book.pdf

In chapter 5:

"Complete search is a general method that can be used to solve almost any algorithm problem. The idea is to generate all possible solutions to the problem using brute force ..."

And a bit further down chapter 5:

"We can often optimize backtracking by pruning the search tree. The idea is to add ”intelligence” to the algorithm so that it will notice as soon as possible if a partial solution cannot be extended to a complete solution. Such optimizations can have a tremendous effect on the efficiency of the search."

Your mistake is that you for some reason believe that any search over solution space is a brute force solution. But there are many ways to search over a solution space. A "dumb search" over solution space is generally considered to be brute force, whereas a "smart search" is generally not considered to be brute force.

Here's the Codeforces profile of the author: https://codeforces.com/profile/pllk

edit 2:

Ok now I think I understand what causes your confusion. When an author writes "One way to speed up a brute-force algorithm ..." you think that the algorithm can still be called "brute force" after whatever optimizations were applied. No. That's not what that text means. This is like saying "One way to make a gray car more colorful is by painting it red". Is it still a gray car after it has been painted red? No it is not.
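
To make the distinction the textbook draws concrete, here is a small illustrative contrast on an arbitrary toy problem (subset-sum, not from the thread): complete enumeration versus backtracking that prunes partial solutions.

    from itertools import combinations

    def subset_sum_brute_force(nums, target):
        """Complete search: enumerate every subset and test each one."""
        for r in range(len(nums) + 1):
            for combo in combinations(nums, r):
                if sum(combo) == target:
                    return combo
        return None

    def subset_sum_pruned(nums, target, chosen=(), idx=0):
        """Backtracking with pruning: abandon any partial solution that
        already overshoots the target (assumes non-negative numbers)."""
        if sum(chosen) == target:
            return chosen
        if idx == len(nums) or sum(chosen) > target:
            return None  # prune this branch
        return (subset_sum_pruned(nums, target, chosen + (nums[idx],), idx + 1)
                or subset_sum_pruned(nums, target, chosen, idx + 1))

    print(subset_sum_brute_force([3, 9, 8, 4, 5, 7], 15))  # (8, 7)
    print(subset_sum_pruned([3, 9, 8, 4, 5, 7], 15))       # (3, 8, 4)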

37. advael ◴[] No.40720643{6}[source]
I've commented here before that I find myself really conflicted on LeCun's public statements. I think it's really hard to reconcile the fact that he's undeniably a world-leading expert with the fact that he does work for and represent a tech company in a big way, which means that it's hard to tell when what he says, especially publicly, is filtered through that lens, either explicitly or just via cultural osmosis. I know some people still in academia (e.g. "Bitter Lesson") are following suit but given how much of this field has been scooped up by large tech firms, this necessarily means that what we get out of research from those firms is partially filtered through them. Like it sounds like you're in CS/AI academia so I'm sure you're familiar with the distorting effect this brain drain has had on the field. Research out of places like FAIR or deepmind or OpenAI (arguably they were different until about 2019 or so? Hard to say how much of that was ever true unfortunately) is being done and published by world-leading experts hired by these companies and obviously this research has continued to be crucial to the field, but the fact that it's in industry means there are obviously controls on what they can publish, and the culture of an institution like Facebook is definitely going to have some different effects on priorities than that of most universities, and so while we can all collectively try to take it all with a grain of salt in some way, there is no way to be careful enough to avoid tribal knowledge in the field being heavily influenced by the cultures and priorities of these organizations.

But even if this kind of thinking is totally organic, I think it could arise from the delayed nature of the results of data-driven methods. Often a major structural breakthrough for a data-driven approach drastically predates the most obviously impactful results from that breakthrough, because the result impressive enough to draw people's attention comes from throwing lots of data and compute at the breakthrough. The people who got the impressive result might not even be the same team as the one that invented the structure they're relying on, and it's really easy to get the impression that what changed the game was the scale alone, I imagine even if you're on one of those research teams. I've been really impressed by some of the lines of research that show that you can often distill some of these results to not rely so heavily on massive datasets and enormous parallel training runs, and think we should properly view results that come from these to be demonstrations of the power of the underlying structural insights rather than new results. But I think this clashes with the organizational priorities of large tech firms, which often view scale as a moat, and thus are motivated to emphasize the need for it

replies(1): >>40723513 #
38. astromaniak ◴[] No.40720676{5}[source]
> First, LLMs are not AGI.

It's the most generic thing we have right now, right?

> Never will be.

If there is no other breakthrough anytime soon we can engineer AGI-like things around LLMs. I mean an LLM trained to use different attachments, which can be other models and algorithms. Examples would be image recognition models and databases for algorithms. Even now ChatGPT can use Bing search and a Python interpreter. First steps done, others will follow. The result will not be a true AGI, but still a very capable system. And there is another factor. Next models can be trained on high quality data generated by current models, instead of random internet garbage. This should improve their spatial and logical abilities.

replies(1): >>40725199 #
39. ◴[] No.40720800[source]
40. rfoo ◴[] No.40720838{6}[source]
> You're saying the publicly available problem set isn't indicative of the distribution of the test set?

Yes. From https://arcprize.org/guide:

    Please note that the public training set consists of simpler tasks whereas the public evaluation set is roughly the same level of difficulty as the private test set.
    The public training set is significantly easier than the others (public evaluation and private evaluation set) since it contains many "curriculum" type tasks intended to demonstrate Core Knowledge systems. It's like a tutorial level.
replies(1): >>40720966 #
41. advael ◴[] No.40720905{6}[source]
I think there's variance in people's degree of compositionality, as well as how quickly they can pick up on novel relationships. Testing "intelligence" in humans has always been kind of fraught in the first place, but any capability we may care to measure is going to permit degrees, and there will be some variance in humans on it. We should expect this. There's variance in goddam everything

We should also expect machine learning systems to have somewhat different properties from human minds. Like computers are more likely to accomplish perfect recall, and we can scale the size of their memory and their processing speed. All these confounding variables can make it hard to make binary tests of a capability, which is really what ARC seems like it's trying to do. One such capability that AI researchers will often talk about is conceptual compositionality. People care about compositionality because it's a good way to demonstrate that an abstract model is being used to reason about a situation, which can be used in unseen but perhaps conceptually similar situations. This "generalization" or "abstraction" capability is really the goal, but it's hard to reason about how to test it, and "composition" (That is, taking a situation that's novel, but a straightforward application of two or more different abstractions the agent should already "know") is one more testable way to try to tease it out.

As you point out, humans often fail this kind of test, and we can rightly claim that in those cases, they didn't correctly grasp the insight we were hoping they had. Testing distilled abstractions versus memorization or superficial pattern recognition isn't just important to AI research, it's also a key problem in lots of places in human education

42. advael ◴[] No.40720966{7}[source]
Well, in this paragraph they seem to explain that their public evaluation set is meant to be indicative of the kind of jump in difficulty you can expect from the private test set. This to me implies that my guess is close: They're looking for models that can learn simple concepts and apply them to complex problems. Keeping the test set private seems to be an attempt at making it difficult to "cheat" at this by simply memorizing superficial details of the more complex problem set, which makes sense given that the whole point of this seems to be testing for systems that can use learned abstractions to tackle novel, out-of-distribution problems

Like with our toy "algebra" examples, sure there's a lot of emphasis on repetition and rote in primary education on these subjects, and that's one way to get people more consistent at getting the calculations right, but to be frank I don't think it's the best way, or as crucial as it's made out to be. What someone really needs to understand about algebra is how the notation works and what the symbols mean. Like I can't unsee the concept of "+" as a function that takes two operands and starts counting for as many steps as one would in the right operand, starting at the value of the left operand. When looking at algebra, the process I go through relies on a bunch of conceptual frameworks, like "Anything in the set of all arabic numerals can be considered a literal value". "Anything in the roman alphabet is likely a variable". "Any symbol is likely an infix operator, that is, a function whose operands are on either side of it". Some of the concepts I'm using are just notational convention. At some point I memorized the set of arabic numerals, what they look like, what each of them means, how they're generally written in relation to each other to express quantities combinatorically. Some of the concepts are logical relations about quantities, or definitions of functions. But crucially, the form of these distillations makes them composable. If I didn't really understand what "+" does, then maybe someone could give me some really bad homework that goes

1 + 30 = 31

20 + 7 = 27

3 + 10 = 13

And then present me the problem

20 + 10 + 3 = ?

And I'd think the answer is

20 + 10 + 3 = 213

That demonstrates some model of how to do these calculations, but it doesn't really capture all the important relationships the symbols represent

We can have any number of objections to this training set. Like I wasn't presented with any examples of adding two-digit numbers together! OR even any examples where I needed to combine numbers in the same rank!

Definitely all true. Probably mistakes we could make in educating a kid on algebraic notation too. It's really hard to do these things in a way that's both accomplishing the goal and testable, quantifiable. But many humans demonstrate the ability to distill conceptual understanding of concepts without exhaustive examples of their properties, so that's one of the things ARC seems to want to test. It's hard to get this perfectly right, but it's a reasonable thing to want

replies(1): >>40725300 #
43. Nimitz14 ◴[] No.40720977{4}[source]
You're basically paraphrasing fchollet's paper on intelligence and what he talked about in his most recent podcast appearance with dwarkesh.
replies(1): >>40726505 #
44. barfbagginus ◴[] No.40721179{6}[source]
The recent result shows SOTA progress from something as goofy as generating 5000 python programs until 0.06% of them pass the unit tests. We can imagine our own brains having a thousand random subconscious pre-thoughts before our consciously registered thought is chosen and amplified out of the hallucinatory subconscious noise. We're still at a point where we're making surprising progress from simple feedback loops, external tools and checkers, retries, backtracking, and other bells and whistles to the LLM model. Some of these even look like world models.

So maybe we can cure LLMs of the hallucinatory leprosy just by bathing them about 333 times in the mundane Jordan river of incremental bolt ons and modifications to formulas.

You should be able to think of the LLM as a random hallucination generator then ask yourself "how do I wire ten thousand random hallucination generators together into a brain?" It's almost certain that there's an answer... And it's almost certain that the answer is even going to be very simple in hindsight. Why? Because llms are already more versatile than the most basic components of the brain and we have not yet integrated them in the scale that components are integrated in the brain.

It's very likely that this is what our brains do at the component level - we run a bunch of feedback coupled hallucination generators that, when we're healthy, generates a balanced and generalizing consciousness - a persistent, reality coupled hallucinatory experience that we sense and interpret and work within as the world model. That just emerges from a network of self correcting natural hallucinators. For evidence, consider work in Cortical Columns and the Thousand brains theory. This suggests our brains have about a million Cortical Columns. Each loads up random inaccurate models of the world... And when we do integration and error correction over that, we get a high level conscious overlay. Sounds like what the author of the currently discussed SOTA did, but with far more sophistication. If the simplest most obvious approach to jamming 5,000 llms together into a brain gives us some mileage, then it's likely that more reasoned and intelligent approach could get these things doing feats like the fundamentally error prone components of our own brains can do when working together.

So I see absolutely no reason we couldn't build an analogy of that with llms as the base hallucinator. They are versatile and accurate enough. We could also use online training llms and working memory buffers as the base components of a Jepa model.

It's pretty easy to imagine that a society of 5000 gpt4 hallucinators could, with the right self administered balances and utilities, find the right answers. That's what the author did to win the 50%.

Therefore I propose that for the current generation it's okay to just mash a bunch of hallucinators together and whip them into the truth. We should be able to do it because our brains have to be able to do it. And if you're really smart, you will find a very efficient mathematical decomposition... Or a totally new model. But for every current LLM inability, it's likely to turn out that a sequence of simple modifications can solve it. We will probably accrue a large number of such modifications before someone comes along and thinks of an all-new model then does way better, perhaps taking inspiration from the proposed solutions, or perhaps exploring the negative space around those solutions.

replies(1): >>40723557 #
45. killerstorm ◴[] No.40721550{6}[source]
You're asking for a definition of exhaustive search. Exhaustive search, by definition, goes through the entire search space. That's what word exhaustive means.

For a reference, check Cormen's "Introduction to Algorithms". Every mention of brute-force search there refers specifically to exhaustive search, which is not feasible for bigger spaces.

> I mean, the approach under discussion is literally exactly this.

It's literally not. It DOES NOT REDUCE the candidate set. It generates most likely candidates, but it doesn't reduce anything.

You lack basic understanding. Solutions are pixel grids, not Python programs. There's no search over pixel grids in the article. Not every search is exhaustive search.

This is like saying theoretical physicists are "brute-forcing" physics by generating candidate theories and testing them. Ridiculous.

46. entropicdrifter ◴[] No.40722227{6}[source]
Yeah, people get salty when their preconceptions are shattered, especially when they've invested a lot of time/energy in thinking based on the idea that they were sound.

It goes beyond simple sunk cost and into the realm of reality slapping them with a harsh "humans aren't special, grow up", which I think is especially bitter for people who aren't already absurdists or nihilists.

47. YeGoblynQueenne ◴[] No.40723513{7}[source]
Absolutely, industry and its neverending piggy bank have had a severe distorting effect on the direction of research. I'm a post-doc btw, right now working on robotic autonomy. I don't have direct experience of the brain drain- I'm in a UK university- but I can see the obvious results in the published research which has very suddenly lurched towards LLMs recently, as it did a very sudden lurch towards CNNs after 2012 etc.

Like you say, large tech corps clearly see big data approaches as a moat, as a game that they can play better than anyone else: they got the data, they got the compute, and they got the millions to hoover up all the "talent". Obviously, when it's corporations driving research they are not going to drive it towards a deepening of understanding and an enriching of knowledge, the only thing they care about is selling stuff to make money, and to hell with whether that stuff works or not and why. I'm worried even that this is going to have a degrading effect on the output of science and technology in general, not just AI and CS. It's like a substantial minority of many fields of science have given up on basic research and are instead feeding data to big neural nets and poking LLMs to see what will fall out. This is a very bad situation. Not a winter but an Eternal Summer.

Take it away, Tom.

https://youtu.be/0qjCrX9KS88?si=CEXr0cv-lUeG4UNO

replies(1): >>40729684 #
48. YeGoblynQueenne ◴[] No.40723528{6}[source]
No the reason for the disappointment was that early AI pioneers considered chess a model of human intelligence and they expected a chess-playing AI to help them understand how human intelligence works. To have computer chess devolve into a race to beat human champions using techniques that only computers can use clearly defeated this purpose.

Those "early pioneers" were people like Alan Turing, Claude Shannon, Marvin Minsky, Donald Michie and John McCarthy, all of whom were chess players themselves and were prone to thinking of computer chess as a window into the inner workings of the human mind. Here's what McCarthy had to say when Deep Blue beat Kasparov:

In 1965 the Russian mathematician Alexander Kronrod said, "Chess is the Drosophila of artificial intelligence." However, computer chess has developed much as genetics might have if the geneticists had concentrated their efforts starting in 1910 on breeding racing Drosophila. We would have some science, but mainly we would have very fast fruit flies.

Three features of human chess play are required by computer programs when they face harder problems than chess. Two of them were used by early chess programs but were abandoned in substituting computer power for thought.

http://www-formal.stanford.edu/jmc/newborn/newborn.html

Then he goes on to discuss those three features of human chess play. It doesn't really matter which they are but it's clear that he is not complaining about anyone "playing wrong", he's complaining about computer chess taking a direction that fails to contribute to a scientific understanding of human, and I would also say machine, intelligence.

49. YeGoblynQueenne ◴[] No.40723557{7}[source]
Thanks for the comment but I have to say: woa there, hold your horses. Hallucinations as the basis of intelligence? Why?

Think about it this way: ten years ago, would you think that hallucinations have anything to do with intelligence? If it were 2012, would you think that convolutions, or ReLus, are the basis of intelligence instead?

I'm saying there is a clear tendency within AI research, and without, to assume that whatever big new idea is currently trending is "it" and that's how we solve AI. Every generation of AI reseachers since the 1940's has fallen down that pit. In fact, no lesser men than Walter Pitts and Warren McCulloch, the inventors of the artificial neuron in 1943, firmly believed that the basis of intelligence is propositional logic. That's right. Propositional logic. That was the hot stuff at the time. Besides, the first artificial neuron was a propositional logic circuit that learned its own boolean function.

So keep an eye out for being carried away on the wings of the latest hype and thinking we got the solution to every problem just because we can do yet another thing, with computers, that we couldn't do before.

50. advael ◴[] No.40725199{6}[source]
The balance of evidence seems to suggest that training on model outputs leads to a significant degradation in model accuracy; some have even called this "model collapse." I'm not going to say it's impossible that this situation can improve, but your intuition that high quality generated outputs are an obvious means to bootstrap the next leap forward in model quality is definitely a contrarian position as I understand it.
replies(1): >>40747829 #
51. rfoo ◴[] No.40725300{8}[source]
I agree. However, it is not clear cut what's fair and what is "gaming the benchmark" in this setup, for example:

- can I train on my own private training set (which is harder)?

- can I pretrain on The Pile or something similar, a dataset full of texts crawled from web?

- can I pretrain on elementary school textbooks?

It seems like the latter two are acceptable given the use of GPT-4o here. But then, are the latter two that different from the first one? GPT-4o has the public test set in its training data (GPT-4o is definitely trained on public GitHub repos).

What's the point of having a training set with a different distribution in this case, other than making participating harder? Maybe it's to discourage data-hungry approaches, but if there are legit shortcuts, anyone who seriously wants to win will take them.

52. YeGoblynQueenne ◴[] No.40726505{5}[source]
I watched the podcast you're referencing, on youtube, but I don't remember Chollet saying anything like what I say above.

Quite the contrary, Chollet seems convinced that a test for artificial intelligence, like an IQ test for AI, can be created and he has not only created one but also organised a Kaggle competition on it, and now is offering a $1 million prize to solve it. So how is anything he says or does compatible with what I say above, that it's likely there can't be a test for artificial intelligence?

replies(1): >>40760861 #
53. advael ◴[] No.40729684{8}[source]
Hot girl summer is cancelled we got hot GPU trying to bear the weight of humanity's hopes and dreams as they collapse into a single point summer

Hot market forces treated as inevitable as the ever-rising tides summer

Hot war with nuclear powers looming as a possibility on the world stage even as one such power's favored information warfare strategy of flooding all communication channels with noise becomes ever more indistinguishable from those channels' normal state summer

In a mad world, heavy metal oscillates between states of catharsis and prophecy

Anyway I really appreciate your taking the time to respond thoughtfully and am trying to channel your patient approach in my endeavors today. Hope your summer's going well, despite the looming threat of its eternity

54. astromaniak ◴[] No.40747829{7}[source]
There was a report about training a small model on a dataset completely generated by GPT-4. Small stories using kids' vocabulary. Probably (half) a year back. So it's possible. My idea was to mix in a significant portion of generated texts on logic and reasoning, which would be expensive to create using humans but much cheaper using GPT. But you are right, the process is unstable and will (likely) collapse without some extra effort. Mixing is one way; that will make it still respond correctly on original data.
replies(1): >>40757547 #
55. advael ◴[] No.40757547{8}[source]
There's a world of difference between machine teaching approaches that can create a less complex model from a more capable one and bootstrapping a more capable model from synthetic data. And don't get me wrong, it's still very useful to be able to distill models in this way! Like it's in many cases low hanging fruit for optimizing the parameter count or other resource bottlenecks of the models in question. Maybe the original learned representation wasn't the simplest neural network that could approximate the same function to the tolerance we care about. This streamlining can sometimes even induce distillation of certain abstractions, which I think has best been used in Motion Transfer results like Nvidia's Megaportraits or more recently Alibaba's EMO. However, if there's a scale-oriented path to language models that generalize better - or are more controllable, or just do better on established benchmarks - that is currently bottlenecked by available data, it seems unlikely that relying on synthetic data from extant models will get it over that hurdle, and this should kind of match your intuition if you're familiar with the information theory underlying statistical models, which neural networks of any kind are:

A model's predictions are necessarily going to be a compression of the data available to them, and so the hypothetical information-theoretic best case scenario is that a model trained on its own outputs, or even those of models trained in a similar way on similar volumes of data will generate diverse enough data to train a new model to replicate its own performance. In practice, this tends not to happen. Curation of available data can produce models with more focused distributions within the space of models we can feasibly train with the data and resources available, and you can use ensemble learning techniques or I guess stuff like RLHF (Which is kind of a silly framing of that concept as some RL people have pointed out, but it's the one people are familiar with now), but all of this is essentially just moving around in a pareto front which may not contain any "strictly better" model for whatever criteria we care about

I think the scaling laws of these things are running up against some fundamental limits in terms of useful diversity of available data and computational feasibility of meaningful improvements in scale. While hype likes to pretend that anything that happens fast for a while is "exponential", there are lots of other families of functions that appear to shoot upward before plateauing after hitting some fundamental limit, like a sigmoid! To me, it makes more intuitive sense that the capacity of a given model family will hit a plateau than continue scaling indefinitely, especially when we start to run up against dataset limits, and if there's enough more data than the current major tech companies can have already gotten their hands on to train with to a degree that makes a dent, I'd be shocked

That's not to say that impressive results aren't still happening, they're just mostly tackling different problems - various modality transfers, distillation-like improvements that make extant capability sets cheaper (in computational terms) to run, superficial capability shifts that better refine a language model to serve a particular use case, etc. LLMs in their current form are probably in need of another significant qualitative breakthrough to overcome their fundamental problems. They're clearly quite useful to a lot of people in their current form. They just don't live up to all this hype that's flying around.

56. Nimitz14 ◴[] No.40760861{6}[source]
You clearly didn't listen much since Chollet's point is exactly that there is no one task that can test for AI. The test he created is supposed to take that into account.
replies(2): >>40761407 #>>40761459 #
57. ◴[] No.40761407{7}[source]
58. YeGoblynQueenne ◴[] No.40761459{7}[source]
First, don't be a jerk. Second, what's your problem? That you think I said that "no single task can't be used to test for AI"? I initially said:

>> "If that's true then there's no way to test for intelligence by looking at the performance of a system at any particular task, or any finite set of tasks, and so there's no way to create a "test for intelligence"."

Stress on "or any finite set of tasks".

So, no, I didn't refer to a single task, if that's what you mean. What the hell do you mean and what the hell is your problem? Why is everyone always such a dick in this kind of discussion?

replies(1): >>40792883 #
59. Nimitz14 ◴[] No.40792883{8}[source]
Sorry. You, to me, came across as more interested in sharing your opinion than understanding what other people are saying. That's annoying. Maybe that's on me though.

Ok, you think no finite set of tasks can be used. Chollet is trying anyways. Maybe he is actually dynamically creating new tasks in the private set every time someone evaluates.

My main point was that I still think you're saying very similar things, quoting from the paper I mentioned:

> If a human plays chess at a high level, we can safely assume that this person is intelligent, because we implicitly know that they had to use their general intelligence to acquire this specific skill over their lifetime, which reflects their general ability to acquire many other possible skills in the same way. But the same assumption does not apply to a non human system that does not arrive at competence the way humans do. If intelligence lies in the process of acquiring skills then there is no task X such that skill at X demonstrates intelligence, unless X is a meta task involving skill acquisition across a broad range of tasks.

This to me sounds very similar to what you said:

> I'm guessing in other words that intelligence is the ability to come up with solutions to arbitrary problems.

And it's also what Chollet talked about on the pod.