Most active commenters

YeGoblynQueenne(17)
dahart(11)
advael(9)
(9)
FeepingCreature(8)
refulgentis(7)
killerstorm(7)
infgeoax(6)
elicksaur(5)
bongodongobob(5)

Popular/hot comments

>>40712282 #
>>40712174 #
>>40714152 #
>>40712326 #
>>40715123 #
>>40713172 #
>>40713521 #
>>40713006 #
>>40714565 #
>>40712304 #
>>40715482 #
>>40715887 #
>>40717004 #

Getting 50% (SoTA) on Arc-AGI with GPT-4o

(redwoodresearch.substack.com)

1. ◴[17 Jun 24 22:40 UTC] No.40711986[source]▶

>>40711484 (OP) #

2. traject_ ◴[17 Jun 24 22:42 UTC] No.40712005[source]▶

>>40711484 (OP) #

We don't actually know if it is SOTA, the previous SOTA solution also got around the same on the evaluation set.

replies(1): >>40712275 #

3. extr ◴[17 Jun 24 22:42 UTC] No.40712008[source]▶

>>40711484 (OP) #

Very cool. When GPT-4 first came out I tried some very naive approaches using JSON representations on the puzzles [0], [1]. GPT-4 did "okay", but in some cases it felt like it was falling for the classic LLM issue of saying all the right things but then then failing to grasp some critical bit of logic and missing the solution entirely.

At the time I noticed that many of the ARC problems rely on visual-spatial priors that are "obvious" when viewing the grids, but become less so when transmuted to some other representation. Many of them rely on some kind of symmetry, counting, or the very human bias to assume a velocity or continued movement when seeing particular patterns.

I had always thought maybe multimodality was key: the model needs to have similar priors around grounded physical spaces and movement to be able to do well. I'm not sure the OP really fleshes this line of thinking out, brute forcing python solutions is a very "non human" approach.

[0] https://x.com/eatpraydiehard/status/1632671307254099968

[1] https://x.com/eatpraydiehard/status/1632683214329479169

replies(2): >>40712644 #>>40716335 #

4. greatpostman ◴[17 Jun 24 22:57 UTC] No.40712143[source]▶

>>40711484 (OP) #

You know you’re approaching AGI when creating benchmarks gets difficult. This is only just beginning

replies(2): >>40712761 #>>40713061 #

5. rgbrgb ◴[17 Jun 24 22:58 UTC] No.40712154[source]▶

>>40711484 (OP) #

> 50% accuracy on the public test set for ARC-AGI by having GPT-4o

Isn't the public test set public on github and therefore GPT-4o trained on it?

replies(2): >>40712401 #>>40712472 #

6. bashfulpup ◴[17 Jun 24 22:59 UTC] No.40712167[source]▶

>>40711484 (OP) #

I looked at the website and have no idea how Arc is supposed to be AGI.

Can someone explain?

replies(2): >>40712815 #>>40712975 #

7. eigenvalue ◴[17 Jun 24 23:00 UTC] No.40712174[source]▶

>>40711484 (OP) #

The Arc stuff just felt intuitively wrong as soon as I heard it. I don't find any of Chollet's critiques of LLMs to be convincing. It's almost as if he's being overly negative about them to make a point or something to push back against all the unbridled optimism. The problem is, the optimism really seems to be justified, and the rate of improvement of LLMs in the past 12 months has been nothing short of astonishing.

So it's not at all surprising to me to see Arc already being mostly solved using existing models, just with different prompting techniques and some tool usage. At some point, the naysayers about LLMs are going to have to confront the problem that, if they are right about LLMs not really thinking/understanding/being sentient, then a very large percentage of people living today are also not thinking/understanding/sentient!

replies(11): >>40712233 #>>40712290 #>>40712304 #>>40712352 #>>40712385 #>>40712431 #>>40712465 #>>40712713 #>>40713110 #>>40713491 #>>40714220 #

8. HarHarVeryFunny ◴[17 Jun 24 23:08 UTC] No.40712233[source]▶

>>40712174 #

Actually the solution being discussed here is the one that Chollet mentioned in his interview with Dwarkesh, and only bolsters his case.

The LLM isn't doing the reasoning here, it's just pattern matching the before/after diff and generating thousands of Python programs. The actual reasoning is done by an agentic like loop wrapped around the LLM, as described in the linked blog.

replies(1): >>40713162 #

9. cma ◴[17 Jun 24 23:13 UTC] No.40712275[source]▶

>>40712005 #

Yeah and GPT4o was potentially trained on this test set and if the tried to hold it out it was still likely trained on discussions of the problems.

10. mikeknoop ◴[17 Jun 24 23:14 UTC] No.40712282[source]▶

>>40711484 (OP) #

(ARC Prize co-founder here).

Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

Roughly, he's implemented an outer loop and using 4o to sample reasoning traces/programs from training data and test. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.

A couple important notes:

1. this result is on the public eval set vs private set (ARC Prize $).

2. the current private set SOTA ~35% solution also performed ~50% on the public set. so this new result might be SOTA but hasn't been validated or scrutinized yet.

All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard

EDIT: also, congrats and kudos to Ryan for achieving this and putting the effort in to document and share his approach. we hope to inspire more frontier AI research sharing like this

replies(11): >>40712673 #>>40712907 #>>40713440 #>>40714116 #>>40714245 #>>40714428 #>>40715353 #>>40715468 #>>40715482 #>>40716604 #>>40718028 #

11. Smaug123 ◴[17 Jun 24 23:14 UTC] No.40712290[source]▶

>>40712174 #

> a very large percentage of people living today are also not thinking/understanding/sentient

This isn't that big a bullet to bite (https://www.lesswrong.com/posts/4AHXDwcGab5PhKhHT/humans-who... comes from well before ChatGPT's launch), and I myself am inclined to bite it. System 1 alone does not a general intelligence make, although the article is extremely interesting in asking the question "is System 1 plus Python enough for a general intelligence?". But it's not a very relevant philosophical point, because Chollet's position is consistent with humans being obsoleted and/or driven extinct whether or not the LLMs are "general intelligences".

His position is that training LLMs results in an ever-larger number of learned algorithms and no ability to construct new algorithms. This is consistent with the possibility that, after some threshold of size and training, the LLM has learned every algorithm it needs to supplant humans in (say) 99.9% of cases. (It would definitely be going out with a whimper rather than a bang, on that hypothesis, to be out-competed by something that _really is_ just a gigantic lookup table!)

12. threeseed ◴[17 Jun 24 23:16 UTC] No.40712304[source]▶

>>40712174 #

a) 50% result is not solving the problem. Especially when the implementation is brute forcing the problem and is against the spirit of ARC.

b) He is not being overly negative of LLMs. In fact he believes they will play a role in any AGI system.

c) OpenAI CTO has publicly said that ChatGPT 5 will not be significantly better than existing models. So the rate of improvements you believe in simply doesn't match reality.

replies(3): >>40712459 #>>40714041 #>>40714049 #

13. asperous ◴[17 Jun 24 23:19 UTC] No.40712326[source]▶

>>40711484 (OP) #

Having tons of people employ human ingenuity to manipulate existing LLMs into passing this one benchmark kind of defeats the purpose of testing for "AGI". The author points this out as it's more of a pattern matching test.

Though on the other hand figuring out which manipulations are effective does teach us something. And I think most problems boil down to pattern matching, creating a true, easily testable AGI test may be tough.

replies(5): >>40712503 #>>40712555 #>>40712632 #>>40713120 #>>40713156 #

14. traject_ ◴[17 Jun 24 23:22 UTC] No.40712352[source]▶

>>40712174 #

> It's almost as if he's being overly negative about them to make a point or something to push back against all the unbridled optimism.

I don't think it is like that but rather Chollet wants to see stronger neuroplasticity in these models. I think there is a divide between the effectiveness of existing AI models versus their ability to be autonomous, robust and consistently learn from unanticipated problems.

My guess is Chollet wants to see something more similar to biological organisms especially mammals or birds in their level of autonomous nature. I think people underestimate the degree of novel problems birds and mammals alone face in just simply navigating their environment and it is the comparison here that LLMs, for now at least, seem lacking.

So when he says LLMs are not sentient, he's asking to consider the novel problems animals let alone humans have to face in navigating their environment. This is especially apparent in young children but declines as we age and gain experience/lose a sense of novelty.

replies(1): >>40713640 #

15. adroniser ◴[17 Jun 24 23:25 UTC] No.40712385[source]▶

>>40712174 #

I don't see how the point about the typical human is relevant. Either you can reason or you can't, the ARC test is supposed to be an objective way to measure this. Clearly a vanilla LLM currently cannot do this, and somehow an expert crafting a super-specific prompt is supposed to be impressive.

replies(1): >>40712781 #

16. bongodongobob ◴[17 Jun 24 23:27 UTC] No.40712401[source]▶

>>40712154 #

I keep seeing this comment all over the place. Just because something exists 1 time in the training data doesn't mean it can just regurgitate that. That's not how training works. An LLM is not a knowledge database.

replies(2): >>40712453 #>>40713177 #

17. TacticalCoder ◴[17 Jun 24 23:31 UTC] No.40712431[source]▶

>>40712174 #

> I don't find any of Chollet's critiques of LLMs to be convincing. It's almost as if he's being overly negative about them to make a point or something to push back against all the unbridled optimism.

Chollet published his paper On the measure of intelligence in 2019. In Internet time that is a lifetime before the LLM hype started.

replies(2): >>40712651 #>>40712888 #

18. adroniser ◴[17 Jun 24 23:34 UTC] No.40712453{3}[source]▶

>>40712401 #

And yet it doesn't rule out that it can't. See new york times lawsuit

replies(1): >>40712544 #

19. janalsncm ◴[17 Jun 24 23:35 UTC] No.40712459{3}[source]▶

>>40712304 #

For the record, a lot of problems might turn out to be like this, where we figure out a brute force approach that stands in for human creativity.

20. imtringued ◴[17 Jun 24 23:36 UTC] No.40712465[source]▶

>>40712174 #

Yeah I agree. We have reached the end of LLMs. LLMs are infallible and require no further improvement. Anyone who points out shortcomings of current architectures and training approaches should be ignored as a naysayer. Anyone who proposes a solution to perceived flaws is a crank trying to fix something that was never broken. Everyone knows humans are incapable of internal monologues or visualization and vocalisation. Humans don't actually move their lips to speak to produce a sound that can be interpreted by a speaker of the same language, they produce universally understood tokens encoding objective reality and the fact that they use the local language is merely a habit that is hard to break out of.

replies(1): >>40720903 #

21. daemonologist ◴[17 Jun 24 23:36 UTC] No.40712472[source]▶

>>40712154 #

In this case I don't think having seen the Arc set would help much in writing and selecting python scripts to solve the test cases. (Unless someone else has tried this approach before and _their_ results are in the training data.)

It will be good to see the private set results though.

replies(2): >>40712793 #>>40713721 #

22. janalsncm ◴[17 Jun 24 23:40 UTC] No.40712503[source]▶

>>40712326 #

Perhaps if we don’t know how to create an evaluation that can’t be “gamed” it tells us something about how special our intelligence really is?

replies(1): >>40715393 #

23. bongodongobob ◴[17 Jun 24 23:44 UTC] No.40712544{4}[source]▶

>>40712453 #

From old pieces of articles that are quoted all over the internet? That's not surprising.

replies(1): >>40714639 #

24. opdahl ◴[17 Jun 24 23:46 UTC] No.40712555[source]▶

>>40712326 #

Wouldn’t the real AGI test be that an AI would be able to do what the author did here and write this blog post?

replies(2): >>40712730 #>>40716667 #

25. sheeshkebab ◴[17 Jun 24 23:59 UTC] No.40712632[source]▶

>>40712326 #

Show me a test and I’ll show you a neural network that passes it… used to be an saying.

26. trott ◴[18 Jun 24 00:00 UTC] No.40712635[source]▶

>>40711484 (OP) #

François Chollet says LLMs do not learn in-context. But Geoff Hinton says LLMs' few-shot learning compares quite favorably with people!

https://www.youtube.com/watch?v=QWWgr2rN45o&t=46m20s

The truth is in the middle, I think. They learn in-context, but not as well as humans.

The approach in the article hides the unreliability of current LLMs by generating thousands of programs, and still the results aren't human-level. (This is impressive work though -- I'm not criticizing it.)

replies(1): >>40714144 #

27. refulgentis ◴[18 Jun 24 00:00 UTC] No.40712644[source]▶

>>40712008 #

> brute forcing python solutions is a very "non human" approach.

ARC-AGI has odd features that leave me flummoxed by the naming and the attendant prize money and hype.

It is one singular task and frankly I strongly suspect someone could beat it within 30 days[1], in an unsatisfying way, as you note.

There's so much alpha that can be pieced together from here, ex. the last couple Google papers use the 1M context to do *500-shot*, i.e. 500 question answer examples. IIRC most recent showed raising travelling-salesman problem solve rate from 3 to 35%.

[1] I pre-registered this via a Twitter post, about 48 hours ago, i.e. before this result was announced.

replies(2): >>40712736 #>>40714197 #

28. refulgentis ◴[18 Jun 24 00:02 UTC] No.40712651{3}[source]▶

>>40712431 #

Einstein, infamously, couldn't really make much progress with quantum physics, even though he invented the precursors (ex. Brownian motion). Your world model is hard to update.

replies(2): >>40713512 #>>40713624 #

29. refibrillator ◴[18 Jun 24 00:06 UTC] No.40712673[source]▶

>>40712282 #

Do you have any perspectives to share on Ryan's observation of a potential scaling law for these tasks and his comment that "ARC-AGI will be one benchmark among many that just gets solved by scale"?

replies(1): >>40714361 #

30. biophysboy ◴[18 Jun 24 00:13 UTC] No.40712713[source]▶

>>40712174 #

I don't think he's as critical as you say. He just views LLMs as the product of intelligence rather than intelligence itself. LLM fans will say this is a false distinction, I guess.

His definition of intelligence is interesting: something that can quickly achieve tasks with few priors or experience. I also think the idea of using human "Core Knowledge" priors is a clever way to make a test.

31. atroche ◴[18 Jun 24 00:17 UTC] No.40712730{3}[source]▶

>>40712555 #

Yep, but a float is more useful than a bool for tracking progress, especially if you want to answer questions like "how soon can we expect (drivers/customer support staff/programmers) to lose their jobs?"

Hard to find the right float but worth trying I think.

replies(1): >>40713241 #

32. elicksaur ◴[18 Jun 24 00:18 UTC] No.40712736{3}[source]▶

>>40712644 #

The private test set has been available to crack for almost four years now. There was also a monetary prize competition run last year.

In your opinion, what has changed that would accelerate a solution to the next 30 days?

replies(1): >>40713193 #

33. elicksaur ◴[18 Jun 24 00:23 UTC] No.40712761[source]▶

>>40712143 #

Alternatively, society has no common understanding of what AGI means.

replies(1): >>40712937 #

34. eigenvalue ◴[18 Jun 24 00:26 UTC] No.40712781{3}[source]▶

>>40712385 #

The point is that if you have some test of whether an AI is intelligent that the vast majority of living humans would fail or do worse on than gpt4-o (let alone future LLMs) then it’s not a very persuasive argument.

35. TheDudeMan ◴[18 Jun 24 00:28 UTC] No.40712792[source]▶

>>40711484 (OP) #

"Vision is an especially large weakness."

But you can have GPT write code to reliably convert the image grid into a textual representation, right? And code to convert back to image and auto-verify.

replies(2): >>40713429 #>>40727801 #

36. cma ◴[18 Jun 24 00:28 UTC] No.40712793{3}[source]▶

>>40712472 #

Public discussions of solutions to the public test set will presumably have somewhat similar analogies and/or embeddings to aspects of the python programs that solve them.

37. TheDudeMan ◴[18 Jun 24 00:30 UTC] No.40712815[source]▶

>>40712167 #

It is necessary but not sufficient.

If you can't do ARC, you aren't general enough. But even if you can do ARC, you still might not be general enough.

replies(1): >>40712927 #

38. gibsonf1 ◴[18 Jun 24 00:40 UTC] No.40712864[source]▶

>>40711484 (OP) #

Isn't 50% kind of a failing grade?

replies(1): >>40713447 #

39. gwern ◴[18 Jun 24 00:42 UTC] No.40712888{3}[source]▶

>>40712431 #

From Chollet's perspective, the LLM hype started well before, with at least GPT-2 half a year before his paper, and he spent plenty of time mocking GPT-2 on Twitter before he came up with ARC as a rebuttal.

replies(1): >>40713697 #

40. Nimitz14 ◴[18 Jun 24 00:44 UTC] No.40712907[source]▶

>>40712282 #

Ah that's an important detail about public v private. Makes it a nice result but nearly as impressive as initially stated.

41. Workaccount2 ◴[18 Jun 24 00:48 UTC] No.40712927{3}[source]▶

>>40712815 #

It's also possible that you are an AGI and simply cannot pass ARC.

replies(1): >>40713114 #

42. Workaccount2 ◴[18 Jun 24 00:50 UTC] No.40712937{3}[source]▶

>>40712761 #

Which is why the first "artificial" sentient beings will likely go through the wringer that humanity historically put other "sub-human" beings through.

43. biophysboy ◴[18 Jun 24 00:57 UTC] No.40712975[source]▶

>>40712167 #

Its not a test of AGI. It tests whether you possess innate human capacities: rudimentary arithmetic & geometry, etc. Most of the problems were created manually. The original paper states that they limited the test to innate human priors to make the scope well-defined.

44. badrunaway ◴[18 Jun 24 01:02 UTC] No.40713006[source]▶

>>40711484 (OP) #

When we talk about system 2; is it possible that [generating large number of programs; evaluating them of the task; choosing top K outcomes; feeding it back to Neural net] can act as system 2 for a AGI? Isn't that how we think intelligently as well- by making lot of hypothesis internally and evaluating them - and updating our model?

replies(3): >>40713140 #>>40713161 #>>40721168 #

45. Slyfox33 ◴[18 Jun 24 01:09 UTC] No.40713061[source]▶

>>40712143 #

Benchmarks being difficult to create has no connection to something being agi.

46. lassoiat ◴[18 Jun 24 01:17 UTC] No.40713110[source]▶

>>40712174 #

I am a chatGPT fan boy and have been quite impressed by 4o but I will really be impressed when it stops inventing aspects of python libraries that don't exists and instead just tells me it doesn't exist.

It literally just did this for me 15 minutes ago. You can't talk about AGI when it is this easy to push it over the edge into something it doesn't know.

Paper references have got better the last 12 months but just this week it made up both a book and paper for me that do not exist. The authors exist and they did not write what it said they did.

It is very interesting if you ask "do you understand your responses?" sometimes it will say yes and sometimes it will so no not like a human understands.

We should forget about AGI until it can at least say it doesn't know something. It is hardly a sign of intelligence in humans to make up answers to questions you don't know.

replies(1): >>40713818 #

47. TheDudeMan ◴[18 Jun 24 01:18 UTC] No.40713114{4}[source]▶

>>40712927 #

How so? If there is a task that humans can do but the AI cannot, I would not call it AGI. But that's just my definition.

replies(1): >>40713172 #

48. worstspotgain ◴[18 Jun 24 01:19 UTC] No.40713120[source]▶

>>40712326 #

Let me play devil's advocate for a second. Let's suppose that with LLMs, we've actually invented an AGI machine that also happens to produce useful textual responses to a prompt.

This would sound more far-fetched if we knew exactly how they work, bit-by-bit. We've been training them statistically, via the data-for-code tradeoff. The question is not yet satisfactorily answered.

In this hypothetical, for every accusation that an LLM passes a test because it's been coached to do so, there's a counter that it was designed for "excessively human" AGI to begin with, maybe even that it was designed for the unconscious purpose of having humans pass it preferentially. The attorney for the hypothetical AGI in the LLM would argue that there are tons of "LLM AGI" problems it can solve that a human would struggle with.

Fundamentally, the tests are only useful insofar as they let us improve AI. The evaluation of novel approaches to pass them like this one should err in the approaches' favor, IMO. A 'gotcha' test is the least-useful kind.

replies(1): >>40713521 #

49. awwaiid ◴[18 Jun 24 01:23 UTC] No.40713140[source]▶

>>40713006 #

I think it's more like humans are a chaotic choir of subsystems all doing their thing and tossing up their directives until some sort of "win" happens or the volume is loud enough in some direction that it then gets reverse engineered into a "thought". But yes.

replies(1): >>40716559 #

50. yieldcrv ◴[18 Jun 24 01:25 UTC] No.40713156[source]▶

>>40712326 #

its LLM grade school. let them cook, train these things to match utility in our world. I'm not married to the "AGI" goal if there is other utility along the way.

51. spencerchubb ◴[18 Jun 24 01:26 UTC] No.40713161[source]▶

>>40713006 #

Possibly

I think we need those pieces, and also a piece for determining hypotheses in an efficient manner. Monte Carlo Tree Search could be that piece. Probabilistically choose a node to search, and then backpropagate the probabilities back to the root node.

replies(1): >>40716612 #

52. awwaiid ◴[18 Jun 24 01:26 UTC] No.40713162{3}[source]▶

>>40712233 #

When you peer into the soul of the machine it delicately resolves to `while(1){...}`. All Hail The REPL.

53. awwaiid ◴[18 Jun 24 01:29 UTC] No.40713172{5}[source]▶

>>40713114 #

Yeah but if my brother can't pass it, that doesn't mean he is NOT human.

replies(3): >>40713646 #>>40713654 #>>40714594 #

54. spencerchubb ◴[18 Jun 24 01:30 UTC] No.40713177{3}[source]▶

>>40712401 #

It could exist many times. People can fork and clone the repo. People are likely to copy the examples and share them online.

55. refulgentis ◴[18 Jun 24 01:33 UTC] No.40713193{4}[source]▶

>>40712736 #

Prize money meant people would more cleverly strain the rule that "the private test set stays private, no GPT4o, Claude etc.", as shown by the TFA.

This sort of idea would then be shared openly on new sites, creating more attempts. Fallout I did not anticipate was getting widespread attentional on general tech news sites, and then getting public comment from a prize co-founder confirming it was acceptable.

replies(1): >>40713751 #

56. opdahl ◴[18 Jun 24 01:41 UTC] No.40713241{4}[source]▶

>>40712730 #

I agree, but it does seem a bit strange that you are allowed to "custom-fit" an AI program to solve a specific benchmark. Shouldn't there be some sort of rule that for something to be AGI it should work as "off-the-shelf" as possible?

replies(1): >>40713415 #

57. Imnimo ◴[18 Jun 24 01:50 UTC] No.40713305[source]▶

>>40711484 (OP) #

To me the big take-aways here are:

1) Most of the heavy lifting is being done by search. We're talking about having the LLM generate thousands of candidate solutions, and they're mostly bad enough that "just pick the ones that get kinda close on the examples" is a meaningful operation.

2) More samples improves performance despite the fact that GPT-4o's vision is not capable of parsing the inputs. I'm curious how much performance would degrade if you shuffled the images passed to the model (but used the correct images when evaluating which candidates to keep).

3) It's definitely true that the LLM has to be giving you something more than random programs. At the very least, the LLM knows how to craft parsimonious programs that are more likely to be the solution. It may be that it's providing more than that, but it's not clear to me exactly how much information on the correct search space is coming from the hand-crafted examples in the prompt.

Overall, the work to get this far is very impressive, but it doesn't really move the needle for me on whether GPT-4 can do ARC puzzles. It does, however, show me that search is surprisingly powerful on this task.

58. soist ◴[18 Jun 24 02:09 UTC] No.40713415{5}[source]▶

>>40713241 #

If OpenAI had an embedded python interpreter or for that matter an interpreter for lambda calculus or some other equally universal Turing machine then this approach would work but there are no LLMs with embedded symbolic interpreters. LLMs currently are essentially probability distributions based on a training corpus and do not have any symbolic reasoning capabilities. There is no backtracking, for example, like in Prolog.

59. p1esk ◴[18 Jun 24 02:11 UTC] No.40713429[source]▶

>>40712792 #

GPT-4o might not have been trained on a sufficiently large amount of visual data to develop advanced spatial intelligence. Perhaps it needs to see a lot more images, or perhaps it needs to be trained differently (e.g. predict the next frame in a video). I suspect SORA has more spatial intelligence internally than 4o.

replies(1): >>40714606 #

60. hackerlight ◴[18 Jun 24 02:12 UTC] No.40713440[source]▶

>>40712282 #

Reminds me of the AlphaCode approach.

Why do you say it's sampling programs from "training data"? With that choice of words, you're rhetorically assuming the conclusion.

If he only sampled 20 programs, instead of 8000, will we still say the programs came from "training data", or will we say it's genuine OOD generalization? At what point do we attribute the intelligence to the LLM itself instead of the outer loop?

This isn't meant to be facetious. Because clearly, if the N programs sampled is very large, it's easy to get the right solution with little intelligence by relying on luck. But as N gets small the LLM has to be intelligent and capable of OOD generalization, assuming the benchmark is good.

61. p1esk ◴[18 Jun 24 02:13 UTC] No.40713447[source]▶

>>40712864 #

It's only been a week since the million dollar prize was offered. Let's see what the SOTA is in a month.

62. bearjaws ◴[18 Jun 24 02:22 UTC] No.40713480[source]▶

>>40711484 (OP) #

Seems that Arc-AGI is more flawed rather than GPT-4o is more AGI.

Maybe a AI version of Hanlons Razor. Never attribute to AGI what could be easily explained by being in the training set.

63. imperfect_light ◴[18 Jun 24 02:24 UTC] No.40713491[source]▶

>>40712174 #

Did you listen to what Chollet said? How much of LLM improvements are due to enlarging the training sets to cover more problems and how much is due to any emergent properties?

64. imperfect_light ◴[18 Jun 24 02:28 UTC] No.40713512{4}[source]▶

>>40712651 #

A bit of a stretch given that Chollet is a researcher in deep learning and transformers and his criticism is that memorization (training LLMs on lots and lots of problems) doesn't equate to AGI.

replies(1): >>40713693 #

65. vlovich123 ◴[18 Jun 24 02:29 UTC] No.40713521{3}[source]▶

>>40713120 #

There’s every reason to believe that AGI is meaningfully different from LLMs because humans do not take anywhere near this amount of training data to create inferences (that and executive planning and creative problem solving are clear weak spots in LLMs)

replies(3): >>40713651 #>>40714011 #>>40718212 #

66. infgeoax ◴[18 Jun 24 02:52 UTC] No.40713624{4}[source]▶

>>40712651 #

But it's his EPR paper inspired the Bell's inequality and pushed the field further. Yes he was wrong about how reality works, but still he asked the right question.

67. infgeoax ◴[18 Jun 24 02:55 UTC] No.40713640{3}[source]▶

>>40712352 #

Agree. When I first saw ARC, my reaction was this could possibly be the kind of problem that gives us evolutionary pressure.

68. Jensson ◴[18 Jun 24 02:57 UTC] No.40713646{6}[source]▶

>>40713172 #

Could he pass it if he was educated to do the task from birth? Human level intelligence includes being able to be educated, the ML models we have done so far can't be educated so have to match the level of educated humans to compare.

General intelligence as we know it requires ability to receive education.

69. og_kalu ◴[18 Jun 24 02:58 UTC] No.40713651{4}[source]▶

>>40713521 #

>There’s every reason to believe that AGI is meaningfully different from LLMs because humans do not take anywhere near this amount of training data to create inferences

The human brain is millions of years of brute force evolution in the making. Comparing it to a transformer or any other ANN really which essentially start from scratch relatively speaking doesn't mean much.

replies(1): >>40713702 #

70. infgeoax ◴[18 Jun 24 02:58 UTC] No.40713654{6}[source]▶

>>40713172 #

Isn't chatGPT already proven to be smarter than many of us in many ways?

71. bjornsing ◴[18 Jun 24 02:58 UTC] No.40713656[source]▶

>>40711484 (OP) #

Can we be sure GPT-4o hasn’t been trained on the public test set?

replies(1): >>40713748 #

72. refulgentis ◴[18 Jun 24 03:04 UTC] No.40713693{5}[source]▶

>>40713512 #

> A bit of a stretch

Is that true?

C.f. what we're discussing

He's actively encouraging using LLMs to solve his benchmark, called ARC AGI.

8 hours ago, from Chollet, re: TFA

"The best solution to fight combinatorial explosion is to leverage intuition over the structure of program space, provided by a deep learning model. For instance, you can use a LLM to sample a program..."

Source: https://x.com/fchollet/status/1802801425514410275

replies(1): >>40713793 #

73. modeless ◴[18 Jun 24 03:04 UTC] No.40713697{4}[source]▶

>>40712888 #

It's a very convincing rebuttal considering that GPT-3 and GPT-4 came out after ARC but made no significant progress on it. He seemingly had the single most accurate and verifiable prediction of anyone in the world (in 2019) about exactly what type of tasks scaled LLMs would be bad at.

replies(1): >>40729440 #

74. infgeoax ◴[18 Jun 24 03:05 UTC] No.40713702{5}[source]▶

>>40713651 #

Plus it's unclear if the amount of data used to "train" a human brain is really less than what GPT4 used. Imagine all the inputs from all the senses of a human over a lifetime: the sound, light, touches, interactions with peers, etc.

replies(2): >>40714140 #>>40714993 #

75. Truth_In_Lies ◴[18 Jun 24 03:08 UTC] No.40713721{3}[source]▶

>>40712472 #

Yes, someone has tried https://iprc-dip.github.io/DARC/

76. ivalm ◴[18 Jun 24 03:13 UTC] No.40713748[source]▶

>>40713656 #

It has been trained on public test set as it’s on github

replies(1): >>40714301 #

77. elicksaur ◴[18 Jun 24 03:14 UTC] No.40713751{5}[source]▶

>>40713193 #

It seems like you don’t understand the rules of the competition. Entries don’t have access to the internet. The OP acknowledges in their post that this is not eligible for the prize. The HN comment from the prize co-founder specifically says the OP’s claims haven’t been scrutinized. (implicit: they won’t be for the prize set unless the OP submits with an open LLM implementation)

There is a plan for a “public” leaderboard, but it currently has no entries, so we don’t actually know what the SOTA for the unrestrained version is. [1]

The general idea - test time augmentation - is what the current private set SOTA uses. [2] Generating more examples via transforming the samples is not a new idea.

Really, it seems like all the publicity has just gotten a bunch of armchair software architects coming up with 1-4 year-old ideas thinking they are geniuses.

[1] https://arcprize.org/leaderboard

[2] https://lab42.global/community-interview-jack-cole/

replies(2): >>40714174 #>>40716472 #

78. imperfect_light ◴[18 Jun 24 03:25 UTC] No.40713793{6}[source]▶

>>40713693 #

The stretch was in reference to comparing Chollet to Einstein. Chollet clearly understands LLMs (and transformers and deep learning), he simply doesn't believe they are sufficient for AGI.

replies(1): >>40714132 #

79. motoxpro ◴[18 Jun 24 03:33 UTC] No.40713818{3}[source]▶

>>40713110 #

Every time you’re wrong and you disagree with someone who is right you are inventing things that don’t exist.

Unless you’re saying you have never held on to a wrong opinion that was at some point proven to be wrong?

80. visarga ◴[18 Jun 24 04:13 UTC] No.40714011{4}[source]▶

>>40713521 #

How many attempts have there been for humans to solve math or science outstanding problems? We're also kind of spamming with ideas until one works out

replies(1): >>40714044 #

81. ◴[18 Jun 24 04:20 UTC] No.40714041{3}[source]▶

>>40712304 #

82. vlovich123 ◴[18 Jun 24 04:21 UTC] No.40714044{5}[source]▶

>>40714011 #

I’ll give you as much time as you want with an LLM and am 100% sure that it won’t solve a single outstanding complex math problem.

replies(2): >>40714251 #>>40718919 #

83. hackerlight ◴[18 Jun 24 04:21 UTC] No.40714049{3}[source]▶

>>40712304 #

Skeptical about (c), source please. She did say they don't have anything much better than GPT-4o currently, but GPT-5 likely only started training recently.

84. refreshingdrink ◴[18 Jun 24 04:36 UTC] No.40714116[source]▶

>>40712282 #

Also worth nothing that Ryan mentions

> In addition to iterating on the training set, I also did a small amount of iteration on a 100 problem subset of the public test set

and

> it's unfortunate that these sets aren’t IID: it makes iteration harder and more confusing

It’s not unfortunate: generalizing beyond the training distribution is a crucial part of intelligence that ARC is trying to measure! Among other reasons, developing with test-set data is a bad practice in ML because it hides the difficulty this challenge. Even worse, writing about a bunch of tricks that help results on this subset is extending the test-set leakage the blog post's readers. This is why I'm glad the ARC Prize has a truly hidden test set

replies(1): >>40715655 #

85. refulgentis ◴[18 Jun 24 04:39 UTC] No.40714132{7}[source]▶

>>40713793 #

I don't know what you mean, it's a straightforward analogy, but yes, that's right, except for the part where he's heralding this news by telling people the LLM is an underexplored solution space for a possible solution to his AGI benchmark he made to disprove LLMs are AGI.

I don't mean to offend, but to be really straightforward: he's the one saying it's possible they might be AGI now. I'm as flummoxed as you, but I think its hiding the ball to file it under "he doesn't mean what he's saying, because he doesn't believe LLMs can ever be AGI." The only steelman for that is playing at: AGI-my-benchmark, which I say is for AGI, is not the AGI I mean

replies(1): >>40714424 #

86. Jensson ◴[18 Jun 24 04:40 UTC] No.40714140{6}[source]▶

>>40713702 #

But that is of little help when you want to train an LLM to do the job at your company. A human requires just a little bit of tutorials and help, an LLM still require an unknown amount of data to get up to speed since we haven't reached that level yet.

replies(1): >>40714356 #

87. hackpert ◴[18 Jun 24 04:42 UTC] No.40714144[source]▶

>>40712635 #

I'm not sure how to quantify how quickly or well humans learn in-context (if you know of any work on this I'd love to read it!)

In general, there is too much fluff and confusion floating around about what these models are and are not capable of (regardless of the training mechanism.) I think more people need to read Song Mei's lovely slides[1] and related work by others. These slides are the best exposition I've found of neat ideas around ICL that researchers have been aware of for a while.

[1] https://www.stat.berkeley.edu/~songmei/Presentation/Algorith...

88. atleastoptimal ◴[18 Jun 24 04:44 UTC] No.40714152[source]▶

>>40711484 (OP) #

I'll say what a lot of people seem to be denying. GPT-4 is an AGI, just a very bad one. Even GPT-1 was an AGI. There isn't a hard boundary between non AGI and AGI. A lot of people wish there was so they imagine absolutes regarding LLM's like "they cannot create anything new" or something like that. Just think: we consider humans a general intelligence, but obviously wouldn't consider an embryo or infant a general intelligence. So at what point does a human go from not generally intelligent to generally intelligent? And I don't mean an age or brain size, I mean suite of testable abilities.

Intelligence is an ability that is naturally gradual and emerges over many domains. It is a collection of tools via which general abstractive principles can be applied, not a singular universally applicable ability to think in abstractions. GPT-4, compared to a human, is a very very small brain trained for the single purpose of textual thinking with some image capabilities. Claiming that ARC is the absolute market of general intelligence fails to account for the big picture of what intelligence is.

replies(7): >>40714189 #>>40714191 #>>40714565 #>>40715248 #>>40715346 #>>40715384 #>>40716518 #

89. refulgentis ◴[18 Jun 24 04:52 UTC] No.40714174{6}[source]▶

>>40713751 #

> It seems like you don’t understand the rules of the competition.

I don't think you "don't understand" anything :) I'd ask you, politely, to consider that when you're replying to other people in the future.

Better to bring to interactions the prior that your interlocutor is a presumably intelligent individual who can have a different interpretation of the same facts, than decide they just don't get it. The second is a quite lonely path.

> Entries don’t have access to the internet.

Correct. Per TFA, cofounder, Chollet, then me: this is an offline solution: the solution is the Python program found by an LLM.

> The HN comment from the prize co-founder specifically says the OP’s claims haven’t been scrutinized.

Objection: relevancy? Is your claim here that it might be false so we shouldn't be discussing it at all?

> (implicit: they won’t be for the prize set unless the OP submits with an open LLM implementation)

I don't know what this means, "open LLM implementation" is either a term of art I don't recognize, or a misunderstanding of the situation.

I do assume you read the article, so I'm not trying to talk down to you, but to clarify:

The solution is the Python program, not the LLM prompts that iterated on a Python program. A common thread that would describe the confusing experience of reading your comment phrased aggressively and disputing everything up until you agree with me: your observations assume I assume the solution requires a cloud-based LLM to run. As noted above, it doesn't, which is also the thrust of my comment: they found a way to skirt what I thought the rules are, and the co-founder and Chollett have embraced it, publicly.

> There is a plan for a “public” leaderboard, but it currently has no entries, so we don’t actually know what the SOTA for the unrestrained version is. [1]

This was false before you posted, when I checked this morning, and it was false as early as 4 days ago, June 14th, we can confirm via archive.is. (prefix the URL you provided with archive.is/ to check for yourself)

> The general idea - test time augmentation - is what the current private set SOTA uses. [2] Generating more examples via transforming the samples is not a new idea.

Did anyone claim it was?

> Really, it seems like all the publicity has just gotten a bunch of armchair software architects coming up with 1-4 year-old ideas thinking they are geniuses.

I don't know what this means other than you're upset, but yes, sounds like both you and I agree that having an LLM generate Python programs isn't quite what we'd thought would be an AGI solution in the eyes of Chollet.

Alas, here we are.

replies(2): >>40714210 #>>40720941 #

90. blharr ◴[18 Jun 24 04:56 UTC] No.40714189[source]▶

>>40714152 #

The "general" part of AGI implies it should be capable across all types of different tasks. I would definitely call it real Artificial Intelligence, but it's not general by any means.

replies(1): >>40714598 #

91. theptip ◴[18 Jun 24 04:56 UTC] No.40714191[source]▶

>>40714152 #

This seems to be so broad a definition as to no longer mean anything useful.

People in general are interested in capabilities or economic impact, and GPT-2 cleared no notable thresholds in those regards.

I prefer the exact opposite approach: let’s use a strict definition, and have levels to make it really explicit what we are talking about.

Here is a good one:

“Levels of AGI for Operationalizing Progress on the Path to AGI”

https://arxiv.org/abs/2311.02462

replies(1): >>40714948 #

92. nl ◴[18 Jun 24 04:58 UTC] No.40714197{3}[source]▶

>>40712644 #

I don't think this is "non-satisfying" at all.

Program synthesis has been mentioned as a promising approach by François Chollet, and that's exactly what this is.

The place I find slightly unsatisfying is this:

> Sample vast, vast numbers of completions (~5,000 per problem) from GPT-4o.

> Take the most promising 12 completions for each problem, and then try to fix each by showing GPT-4o what this program actually outputs on the examples, and then asking GPT-4o to revise the code to make it correct. We sample ~3,000 completions that attempt to fix per problem in total across these 12 starting implementations.

I'd been tossing around a MCTS idea similar to AlphaGo, based on the idea that the end transformation is a series of sub-transformations. I feel like this could work well alongside the GPT-4o completion catalog. (This isn't an original observation or anything)

replies(2): >>40714631 #>>40716375 #

93. nl ◴[18 Jun 24 05:03 UTC] No.40714210{7}[source]▶

>>40714174 #

(Not the OP)

>> (implicit: they won’t be for the prize set unless the OP submits with an open LLM implementation)

> The solution is the Python program, not the LLM prompts that iterated on a Python program. A common thread that would describe the confusing experience of reading your comment phrased aggressively and disputing everything up until you agree with me: your observations assume I assume the solution requires a cloud-based LLM to run. As noted above, it doesn't, which is also the thrust of my comment: they found a way to skirt what I thought the rules are, and the co-founder and Chollett have embraced it, publicly.

I think the implication is that solutions that use an LLM via an API won't be eligible (the "no internet" rule).

This seems obvious to solve: can use GPT4 to generate catalogs in advance and a lesser, local LLM with good code abilities to select them.

I don't see why this skirts any rules you think were implied and I'm puzzled why you think it does.

> sounds like both you and I agree that having an LLM generate Python programs isn't quite what we'd thought would be an AGI solution in the eyes of Chollet.

> Alas, here we are.

Chollet noted that program synthesis was a promising approach, so it's not surprising to me that a program synthesis approach that also uses an LLM is effective.

94. Lockal ◴[18 Jun 24 05:04 UTC] No.40714220[source]▶

>>40712174 #

That's a big jump in generalization that bruteforcing 4 colors in 9x9 grids with 8000 programs has anything near to what sentient human can do.

Back in the days similar generalization was used for Deep Blue chess computer. Computer won in 1997, but the AGI abyss is still as big.

95. ec109685 ◴[18 Jun 24 05:09 UTC] No.40714245[source]▶

>>40712282 #

There are similarities to the approach in this paper (though they trained a model from scratch): https://arxiv.org/pdf/2309.07062

How well would an LLM trained with a huge number of examples do on this test? Essentially with enough attention, Goodhart's law will take over.

96. danielbln ◴[18 Jun 24 05:11 UTC] No.40714251{6}[source]▶

>>40714044 #

I can say the same about myself, and I would probably consider myself generally intelligent.

replies(1): >>40714383 #

97. free_bip ◴[18 Jun 24 05:22 UTC] No.40714301{3}[source]▶

>>40713748 #

This isn't necessarily true, OpenAI isn't open about how their data cleaning process works.

98. infgeoax ◴[18 Jun 24 05:34 UTC] No.40714356{7}[source]▶

>>40714140 #

Yeah humans can generalize much faster than LLM with far fewer "examples" running on sandwiches and coffee.

replies(1): >>40714614 #

99. mikeknoop ◴[18 Jun 24 05:36 UTC] No.40714361{3}[source]▶

>>40712673 #

ARC isn't perfect and I hope ARC is not the last AGI benchmark. I've spoken with a few other benchmark creators looking to emulate ARC's novelty in other domains, so I think we'll see more. The evolution of AGI benchmarks likely needs to evolve alongside the tech -- humans have to design these tasks today to ensure novelty but should expect that to shift.

One core idea we've been advocating with ARC is that pure LLM scaling (parameters...) is insufficient to achieve AGI. Something new is needed. And OPs approach using a novel outer loop is one cool demonstration of this.

100. vlovich123 ◴[18 Jun 24 05:41 UTC] No.40714383{7}[source]▶

>>40714251 #

There’s a meaningful difference between a silicon intelligence and an organic one. Every silicon intelligence is closer to an equally smart clone whereas organic ones have much more variance (not to mention different training).

Anyway, my point was that humans butter direct their energy than randomly spamming ideas, at least with the innovation of the scientific method. But an LLM struggles deeply to perform reasoning.

101. imperfect_light ◴[18 Jun 24 05:51 UTC] No.40714424{8}[source]▶

>>40714132 #

You're reading a whole lot into a tweet, in his interview with Dwarkesh Patel he says, about 20 different times, that scaling LLMs (as they are currently conceived) won't lead to AGI.

replies(1): >>40714964 #

102. sriku ◴[18 Jun 24 05:51 UTC] No.40714428[source]▶

>>40712282 #

Part of the challenge I understood to be learning priors from the training set that can then be applied to an extended private test set. This approach doesn't seem to do any such "learning" on the go. So, supposing it accomplished 85% on the private test set, would it be construed to have won the prize with "we have AGI" being trumpeted out?

103. surfingdino ◴[18 Jun 24 06:14 UTC] No.40714565[source]▶

>>40714152 #

> GPT-4 is an AGI, just a very bad one.

Then stop selling it as a tool to replace humans. A fast moving car breaking through a barrier and flying off the cliff could be called "an airborne means of transportation, just a very bad one" yet nobody is suggesting it should replace school busses if only we could add longer wings to it. What the LLM community refuses to see is that there is a limit to the patience and the financing the rest of the world will grant you before you're told, "it doesn't work mate."

> So at what point does a human go from not generally intelligent to generally intelligent?

Developmental psychology would be a good place to start looking for answers to this question. Also, forgetting scientific approach and going with common sense, we do not allow young humans to operate complex machinery, decide who is allowed to become a doctor, or go to jail. Intelligence is something that is not equally distributed across the human population and some of us never have much of it, yet we function and have a role in society. Our behaviour, choices, preferences, opinions are not just based on our intelligence, but often on our past experiences and circumstances. It is also not the sole quality we use to compare ourselves against each other. A not very intelligent person is capable of making the right choices (an otherwise obedient soldier refusing to press the button and blow up a building full of children); similarly, a highly intelligent person can become a hard-to-find serial criminal (a gynecologist impregnating his patients).

What intelligent and creative people hold against LLMs is not that they replace them, but that they replace them with a shit version of them relegating thousands of years of human progress and creativity to the dustbin of the models and layers of tweaks to the output that still produce unreliable crap. I think the person who wrote this sign summed it up best https://x.com/gvanrossum/status/1802378022361911711

replies(3): >>40714760 #>>40714937 #>>40718593 #

104. TheDudeMan ◴[18 Jun 24 06:20 UTC] No.40714594{6}[source]▶

>>40713172 #

I said AGI. I did not say human.

105. FeepingCreature ◴[18 Jun 24 06:20 UTC] No.40714598{3}[source]▶

>>40714189 #

It's capable of attempting all types of different tasks. That is a novel capability on its own. We're used to GPT's amusing failures at this point, so we forget that there is absolutely no input you could hand to a chess program that would get it to try and play checkers.

Not so with GPT. It will try, and fail, but that it tries at all was unimaginable five years ago.

replies(1): >>40715160 #

106. ◴[18 Jun 24 06:22 UTC] No.40714606{3}[source]▶

>>40713429 #

107. logicchains ◴[18 Jun 24 06:23 UTC] No.40714614{8}[source]▶

>>40714356 #

>Yeah humans can generalize much faster than LLM with far fewer "examples" running on sandwiches and coffee.

This isn't really true. If you give an LLM a large prompt detailing a new spoken language, programming language or logical framework with a couple examples, and ask it to do something with it, it'll probably do a lot better at it than if you just let an average human read the same prompt and do the same task.

replies(1): >>40723361 #

108. bubblyworld ◴[18 Jun 24 06:25 UTC] No.40714631{4}[source]▶

>>40714197 #

Classic, I've been doing the same, writing an alphazero for the transformation part. What seems _much_ harder is picking a decent set of transformations/concepts to work with, or more generally automating that process. Maybe you're right that LLMs could help there!

replies(1): >>40716357 #

109. ben_w ◴[18 Jun 24 06:27 UTC] No.40714639{5}[source]▶

>>40712544 #

That's still sufficient for both The Times and for it to be a potential problem in this case.

110. bongodongobob ◴[18 Jun 24 06:47 UTC] No.40714760{3}[source]▶

>>40714565 #

In response to the sign: then learn to code or make art that is better than AI art.

It's an existential complaint. "Why won't the nerds make something for meeeee." Do it yourself. Make that robot.

Sucks to think that you're not that special. Most art isn't. Most music isn't. Any honest artist will agree. Most professional artists are graphic designers, not brilliant once in a generation visionaries. It's the new excuse for starving artists. AI or no, they'd still be unsuccessful. That's the way it's always been.

replies(2): >>40714852 #>>40714951 #

111. earthnail ◴[18 Jun 24 07:00 UTC] No.40714852{4}[source]▶

>>40714760 #

While that is 100% true, the real problem is that the risk of finding out whether you can make special art has significantly increased. Previously if you didn’t make it as an artist, you could still earn money with other art related tasks - in graphics, many became illustrators. In music, people made music for ads.

That plan B is now going away, and a music career will be much more like a sports career: either you make it in football, or you need to find another career where your football skills won’t be very useful.

That is obviously scary for many.

replies(1): >>40714927 #

112. surfingdino ◴[18 Jun 24 07:12 UTC] No.40714927{5}[source]▶

>>40714852 #

Artists who make it usually have a legend, a story to tell or be told by their friends, associates, agents, publishers, gallerists, etc. That story has a human dimension that touches the rest of us and we somehow connect to it. Van Gogh cut of his ear, we still keep talking about it and wondering why? There is nothing AI can tell us about itself, its "art". The artistic struggle with AI is not about expressing your vision on a canvas in a way that makes others feel what you want them to feel but about forcing it to generate something it is incapable of generating or programmed not to generate. We got to the point where we are given crayons programmed to not draw the things others do not want them to draw or to draw HR-approved version of what the artist wants to draw. The future is now and it's shit.

replies(1): >>40715859 #

113. atleastoptimal ◴[18 Jun 24 07:13 UTC] No.40714937{3}[source]▶

>>40714565 #

> What the LLM community refuses to see is that there is a limit to the patience and the financing the rest of the world will grant you before you're told, "it doesn't work mate."

The point about LLM's is they may have a lot of drawbacks right now but they're improving at a rapid pace. They already are very useful. There are hundreds of stories coming out of companies effectively leveraging them to replace workers in many natural-language related tasks. They're far more useful than a car that goes off a cliff.

Nobody more useful than an LLM is being effectively replaced by an LLM. Those few companies that jump the gun too early are suffering for it.

>That sign

We already have dishwashers and washing machines. Companies are working on making humanoid robots that can do those things, it's just that it's harder to develop a fully-fledged embodied humanoid than it is to create the diffusion models and LLM's being used today. It's not some conspiracy to let AI do all the fun stuff first.

Nobody is preventing anyone from making art or writing poetry. If someone finds value in AI art or writing, either you have to accept that they weren't the audience member you wanted, or you have to accept that your ability to be creative is a learnable algorithm same as anything else.

114. sigmoid10 ◴[18 Jun 24 07:15 UTC] No.40714948{3}[source]▶

>>40714191 #

People will never agree on this. We've known about the concept of intelligence for much, much longer than computers have been around and we still don't have a common definition or set of rules to check. That also makes things like consciousness and death pretty hard to define in medicine, leading to inconsistent rules across jurisdictions. For AGI in particular, I guarantee you that no matter which test gets beaten, the majority of humanity will always just move the goalpost and claim it's not "real" AGI because "reasons." Because the opposite would mean they have to admit that they are now the lesser intelligence on the planet.

115. intended ◴[18 Jun 24 07:15 UTC] No.40714951{4}[source]▶

>>40714760 #

The sign reads (paraphrased): “I want AI to do my dishes so I can do art. Not do art so I can do my dishes”

Your response is “learn to art” “The nerds dont owe you anything.” “Most of You would be unsuccessful anyway”

You brought in absolutely unrelated items.

1) Learn art - that is baked into what the Sign is saying. There is no Terminal Point for being an artist.

2) Nerds dont… - Where nerds come in as a class for this conversation?

2.1) if you can speak for all nerds, please note that I sure as heck dont want Warhammer 40k, I want Star Trek.

3) Most would be unsuccessful - so what?

Are they happy practicing their craft? Do they have the choice to spend their time on those pursuits and enrich their lives, and share their joys with others around them?

116. anoncareer0212 ◴[18 Jun 24 07:17 UTC] No.40714964{9}[source]▶

>>40714424 #

You keep changing topics so I don't get it either, I can attest it's not a fringe view that the situation is interesting, seen it discussed several times today by unrelated people.

replies(1): >>40720617 #

117. alchemist1e9 ◴[18 Jun 24 07:23 UTC] No.40714993{6}[source]▶

>>40713702 #

Don’t forget all the lifetimes of all ancestors as well. A lot of our intelligence is something we are born with and a result of many millions of years of evolution.

118. whiplash451 ◴[18 Jun 24 07:45 UTC] No.40715123[source]▶

>>40711484 (OP) #

The article jumps to the conclusion that "Given that current LLMs can perform decently well on ARC-AGI" after having used multiple hand-crafted tricks to get to these results, including "I also did a small amount of iteration on a 100 problem subset of the public test set" which is hidden in the middle of the article and not mentioned in the bullet list at the top.

Adding the close-to ad-hominem attack on Francois Chollet with the comics at the beginning (Francois never claimed to be a neuro-symbolic believer), this work does a significant disservice to the community.

replies(4): >>40715887 #>>40716039 #>>40716432 #>>40718813 #

119. dahart ◴[18 Jun 24 07:50 UTC] No.40715160{4}[source]▶

>>40714598 #

Its amusing to me how the very language used to describe GPT anthropomorphizes it. GPT wont “attempt” or “try” anything on its own without a human telling it what to try, it has no agenda, no will, no agency, no self-reflection, no initiative, no fear, and no desire. It’s all A and no I.

replies(2): >>40715775 #>>40719521 #

120. Tepix ◴[18 Jun 24 08:07 UTC] No.40715248[source]▶

>>40714152 #

The definition of AGI that i am familiar with is that it can do all (digital) tasks a human can do at the level of an average human.

As long as this level hasn't been achieved in all domains, it isn't AGI.

replies(1): >>40715391 #

121. comfortabledoug ◴[18 Jun 24 08:26 UTC] No.40715339[source]▶

>>40711484 (OP) #

I'm glad someone else finally said it, those born blind cannot possibly have AGI!

/sarcasm :D

122. dr_dshiv ◴[18 Jun 24 08:27 UTC] No.40715346[source]▶

>>40714152 #

Totally agree. So does Peter Norvig: https://www.noemamag.com/artificial-general-intelligence-is-...

123. jd115 ◴[18 Jun 24 08:29 UTC] No.40715353[source]▶

>>40712282 #

Reminds me a bit of Genetic Programming as proposed by John Holland, John Koza, etc. Ever since GPT came out, I've been thinking of ways to combine that original idea with LLMs in some way that would accelerate the process with a more "intelligent" selection.

replies(1): >>40718090 #

124. lucianbr ◴[18 Jun 24 08:35 UTC] No.40715384[source]▶

>>40714152 #

"A car is a plane, just a very bad one. It can do the cruising down the runway part, only the flying part is missing".

replies(1): >>40717643 #

125. lupusreal ◴[18 Jun 24 08:36 UTC] No.40715391{3}[source]▶

>>40715248 #

This seems like a problematic standard. For one, it's very fuzzy. A human who is top 51% and one who's top 49% are very similar to each other and could probably swap places depending on how they're feeling that day; there's nothing fundamentally different going on in their heads. Even at the further ends of the scale, humans have essentially the same kind of brains and minds as each other, some more capable than others but still all belonging to the same category of thinking things. Your AGI definition bifurcates the human population into those that possess general intelligence and those who don't, but this seems hard to justify. At least, hard to justify when drawn there. If you put the line at profound mental retardation where a person can no longer function in society, that would make more sense. A slightly below average human may not be exceptional in any regard but I think they still possess what must be regarded as general intelligence.

Furthermore, you're counting cases where humans do things the computer cannot but ignoring cases where the computer does things humans cannot. For instance, I doubt any human alive, let alone average humans can give reasonable explanations for short snippets of computer code in as many languages as GPT-4o, or formulate poetry in as many styles on arbitrary topics, or rattle off esoteric trivia and opinions about obscure historic topics, .... I think you get the point. It has already surpassed average human abilities in many categories of intellectually challenging tasks, but with your definition if it fails at even one task an average human can do, then it lacks "general intelligence."

I suggest that your definition is one for "AHI" (Average Human Intelligence), not one for "AGI" (Artificial General Intelligence.)

126. lucianbr ◴[18 Jun 24 08:37 UTC] No.40715393{3}[source]▶

>>40712503 #

I don't know how to create a liver, or test one, so what does that say about my liver? Pretty much nothing.

127. lelanthran ◴[18 Jun 24 08:53 UTC] No.40715468[source]▶

>>40712282 #

Maybe I am missing something, but to me this looks like "Let's brute-force on the training data".

I mean, generating tens of thousands of possible solutions, to find one that works does not, to me, signify AGI.

After all, the human solving these problem doesn't make 10k attempts before getting a solution, do they?

The approach here, due to brute force, can't really scale: if a random solution to a very simple problem has a 1/10k chance of being right, you can't scale this up to non-trivial problems without exponentially increasing the computational power used. Hence, I feel this is brute-force.

replies(1): >>40716577 #

128. YeGoblynQueenne ◴[18 Jun 24 08:57 UTC] No.40715482[source]▶

>>40712282 #

Ah, give it a rest. That's not "frontier AI research", neither is it any kind of reasoning. It's the dumbest of the dumb possible generate-and-test approach that spams a fire hose of Python programs until it hits one that works. And still it gets only 50% on the public eval.

How many thousands of Python programs does a human need to solve a single ARC task? That's what you get with reasoning: you don't need oodles of compute and boodles of sampling.

And I'm sorry to be so mean, but ARC is a farce. It's supposed to be a test for AGI but its only defense from a big data approach (what Francois calls "memorisation") is that there are few examples provided. That doesn't make the tasks hard to solve with memorisation it just makes it hard for a human researcher to find enough examples to solve with memorisation. Like almost every other AI-IQ test before it, ARC is testing for the wrong thing, with the wrong assumptions. See the Winograd Schema Challenge (but not yet the Bongard problems).

replies(3): >>40717360 #>>40719608 #>>40720800 #

129. nadam ◴[18 Jun 24 09:18 UTC] No.40715611[source]▶

>>40711484 (OP) #

Amazing work, prompt engineering at its finest. One future direction for Arc AGI could be to use not Python, but a much more concise programming language that is more suited for brute-force methods like genetic mutations. The problem would be of course to train an LLM that is proficient enough in such a language. I am thinking about stack based languages. For this competition I would develop a careful bit-level encoding of a variant of the 'Joy' programming language. (https://en.wikipedia.org/wiki/Joy_(programming_language)) It would be a considerable effort though which I don't have time for, hence I post this idea publicly. A promising direction is a mix of things in my opinion: Special stack-based concise language, consulting LLMs like the OP did, and genetic algorithms combined.

replies(1): >>40716693 #

130. rfoo ◴[18 Jun 24 09:28 UTC] No.40715655{3}[source]▶

>>40714116 #

... and we know that if we really want to nail it we'd better just pay someone else to create 1,000,000 more harder problems for training (without looking at any in test set, of course). i.e. make the training set distribution similar to test set again.

Because the thing we have now is data-hungry. Your brain is pre-trained on other similar challenges as well. What's the point of requiring it to "generalize beyond the training distribution" with so few samples?

Really, I thought LLMs ended this "can we pretrain on in-house prepared private data for ILSVRC" flame war already.

replies(2): >>40715788 #>>40715850 #

131. YeGoblynQueenne ◴[18 Jun 24 09:39 UTC] No.40715723[source]▶

>>40711484 (OP) #

>> Claim 1 seems likely true to me for a reasonable notion of “learning”. I think François Chollet agrees here. Most of my doubts about this claim are concerns that you can basically brute force ARC-AGI without interestingly doing learning (e.g. brute-force search over some sort of DSL or training on a huge array of very similar problems). These concerns apply much less to the kind of approach I used

The approach described in the article is exactly "brute-force search over some sort of DSL". The "DSL" is a model of Python syntax that GPT-4o has learned after training on the entire internet. This "DSL" is locked up in the black box of GPT-4o's weights, but just because no-one can see it, it doesn't mean it's not there; and we can see GPT-4o generating Python programs, so we know it is there, even if we don't know what it looks like.

That DSL may not be "domain specific" in the sense of being specifically tailored to solve ARC-AGI tasks, or any other particular task, but it is "domain specific" in the sense of generating Python programs for some subset of all possible Python programs that includes programs that can solve some ARC-AGI tasks. That's a very broad category, but that's why it over-generates so much: it needs to draw 8k samples total until one works for just 50% of the public eval set.

132. FeepingCreature ◴[18 Jun 24 09:49 UTC] No.40715775{5}[source]▶

>>40715160 #

Do you agree that "there is absolutely no input you could hand to a chess program that would get it to try and play checkers", but there is an input you can hand to GPT-3+ that will get it to try and play pretty much any game imaginable, so long as you agree that its attempt will be very poor?

I don't want to get into the weeds on what intelligence is or what "attempt" means or "try" means (you can probably guess I disagree with your position), but do you have a disagreement on pure input/output behavior? Do you disagree that if I put adequate words in, words will come out that will resemble an attempt to do the task, for nearly any task that exists?

replies(1): >>40720764 #

133. YeGoblynQueenne ◴[18 Jun 24 09:51 UTC] No.40715788{4}[source]▶

>>40715655 #

The problem with that it is we know approaches that can generalise very well from very few examples, even one example, without any kind of pretraining, That requires a good background theory of the target domain (a "world model" in more modern parlance), and we don't know how to automatically generate that kind of theory; only human minds can do it, for now. But given such a theory the number of examples needed can be as few as 1. Clearly, if you can learn from one example, but find yourself using thousands, you've taken a wrong turn somewhere.

The concern with the data-hungry approach to machine learning, that at least some of us have, is that it has given up on the effort to figure out how to learn good background theories and turned instead to getting the best performance possible in the dumbest possible way, relying on the largest available amount of examples and compute. That's a trend against everything else in computer science (and even animal intelligence) where the effort is to make everything smaller, cheaper, faster, smarter: it's putting all the eggs in the basket of making it big, slow and dumb, and hoping that this will somehow solve... intelligence. A very obvious contradiction.

Suppose we lived in a world that didn't have a theory of computational complexity and didn't know that some programs are more expensive to run than others. Would it be the case in that world, that computer scientists competed in solving ever larger instances of the Traveling Salesperson Problem, using ever larger computers, without even trying to find good heuristics exploiting the structure of the problem and simply trying to out-brute-force each other? That world would look a lot like where we are now with statistical machine learning: a pell-mell approach to throwing all resources at a problem that we just don't know how to solve, and don't even know if we can solve.

replies(2): >>40715903 #>>40716388 #

134. advael ◴[18 Jun 24 10:02 UTC] No.40715850{4}[source]▶

>>40715655 #

You seem to misunderstand why generalization is important for making claims about intelligent systems. To illustrate this, we could really easily design a system that encodes all the test set questions and their answers, puts them in an enormous hash table, and looks up the correct answer to each challenge when presented with it. This could probably score 100% on ARC if given the entire test set. Would you call this AGI? What if I put it through a transformer as a hashing function?

The mainstream attention LLMs have garnered has added a bunch of noise to the way we talk about machine learning systems, and unfortunately the companies releasing them are partially to blame for this. That doesn't mean we should change the definition of success for various benchmarks to better suit lay misunderstandings of how this all works

replies(1): >>40717004 #

135. latexr ◴[18 Jun 24 10:03 UTC] No.40715859{6}[source]▶

>>40714927 #

Using van Gogh as an example of “artists who make it” is insane.

Which I guess is appropriate, because he was literally crazy. He suffered from psychotic episodes and delusions and died from suicide, depressed and in poverty.

That’s the opposite of “making it”. It’s zero consolation that people like his work now, he never even knew.

replies(1): >>40716017 #

136. z7 ◴[18 Jun 24 10:06 UTC] No.40715887[source]▶

>>40715123 #

>Francois never claimed to be a neuro-symbolic believer

His response:

"This has been the most promising branch of approaches so far -- leveraging a LLM to help with discrete program search, by using the LLM as a way to sample programs or branching decisions. This is exactly what neurosymbolic AI is, for the record..."

"Deep learning-guided discrete search over program space is the approach I've been advocating, yes... there are many different flavors it could take though. This is one of them (perhaps the simplest one)."

https://x.com/fchollet/status/1802773156341641480

replies(3): >>40715928 #>>40718175 #>>40718230 #

137. advael ◴[18 Jun 24 10:08 UTC] No.40715903{5}[source]▶

>>40715788 #

The formalism that data-driven machine learning leans on is empirical tuning of stochastic search to drive approximation of functions, and despite what Silicon Valley would have you believe, most of the significant advances have been in creating useful meta-structures for modeling certain kinds of problems (e.g. convolution for efficiently processing transformations that care about local structure across dimensions of data, or qkv attention for keeping throughlines of non-local correspondences intact through a long sequence). Neural networks as a flavor of empirical function approximation happened to scale well, and then a bunch of people who saw how much this scale improved the models' capabilities but couldn't be bothered to understand the structural component concluded that scale somehow magically gets you to every unsolved problem being solved. It's also convenient for business types that if you buy this premise, any unicorn they want to promise is just a matter of throwing obscene amounts of resources at the problem (through their company of course)

I think probably the general idea of dynamic structures that are versatile in their ability to approximate functional models is at least a solid hypothesis for how some biological intelligence works at some level (I think maybe the "fluid/crystallized" intelligence distinction some psychology uses is informative here - a strong world model probably informs a lot of quick acquisition of relationships, but most intelligent systems clearly posess strong feedback mechanisms for capturing new models), though I definitely agree that a focus on how best to throw a ton of scale at these models doesn't seem like a fruitful path for actionably learning how to build or analyze intelligent systems in the way we usually think about, nor is it, well, sustainable. Moore's law appeals to business people because buying more computronium feels more like a predictable input-output relationship to put capital into, but even if we're just talking about raw computation speed advances in algorithms tend to dwarf advances in computing power in the long run. I think the same will hold true in AGI

replies(1): >>40720145 #

138. YeGoblynQueenne ◴[18 Jun 24 10:13 UTC] No.40715928{3}[source]▶

>>40715887 #

That kind of neuro-symbolic AI is a bit like British cuisine: place two different things next to each other in the same plate, like bangers and mash, and call it "a dish".

Nope. This is neurosymbolic AI:

Abductive Knowledge Induction From Raw Data

https://www.doc.ic.ac.uk/~shm/Papers/abdmetarawIJCAI.pdf

That's a symbolic learning engine trained in tandem with a neural net. The symbolic engine is learning to label examples for the neural net that learns to label examples for the symbolic engine. I call that cooking!

(Full disclosure: the authors of the paper are my thesis advisor and a dear colleague).

139. surfingdino ◴[18 Jun 24 10:27 UTC] No.40716017{7}[source]▶

>>40715859 #

> Using van Gogh as an example of “artists who make it” is insane.

So is building a tool that will only generate "approved" art. We need to be able to express our idea, feelings, our perception of the world in ways that do not fit corporate standards of text, audio, or visual communication. It's part of being human.

replies(2): >>40716142 #>>40716922 #

140. bogtog ◴[18 Jun 24 10:30 UTC] No.40716039[source]▶

>>40715123 #

For what it's worth, the comic is based on a well-known meme, and the author must've wanted to stick to the format: https://media.licdn.com/dms/image/D4E10AQFryt0thryEeA/image-...

141. latexr ◴[18 Jun 24 10:44 UTC] No.40716142{8}[source]▶

>>40716017 #

> So is building a tool that will only generate "approved" art.

And so is eating ice cream with your forehead. Are we just doing non sequiturs now? I didn’t defend image generation tools in the slightest.

> We need to be able to express our idea, feelings, our perception of the world in ways that do not fit corporate standards of text, audio, or visual communication.

I agree. My point started and ended with “van Gogh in an awful example when talking about artist who ‘made it’”. That’s it. There is nothing in there to be extrapolated to AI or any other subject.

142. YeGoblynQueenne ◴[18 Jun 24 11:11 UTC] No.40716335[source]▶

>>40712008 #

>> GPT-4 did "okay", but in some cases it felt like it was falling for the classic LLM issue of saying all the right things but then then failing to grasp some critical bit of logic and missing the solution entirely.

It still is. It misses the solution so comprehensively that it needs an outer loop to figure out which one is the solution out of 8k programs GPT-4o generates.

replies(1): >>40717373 #

143. luke-stanley ◴[18 Jun 24 11:14 UTC] No.40716357{5}[source]▶

>>40714631 #

Reminds me of NVIDIA Eureka: https://github.com/eureka-research/Eureka

replies(2): >>40716671 #>>40717870 #

144. YeGoblynQueenne ◴[18 Jun 24 11:15 UTC] No.40716375{4}[source]▶

>>40714197 #

>> Program synthesis has been mentioned as a promising approach by François Chollet, and that's exactly what this is.

To be precise, "this" -a bog-standard generate-and-test approach- is the dumbest possible way to do program synthesis. It's like sorting lists with bogosort and a very big computer.

It's exactly like bogosort: generate permutations and test. Except of course the system that generates permutations costs a few millions(?).

replies(1): >>40716650 #

145. yccs27 ◴[18 Jun 24 11:18 UTC] No.40716388{5}[source]▶

>>40715788 #

Sadly, right now the "throw lots of compute at it in the dumbest possible way" models work, and the "learn good background theories" approaches have gone nowhere. It's Rich Sutton's Bitter Lesson and a lot of us aren't ready to accept it.

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

replies(1): >>40719660 #

146. htrp ◴[18 Jun 24 11:18 UTC] No.40716397[source]▶

>>40711484 (OP) #

The expectation is that you'll have to have dynamically generated benchmarks with better eval at some point given the potential for brute forcing the private validation set.

147. killerstorm ◴[18 Jun 24 11:22 UTC] No.40716432[source]▶

>>40715123 #

I think this work is great.

A lot of top researchers claim that obvious deficiencies in LLM training are fundamental flaws in transformer architecture, as they are interested in doing some new research.

This work show that temporary issues are temporary. E.g. LLM is not trained on grid inputs, but can figure things out after preprocessing.

replies(1): >>40718119 #

148. sparsely ◴[18 Jun 24 11:25 UTC] No.40716460[source]▶

>>40711484 (OP) #

You can have a go at the problems here: https://arcprize.org/play?task=00576224

None of them are terribly hard but some aren't trivial either, a couple took me a bit of thinking to work out. By far the most tedious part is inputting the result (I didn't bother after the first) which is definitely something AI is better at!

149. YeGoblynQueenne ◴[18 Jun 24 11:27 UTC] No.40716472{6}[source]▶

>>40713751 #

I wish this comment was less confrontational because there's useful information in it and several points I agree with.

replies(1): >>40720803 #

150. killerstorm ◴[18 Jun 24 11:34 UTC] No.40716518[source]▶

>>40714152 #

Yes. GPT-3 was a clear AGI signal: "language models are few-shot learners". I.e. they can figure a pattern from few examples to apply it to something useful. That's general intelligence.

But people choose to be in denial.

151. badrunaway ◴[18 Jun 24 11:39 UTC] No.40716559{3}[source]▶

>>40713140 #

like darwin selection between the subsystem approaches? Put in a lot of different kind of LLMs and let them play the same game inside with each other.. whosoever wins the simulation is allowed to externally present the approach... something like that?

152. killerstorm ◴[18 Jun 24 11:40 UTC] No.40716577{3}[source]▶

>>40715468 #

10000 samples are nothing compared to 2^100 possible outputs. It is absolutely, definitely not a "brute search". Testing a small fraction of possibilities (e.g. 0.000001%) is called heuristics, and that's what people use too.

Please learn a bit of combinatorics.

> After all, the human solving these problem doesn't make 10k attempts before getting a solution, do they?

No. People have much better "early rejection", also human brain has massive parallel compute capacity.

It's ridiculous to demand GPT-4 performs as good as a human. Obviously its vision is much worse and it doesn't have 'video' and physics priors people have, so it has to guess more times.

replies(1): >>40716765 #

153. machiaweliczny ◴[18 Jun 24 11:44 UTC] No.40716604[source]▶

>>40712282 #

Do you accept such solutions as legit? It’s obviously is easier to generate program that to make prompt that will solve it

154. badrunaway ◴[18 Jun 24 11:44 UTC] No.40716612{3}[source]▶

>>40713161 #

Intuitively I feel efficiency is the outcome of existing world model.. approach can look like yours - I don't see why there has not been efforts on scaling monte carlo tree search for extending the existing world model via tree search. My guess is that it would diverge to hallucinations too fast because it doesn't have a strong logical building block already

155. bubblyworld ◴[18 Jun 24 11:47 UTC] No.40716650{5}[source]▶

>>40716375 #

Bogosort is driven by 0 heuristics - just shuffle and play. Using an LLM as a high-level prior over your search is very different, and the author had to do a lot of problem-specific tuning to make it work well.

replies(1): >>40720214 #

156. killerstorm ◴[18 Jun 24 11:50 UTC] No.40716667{3}[source]▶

>>40712555 #

I won't be surprised if GPT-5 would be able to do it: it knows that it's LLM, so it knows its limitations. It can write code to pre-process input in a format which is better understood, etc.

https://chatgpt.com/share/2fde1db5-00cf-404d-9ae5-192aa5ac90...

GPT-4 created a plan very similar to the article, i.e. it also suggested using Python to pre-process data. It also suggested using program synthesis. So I'd say it's already 90% there.

> "Execute the synthesized program on the test inputs."

> "Verify the outputs against the expected results. If the results are incorrect, iteratively refine the hypotheses and rules."

So people saying that it's ad-hoc are wrong. LLMs know how to solve these tasks, they are just not very good at coding, and iterative refinement tooling is in infancy.

157. bubblyworld ◴[18 Jun 24 11:51 UTC] No.40716671{6}[source]▶

>>40716357 #

Very nice! Thanks for the link, that's great inspiration.

158. killerstorm ◴[18 Jun 24 11:52 UTC] No.40716685[source]▶

>>40711484 (OP) #

FWIW GPT-4 is able to generate a plan very similar to one in the article: also involves feature extraction, program synthesis, iterative refinement.

https://chatgpt.com/share/2fde1db5-00cf-404d-9ae5-192aa5ac90...

So it's pretty close to being able to plan solution completely on its own. It's just rather bad at coding and visual inputs, so it doesn't know what it doesn't know.

159. fire_lake ◴[18 Jun 24 11:53 UTC] No.40716693[source]▶

>>40715611 #

S expressions are great for making valid syntax more likely - an old trick from Genetic Programming.

160. lelanthran ◴[18 Jun 24 12:01 UTC] No.40716765{4}[source]▶

>>40716577 #

> 10000 samples are nothing compared to 2^100 possible outputs. It is absolutely, definitely not a "brute search". Testing a small fraction of possibilities (e.g. 0.000001%) is called heuristics, and that's what people use too.

Brute searching literally means generating solutions until one works. Which is exactly what is being done here.

> Please learn a bit of combinatorics.

Don't be condescending - I understand the problem space just fine. Fine enough to realise that the problem was constructed specifically to ensure that "solutions" such as this just won't work.

Which is why this "solution" is straight-up broken (doesn't meet the target, exceeds the computationally bounds, etc).

> It's ridiculous to demand GPT-4 performs as good as a human.

Wasn't the whole point of this prize to spur interest in a new approach to learning? What does GPT-[1234] have to do with the contest rules? Especially since this solution broke those rules anyway?

> Obviously its vision is much worse and it doesn't have 'video' and physics priors people have, so it has to guess more times.

That's precisely my point - it has to guess. Humans aren't guessing for those types of problems (not for the few that I saw anyway).

replies(2): >>40717295 #>>40718880 #

161. bongodongobob ◴[18 Jun 24 12:18 UTC] No.40716922{8}[source]▶

>>40716017 #

So don't use the tool or make your own. You don't get to tell people what to use or like.

162. rfoo ◴[18 Jun 24 12:28 UTC] No.40717004{5}[source]▶

>>40715850 #

First, LLMs are not AGI. Never will be. Can we talk now?

> if given the entire test set.

I don't want the entire test set. Or any single one in the test set.

The problem here is ARC challenge deliberately give a training set with different distribution than both the public and the private test set. It's like having only 1+1=2, 3+5=8, 9+9=18 in training set and then 1+9=10, 5*5=25, 16/2=8, (0!+0!+0!+0!)!=24 in test set.

I can see the argument of "giving the easy problems as demonstration of rules and then with 'intelligence' [1] you should be able to get harder ones (i.e. a different distribution)", but I don't believe it's a good way to benchmark current methods, mainly because there are shortcuts. Like I can teach my kids how factorial works and ! means factorial, instead of teaching them how addition works only and make them figure out how multiplication, division and factorial works and what's the notation.

[1] Whatever that means.

replies(3): >>40719825 #>>40720526 #>>40720676 #

163. ealexhudson ◴[18 Jun 24 12:58 UTC] No.40717295{5}[source]▶

>>40716765 #

I think to be clear, brute force generally means an iterative search of a solution space. I don't think that's what this system is doing, and it's not like it's following some search path and returning as early as possible.

It's similar that a lot of wrong answers are being thrown up, but I think this is more like a probabilistic system which is being pruned than a walk of the solution space. It's much smarter, but not as smart as we would like.

replies(1): >>40717802 #

164. ◴[18 Jun 24 13:04 UTC] No.40717360{3}[source]▶

>>40715482 #

165. ealexhudson ◴[18 Jun 24 13:05 UTC] No.40717373{3}[source]▶

>>40716335 #

We don't really know what GPT-4 "is". I remember reading a number of relatively well-informed suggestions that there are a number of a models inside there, and the API being interacted with is some form of outer-loop around them.

I don't think the location of the outer-loop or the design of it really makes much difference. There is no flock of birds without the individuals, the flock itself doesn't really exist as a tangible thing, but what arises out of the collective adjustments between all these individuals gives rise to a flock. Similarly, we may find groups of LLMs and various outer control loops give rise to an emergent phenomena much greater than the sum of their parts.

replies(1): >>40717527 #

166. YeGoblynQueenne ◴[18 Jun 24 13:20 UTC] No.40717527{4}[source]▶

>>40717373 #

>> We don't really know what GPT-4 "is".

Yes, we do. It's a language model.

167. Fatalist_ma ◴[18 Jun 24 13:31 UTC] No.40717643{3}[source]▶

>>40715384 #

Someone could say that, but we know and we can prove that a car is not a plane. An example of a bad plane would be the Wright Flyer.

168. lelanthran ◴[18 Jun 24 13:45 UTC] No.40717802{6}[source]▶

>>40717295 #

> I think to be clear, brute force generally means an iterative search of a solution space.

Sure, but not an exhaustive one - you stop when you get a solution[1]. Brute force does not require an exhaustive search in order to be called brute-force.

GP was using the argument that because it is not exhaustive, it cannot be brute-force. That's the wrong argument. Brute-force doesn't have to be exhaustive to be brute-force.

[1] Or a good enough solution.

replies(1): >>40717996 #

169. nl ◴[18 Jun 24 13:51 UTC] No.40717870{6}[source]▶

>>40716357 #

Great link, thanks.

170. naasking ◴[18 Jun 24 14:01 UTC] No.40717996{7}[source]▶

>>40717802 #

A brute force search can be expected to find a solution after a more thorough search of the space of possibilities. If it really is only searching 0.000001% of that space before finding solutions, then some structure of the problem is guiding the search and it's no longer brute force.

171. data_maan ◴[18 Jun 24 14:05 UTC] No.40718028[source]▶

>>40712282 #

It's not that novel. Others have implemented this approach , in the context of mathematics.

Already the 2021 paper Drori (and many papers since) did similar things.

It's a common idea in this space...

172. lachlan_gray ◴[18 Jun 24 14:10 UTC] No.40718090{3}[source]▶

>>40715353 #

I’d love to hear more about this!

173. whiplash451 ◴[18 Jun 24 14:15 UTC] No.40718119{3}[source]▶

>>40716432 #

My claim is _not_ that this work is not useful. But however "great" your work is, misleading on the steps you took during your experiments and overselling your results is never a valid approach in research.

replies(1): >>40718747 #

174. ◴[18 Jun 24 14:19 UTC] No.40718175{3}[source]▶

>>40715887 #

175. uptownfunk ◴[18 Jun 24 14:20 UTC] No.40718187[source]▶

>>40711484 (OP) #

Arc agi is a small stepping stone to agi but is not agi.

Program search mimics what humans do to a certain extent but not in entirety.

A more general world model and reference will be required for agi.

176. bongodongobob ◴[18 Jun 24 14:22 UTC] No.40718212{4}[source]▶

>>40713521 #

Our compute architecture has been brute forced via an revolutionary algorithm over a billion years. An LLM approaching our capabilities in like a year is pretty fucking good.

177. whiplash451 ◴[18 Jun 24 14:24 UTC] No.40718230{3}[source]▶

>>40715887 #

Indeed. Francois Chollet himself said during his interview with Dwarkesh that he is not against LLMs and in fact believes that the long-term solution mixes LLMs with something else which has not been discovered yet (his bet is on discrete program search but is open to anything else).

Pitching him against LLMs in such a binary fashion is deceiving and unfair.

178. empath75 ◴[18 Jun 24 15:00 UTC] No.40718593{3}[source]▶

>>40714565 #

> Then stop selling it as a tool to replace humans

I don't understand why people assume that the purpose of any tool is to "replace humans". Automation doesn't replace humans and never has and never will. It simply does certain tasks that humans used to do, freeing people up to do different tasks. There is not a limited amount of work that can be done, there isn't a limited amount of _creative_ work that can be done. Even if AIs were good enough to do every creative task done by humans today (and they aren't and won't be any time soon), that doesn't mean that humans will have nothing of value to produce, or that humans will have been "replaced". There is always going to be work for humans to do, even in a universe where AI have super human capabilities at all tasks.

In particular, human beings strongly value the opinions and creative output of _human beings_ simply for the reason that they are human and similar to them. That will never change, no matter how intelligent that AIs get.

replies(2): >>40719125 #>>40720168 #

179. killerstorm ◴[18 Jun 24 15:14 UTC] No.40718747{4}[source]▶

>>40718119 #

This is a blog post, sir. All details are written down. He's very clear about methods, it seems you're 1) biased; 2) have too high standards for blog posts.

180. kalkin ◴[18 Jun 24 15:19 UTC] No.40718813[source]▶

>>40715123 #

The comic at the beginning paints the "stack more layers" LLM people as clowns, not neurosymbolic people or by proxy Chollet. Yes, it suggests the "stack more layers" approach works anyway, but in a self-deprecating way...

If this article wanted to attack Chollet, it could have made more hay out of another thing that's "hidden in the middle of the article", the note that the solution actually gets 72% on the subset of problems on which humans get ~85%. The fact that the claimed human baseline for ARC-AGI as a whole is based on an easy subset is pretty suspect.

181. cchance ◴[18 Jun 24 15:21 UTC] No.40718832[source]▶

>>40711484 (OP) #

LOL i looked at that first complex test sample and closed the page, it made my brain hurt.

182. coolspot ◴[18 Jun 24 15:27 UTC] No.40718919{6}[source]▶

>>40714044 #

> I’ll give you as much time as you want with an LLM

With infinite amount of time you can LLM brute force whole search space. Infinite monkeys with typewriters.

183. surfingdino ◴[18 Jun 24 15:41 UTC] No.40719125{4}[source]▶

>>40718593 #

> I don't understand why people assume that the purpose of any tool is to "replace humans".

Because of this https://news.ycombinator.com/item?id=40070566

184. lupusreal ◴[18 Jun 24 16:18 UTC] No.40719521{5}[source]▶

>>40715160 #

That's not an interesting argument, all you're doing is staking out words that are reserved for beings with souls or something. It's like saying submarines can't swim. It's an argument about linguistics, not capabilities.

replies(1): >>40720610 #

185. lelanthran ◴[18 Jun 24 16:21 UTC] No.40719551{6}[source]▶

>>40718880 #

> I studied algorithms for years.

Who hasn't?

> You're 100% WRONG on everything you wrote.

Maybe you should update the wikipedia page, then all the other textbooks, that uses a definition of brute-force that matches my understanding of it.

From https://en.wikipedia.org/wiki/Brute-force_search

> Therefore, brute-force search is typically used when the problem size is limited, or when there are problem-specific heuristics that can be used to reduce the set of candidate solutions to a manageable size.

Further, in the same page https://en.wikipedia.org/wiki/Brute-force_search#Speeding_up...

> One way to speed up a brute-force algorithm is to reduce the search space, that is, the set of candidate solutions, by using heuristics specific to the problem class.

I mean, the approach under discussion is literally exactly this.

Now, Mr "ACM ICPC, studied algorithms for years", where's your reference that reducing the solution space using heuristics results in a non-brute-force algorithm?

replies(2): >>40720086 #>>40721550 #

186. jononor ◴[18 Jun 24 16:27 UTC] No.40719608{3}[source]▶

>>40715482 #

Do you have any suggestions for a better approach of testing artificial intelligence? I mean, in a way that allows comparing different approaches and being a reasonable metric of progress.

replies(1): >>40720015 #

187. lesuorac ◴[18 Jun 24 16:32 UTC] No.40719660{6}[source]▶

>>40716388 #

> that ``brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

Mostly tangential to the article but I never really like this argument. Like you're playing a game a specific way and somebody else comes in with a new approach and mops the floor with you and you're going to tell me "they played wrong"? Like no, you were playing wrong the whole time.

replies(2): >>40722227 #>>40723528 #

188. advael ◴[18 Jun 24 16:51 UTC] No.40719825{6}[source]▶

>>40717004 #

Okay I admit I'm confused and think I probably missed a crucial thing here. You're saying the publicly available problem set isn't indicative of the distribution of the test set? If so, I can see why you object to that. Still, it's potentially possible that the test's intention is to demonstrate something like progressive integration of compositionality given an abstract model. A lot of machine learning systems can do well as long as they've seen an example similar to the problem they've been presented, but can't do things like respond to a situation that presents them with a novel composition of two abstractions they seem to have already learned in the way a human can trivially.

Like only having [1+1=2, 4+5=9, 2+10=12] in the training set and [2*5=10, 3/4=.75, 2^8=256] in the test set would be bad, but something like [1+1=2, 3+4*2=11, 5*3=15, 2*7=14, 1+3/5=1.8, 3^3=27] vs [2+4*3=14, 3+3^2+4=16, 2*3/4+2^3/2^4=2] might not be, depending on what they're trying to test

Compositionality of information, especially of abstractions (like rules or models of a phenomenon), is a key criterion in a lot of people's attempts to operationally define "intelligence" (which I agree is overall a nebulous and overloaded concept, but if we're going to make claims about it we need at least a working definition for any particular test we're doing) I could see that meaning that the test set problems need to be "harder" in the sense that presenting compositions of rules in training doesn't preclude memorizing the combinations. But this is just a guess, I'm not involved in ARC and don't know, obviously*

replies(1): >>40720838 #

189. YeGoblynQueenne ◴[18 Jun 24 17:13 UTC] No.40720015{4}[source]▶

>>40719608 #

I don't. I'm guessing -and it's nothing but a guess- that for every problem that can be solved with intelligence there exists a solution that does not require intelligence. I'm guessing in other words that intelligence is the ability to come up with solutions to arbitrary problems. If that's true then there's no way to test for intelligence by looking at the performance of a system at any particular task, or any finite set of tasks, and so there's no way to create a "test for intelligence".

My guess is supported by the experience that, in AI research, every time someone came up with a plausible test for intelligence, an AI system eventually passed the test only to make it clear that the test was not really testing intelligence after all (edit: I don't just mean formal tests; e.g. see how chess used to "require intelligence" right up until Deep Blue vs Kasparov).

Some people see that as "moving the goalposts" and it's certainly frustrating but the point is that we don't know what intelligence is, exactly, so it's very hard to test for its existence or not, or to measure it.

My preference would be for everyone in AI research to either stop what they're doing and try to understand what the hell intelligence is in the first place, to create a theory of intelligence so that AI can be a scientific subject again, or to at least admit they're not interested in creating artificial intelligence. I, for example, am not, but all my background is in subjects that are traditionally labelled "AI" so I have to suck it up, I guess.

replies(1): >>40720977 #

190. baobabKoodaa ◴[18 Jun 24 17:21 UTC] No.40720086{7}[source]▶

>>40719551 #

This was extremely cringe worthy to read. You are confidently wrong about this. You are trying to redefine well established terminology. I don't care about whatever random wiki page you might find to "support your claims". Anybody who has worked a lot on algorithms (including myself) knows what brute force means and this is not it.

Also: lol at your "who hasn't" comment. Because you clearly haven't.

replies(1): >>40720217 #

191. YeGoblynQueenne ◴[18 Jun 24 17:28 UTC] No.40720145{6}[source]▶

>>40715903 #

Yeah, very good points. To be fair there are people who have argued the big data side who have clearly solid knowledge of AI and are not just SV suits, for example I remember Yann LeCun in a debate with Christopher Manning, where Manning was arguing for the importance of "structure" and LeCun was arguing against it. Or see the "Bitter Lesson", mentioned in a parent comment. That may have become a total shibboleth of the Silicon bros but Rich Sutton, who wrote the eponymous article, is the guy who wrote the book on Reinforcement Learning (literally). And then Rodney Brooks' replied with his "Better Lesson" (https://rodneybrooks.com/a-better-lesson/). So there's a lot of debate in this and I don't reckon we'll have a consensus soon. It should be clear which side I'm on- I work with firmly model-based AI ("planning is the model-based approach to autonomous behaviour" has become my shibboleth - see Bonnet and Geffner's book on planning: https://link.springer.com/book/10.1007/978-3-031-01564-9) so maybe it's a deformation professionelle. And even LCun's recent plans for JEPA are very consciously model-based, except he wants to learn his models from data; which is not a bad idea I suppose.

replies(2): >>40720643 #>>40721179 #

192. ◴[18 Jun 24 17:30 UTC] No.40720168{4}[source]▶

>>40718593 #

193. YeGoblynQueenne ◴[18 Jun 24 17:35 UTC] No.40720214{6}[source]▶

>>40716650 #

But he tuned the "test" side of the generate-and-test loop, not the "generate" side. The "generate" side remains a big permutation generator that is also very hard to control. The current highest-ranked system on the private test set of the ARC-AGI (at 34%) is another LLM fine-tuned on manually created examples of ARC tasks, so that would indeed be messing with the generator part of the loop. I'm guessing performance will jump when someone puts the two together.

A heuristic btw, is something completely different than fine tuning, or filtering. Heuristic search is the closest thing we have to an approximation of the kind of goal-driven behaviour we see in animal intelligence.

I think you could argue that gradient optimisation or any kind of optimisation of some kind of objective function is the same (Rich Sutton has a paper titled "Reward is all you need"). I'm not sure where I stand with that.

replies(1): >>40725485 #

194. lelanthran ◴[18 Jun 24 17:35 UTC] No.40720217{8}[source]▶

>>40720086 #

> You are trying to redefine well established terminology.

Reference? Link, even?

> don't care about whatever random wiki page you might find to "support your claims".

That isn't some "random wiki" page; that's the wikipedia page for this specific term.

I'm not claiming to have defined this term, I'm literally saying I only agree with the sources for this term.

> Also: lol at your "who hasn't" comment. Because you clearly haven't.

Talk about cringe-worthy.

replies(1): >>40720602 #

195. blobbers ◴[18 Jun 24 18:09 UTC] No.40720526{6}[source]▶

>>40717004 #

The problem is there is no way to infer the right answer to 0! given the training. You need more context to learn it. Humans need more context. If you put that at the end of every grade 1 math test no student would get it right unless they had some context.

Do grade 1 kids have AGI? (Haha)

But seriously, all professions need to train in context to solve complex problems. You can train in adjacent realms and reason about problems but to truly perform, you need more training.

A general surgeon might be better than an electrician as a vet, but that I’d rather have a veterinary surgeon operate on my dog.

So some things are “AGI” able and other things need specific training.

replies(1): >>40720905 #

196. baobabKoodaa ◴[18 Jun 24 18:18 UTC] No.40720602{9}[source]▶

>>40720217 #

> Reference? Link, even?

Sure, here's definition for "brute force" from university textbook material written by pllk, who has taught algorithms for 20 years and holds a 2400 rating on Codeforces:

https://tira.mooc.fi/kevat-2024/osa9/

"Yleispätevä tapa ratkaista hakuongelmia on toteuttaa raakaan voimaan (brute force) perustuva haku, joka käy läpi kaikki ratkaisut yksi kerrallaan."

edit:

Here's an English language book written by the same author, though the English source does not precisely define the term:

https://cses.fi/book/book.pdf

In chapter 5:

"Complete search is a general method that can be used to solve almost any algorithm problem. The idea is to generate all possible solutions to the problem using brute force ..."

And a bit further down chapter 5:

"We can often optimize backtracking by pruning the search tree. The idea is to add ”intelligence” to the algorithm so that it will notice as soon as possible if a partial solution cannot be extended to a complete solution. Such optimizations can have a tremendous effect on the efficiency of the search."

Your mistake is that you for some reason believe that any search over solution space is a brute force solution. But there are many ways to search over a solution space. A "dumb search" over solution space is generally considered to be brute force, whereas a "smart search" is generally not considered to be brute force.

Here's the Codeforces profile of the author: https://codeforces.com/profile/pllk

edit 2:

Ok now I think I understand what causes your confusion. When an author writes "One way to speed up a brute-force algorithm ..." you think that the algorithm can still be called "brute force" after whatever optimizations were applied. No. That's not what that text means. This is like saying "One way to make a gray car more colorful is by painting it red". Is it still a gray car after it has been painted red? No it is not.

197. dahart ◴[18 Jun 24 18:19 UTC] No.40720610{6}[source]▶

>>40719521 #

Disagree. The ability to “try” is a capability, and GPT doesn’t have it. Everything in the world that we’ve called “intelligent” up to this point has had autonomy and self motivation, and GPT doesn’t have those things. GPT doesn’t grow and doesn’t learn from its mistakes. GPT won’t act without a prompt, this is a basic fact of its design, and I’m not sure why people are suddenly confused about this.

replies(1): >>40729644 #

198. imperfect_light ◴[18 Jun 24 18:20 UTC] No.40720617{10}[source]▶

>>40714964 #

He's said it pretty clearly, an LLM could be part of the solution in combination with program synthesis, but an LLM alone won't achieve AGI.

199. advael ◴[18 Jun 24 18:24 UTC] No.40720643{7}[source]▶

>>40720145 #

I've commented here before that I find myself really conflicted on LeCunn's public statements. I think it's really hard to reconcile the fact that he's undeniably a world-leading expert with the fact that he does work for and represent a tech company in a big way, which means that it's both hard to tell when what he says, especially publicly, is filtered through that lens, either explicitly or just via cultural osmosis. I know some people still in academia (e.g. "Bitter Lesson") are following suit but given how much of this field has been scooped up by large tech firms, this necessarily means that what we get out of research from those firms is partially filtered through them. Like it sounds like you're in CS/AI academia so I'm sure you're familiar with the distorting effect this brain drain has had on the field. Research out of places like FAIR or deepmind or OpenAI (arguably they were different until about 2019 or so? Hard to say how much of that was ever true unfortunately) are being done and published by world-leading experts hired by these companies and obviously this research has continued to be crucial to the field, but the fact that it's in industry means there's obviously controls on what they can publish, and the culture of an institution like Facebook is definitely going to have some different effects on priorities than that of most universities, and so while we can all collectively try to take it all with a grain of salt in some way, there is no way to be careful enough to avoid tribal knowledge in the field being heavily influenced by the cultures and priorities of these organizations.

But even if this kind of thinking is totally organic, I think it could arise from the delayed nature of the results of data-driven methods. Often a major structural breakthrough for a data-driven approach drastically predates the most obviously impactful results from that breakthrough, because the result impressive enough to draw people's attention comes from throwing lots of data and compute at the breakthrough. The people who got the impressive result might not even be the same team as the one that invented the structure they're relying on, and it's really easy to get the impression that what changed the game was the scale alone, I imagine even if you're on one of those research teams. I've been really impressed by some of the lines of research that show that you can often distill some of these results to not rely so heavily on massive datasets and enormous parallel training runs, and think we should properly view results that come from these to be demonstrations of the power of the underlying structural insights rather than new results. But I think this clashes with the organizational priorities of large tech firms, which often view scale as a moat, and thus are motivated to emphasize the need for it

replies(1): >>40723513 #

200. astromaniak ◴[18 Jun 24 18:28 UTC] No.40720676{6}[source]▶

>>40717004 #

> First, LLMs are not AGI.

It's the most generic thing we have right now, right?

> Never will be.

If there is no other breakthrough anytime soon we can engineer AGI-like things around LLMs. I mean LLM trained to use different attachments. Which can be other models and algorithms. Examples will be image recognition models and databases for algorithms. Even now ChatGPT can use Bing search and Python interpreter. First steps done, others will follow. The result will be not a true AGI, but still a very capable system. And there is another factor. Next models can be trained on high quality data generated by current models. Instead of internet random garbage. This should improve their spacial and logical abilities.

replies(1): >>40725199 #

201. dahart ◴[18 Jun 24 18:39 UTC] No.40720764{6}[source]▶

>>40715775 #

You’re trying to avoid addressing my point. What can GPT do that’s interesting without a human in the loop doing the prompting?

Lol “very poor”. You’re attempting to argue that if there’s any output at all in response to an input prompt, then GPT is “trying” and showing signs of intelligence, no matter what the output is. By this logic, you contradicted yourself: the chess engine can play checkers, poorly. By this logic, asking the sky to play a game means the sky is trying because it changes, or asking a random number generator to play a game means it resembles an attempt to play because there is “very poor” output.

There are lots of games GPT can’t play, like hide-and-seek, tag, and tennis. Playing a game means playing by the rules of the game, giving coherent output, and trying to win. GPT can’t play games it hasn’t seen before, and no I don’t agree that “very poor” output counts. It doesn’t (currently) learn the rules from your prompts; you can’t teach it to play a new game by talking to it, and the “very poor” output from a game it wasn’t trained on will never improve. And, to my actual point, GPT will not play any games at all unless you ask it to.

replies(2): >>40721802 #>>40725464 #

202. ◴[18 Jun 24 18:44 UTC] No.40720800{3}[source]▶

>>40715482 #

203. elicksaur ◴[18 Jun 24 18:44 UTC] No.40720803{7}[source]▶

>>40716472 #

Hey! Genuinely, thank you for the feedback!

204. rfoo ◴[18 Jun 24 18:48 UTC] No.40720838{7}[source]▶

>>40719825 #

> You're saying the publicly available problem set isn't indicative of the distribution of the test set?

Yes. From https://arcprize.org/guide:

    Please note that the public training set consists of simpler tasks whereas the public evaluation set is roughly the same level of difficulty as the private test set.
    The public training set is significantly easier than the others (public evaluation and private evaluation set) since it contains many "curriculum" type tasks intended to demonstrate Core Knowledge systems. It's like a tutorial level.

replies(1): >>40720966 #

205. mrtranscendence ◴[18 Jun 24 18:54 UTC] No.40720903{3}[source]▶

>>40712465 #

Sometimes, when I'm undertaking the arduous work of assigning probabilities to everything I could possibly say next in a conversation, I wish that I weren't merely a stochastic autoregressive next-token generator. Them's the breaks, though.

206. advael ◴[18 Jun 24 18:54 UTC] No.40720905{7}[source]▶

>>40720526 #

I think there's variance in people's degree of compositionality, as well as how quickly they can pick up on novel relationships. Testing "intelligence" in humans has always been kind of fraught in the first place, but any capability we may care to measure is going to permit degrees, and there will be some variance in humans on it. We should expect this. There's variance in goddam everything

We should also expect machine learning systems to have somewhat different properties from human minds. Like computers are more likely to accomplish perfect recall, and we can scale the size of their memory and their processing speed. All these confounding variables can make it hard to make binary tests of a capability, which is really what ARC seems like it's trying to do. One such capability that AI researchers will often talk about is conceptual compositionality. People care about compositionality because it's a good way to demonstrate that an abstract model is being used to reason about a situation, which can be used in unseen but perhaps conceptually similar situations. This "generalization" or "abstraction" capability is really the goal, but it's hard to reason about how to test it, and "composition" (That is, taking a situation that's novel, but a straightforward application of two or more different abstractions the agent should already "know") is one more testable way to try to tease it out.

As you point out, humans often fail this kind of test, and we can rightly claim that in those cases, they didn't correctly grasp the insight we were hoping they had. Testing distilled abstractions versus memorization or superficial pattern recognition isn't just important to AI research, it's also a key problem in lots of places in human education

207. elicksaur ◴[18 Jun 24 18:58 UTC] No.40720941{7}[source]▶

>>40714174 #

From the leaderboard link (and on the archive version):

>ARC-AGI-Pub is a secondary leaderboard (in beta) measuring the public evaluation set. … The public evaluation set imposes no limitations on internet access or compute. At this time, ARG-AGI-Pub is not part of ARC Prize 2024 (eg. no prizes are associated with this leaderboard).

And, all the entries at time of writing and in the archive link say “You?…”. “ARC-AGI 2024 HIGH SCORES” which does have entries is on the private test set.

>I don't think you "don't understand" anything :)

I genuinely don’t understand if we are viewing the same websites.

replies(1): >>40721090 #

208. advael ◴[18 Jun 24 19:00 UTC] No.40720966{8}[source]▶

>>40720838 #

Well, in this paragraph they seem to explain that their public evaluation set is meant to be indicative of the kind of jump in difficulty you can expect from the private test set. This to me implies that my guess is close: They're looking for models that can learn simple concepts and apply them to complex problems. Keeping the test set private seems to be an attempt at making it difficult to "cheat" at this by simply memorizing superficial details of the more complex problem set, which makes sense given that the whole point of this seems to be testing for systems that can use learned abstractions to tackle novel, out-of-distribution problems

Like with our toy "algebra" examples, sure there's a lot of emphasis on repetition and rote in primary education on these subjects, and that's one way to get people more consistent at getting the calculations right, but to be frank I don't think it's the best way, or as crucial as it's made out to be. What someone really needs to understand about algebra is how the notation works and what the symbols mean. Like I can't unsee the concept of "+" as a function that takes two operands and starts counting for as many steps as one would in the right operand, starting at the value of the left operand. When looking at algebra, the process I go through relies on a bunch of conceptual frameworks, like "Anything in the set of all arabic numerals can be considered a literal value". "Anything in the roman alphabet is likely a variable". "Any symbol is likely an infix operator, that is, a function whose operands are on either side of it". Some of the concepts I'm using are just notational convention. At some point I memorized the set of arabic numerals, what they look like, what each of them means, how they're generally written in relation to each other to express quantities combinatorically. Some of the concepts are logical relations about quantities, or definitions of functions. But crucially, the form of these distillations makes them composable. If I didn't really understand what "+" does, then maybe someone could give me some really bad homework that goes

1 + 30 = 31

20 + 7 = 27

3 + 10 = 13

And then present me the problem

20 + 10 + 3 = ?

And I'd think the answer is

20 + 10 + 3 = 213

That demonstrates some model of how to do these calculations, but it doesn't really capture all the important relationships the symbols represent

We can have any number of objections to this training set. Like I wasn't presented with any examples of adding two-digit numbers together! OR even any examples where I needed to combine numbers in the same rank!

Definitely all true. Probably mistakes we could make in educating a kid on algebraic notation too. It's really hard to do these things in a way that's both accomplishing the goal and testable, quantifiable. But many humans demonstrate the ability to distill conceptual understanding of concepts without exhaustive examples of their properties, so that's one of the things ARC seems to want to test. It's hard to get this perfectly right, but it's a reasonable thing to want

replies(1): >>40725300 #

209. Nimitz14 ◴[18 Jun 24 19:01 UTC] No.40720977{5}[source]▶

>>40720015 #

You're basically paraphrasing fchollet's paper on intelligence and what he talked about in his most recent podcast appearance with dwarkesh.

replies(1): >>40726505 #

210. refulgentis ◴[18 Jun 24 19:11 UTC] No.40721090{8}[source]▶

>>40720941 #

> I genuinely don’t understand if we are viewing the same websites.

We are! I missed the nuance on you're looking for a public leaderboard on the private test set. I do see it now, but I'm still confused as to how that's relevant here.

211. bshanks ◴[18 Jun 24 19:19 UTC] No.40721168[source]▶

>>40713006 #

Yes, I think it's very possible that human brains unconsciously generate-and-test surprisingly large numbers of small candidate programs when solving a problem.

This approach is https://en.wikipedia.org/wiki/Embarrassingly_parallel, which is a good fit for biological neural architectures, which have very many computing nodes but each node is very slow (compared to electronic computer CPUs/GPUs).

212. barfbagginus ◴[18 Jun 24 19:21 UTC] No.40721179{7}[source]▶

>>40720145 #

The recent result shows SOTA progress from something as goofy as generating 5000 python programs until 0.06% of them pass the unit tests. We can imagine our own brains having a thousand random subconscious pre thoughts before our consciously registered though is chosen and amplified out of the hallucinatory subconscious noise. We're still at a point where we're making surprising progress from simple feedback loops, external tools and checkers, retries, backtracking, and other bells and whistles to the LLM model. Some of these even look like world models.

So maybe we can cure LLMs of the hallucinatory leprosy just by bathing them about 333 times in the mundane Jordan river of incremental bolt ons and modifications to formulas.

You should be able to think of the LLM as a random hallucination generator then ask yourself "how do I wire ten thousand random hallucination generators together into a brain?" It's almost certain that there's an answer... And it's almost certain that the answer is even going to be very simple in hindsight. Why? Because llms are already more versatile than the most basic components of the brain and we have not yet integrated them in the scale that components are integrated in the brain.

It's very likely that this is what our brains do at the component level - we run a bunch of feedback coupled hallucination generators that, when we're healthy, generates a balanced and generalizing consciousness - a persistent, reality coupled hallucinatory experience that we sense and interpret and work within as the world model. That just emerges from a network of self correcting natural hallucinators. For evidence, consider work in Cortical Columns and the Thousand brains theory. This suggests our brains have about a million Cortical Columns. Each loads up random inaccurate models of the world... And when we do integration and error correction over that, we get a high level conscious overlay. Sounds like what the author of the currently discussed SOTA did, but with far more sophistication. If the simplest most obvious approach to jamming 5,000 llms together into a brain gives us some mileage, then it's likely that more reasoned and intelligent approach could get these things doing feats like the fundamentally error prone components of our own brains can do when working together.

So I see absolutely no reason we couldn't build an analogy of that with llms as the base hallucinator. They are versatile and accurate enough. We could also use online training llms and working memory buffers as the base components of a Jepa model.

It's pretty easy to imagine that a society of 5000 gpt4 hallucinators could, with the right self administered balances and utilities, find the right answers. That's what the author did to win the 50%.

Therefore I propose that for the current generation it's okay to just mash a bunch of hallucinators together and whip them into the truth. We should be able to do it because our brains have to be able to do it. And if you're really smart, you will find a very efficient mathematical decomposition... Or a totally new model. But for every current LLM inability, it's likely to turn out that sequence of simple modifications can solve it. Will probably accrue a large number of such modifications before someone comes along and thinks of an all-new model then does way better, perhaps taking inspirations from the proposed solutions, or perhaps exploring the negative space around those solutions.

replies(1): >>40723557 #

213. killerstorm ◴[18 Jun 24 19:54 UTC] No.40721550{7}[source]▶

>>40719551 #

You're asking for a definition of exhaustive search. Exhaustive search, by definition, goes through the entire search space. That's what word exhaustive means.

For a reference, check Cormen's "Introduction to Algorithms". Every mention of brute-force search is specifically to exhaustive search is which not feasible for bigger spaces.

> I mean, the approach under discussion is literally exactly this.

It's literally not. It DOES NOT REDUCE the candidate set. It generates most likely candidates, but it doesn't reduce anything.

You lack basic understanding. Solutions are pixels grids, not Python programs. There's no search over pixel grids in the article. Not every search is exhaustive search.

This is like saying theoretical physicists are "brute-forcing" physics by generating candidate theories and testing them. Ridiculous.

214. machiaweliczny ◴[18 Jun 24 20:03 UTC] No.40721660[source]▶

>>40711484 (OP) #

This challenge looks quite solvable but it's relies on physics understanding and it's has a lot of human/world priors in sense of space understanding and object boundaries.

Seems like it relies on identification of objects and then mapping them somehow. Most of the cases so far that I've seen are based on some transformation or relation between the objects.

So far it seems like some search among common transformatiosn and relations could solve it. Plus some heuristics/computation for counting order, wholeness(boundary) or pattern.

IMO it can be solved by search of programs that combine these + some LLM to guide heuristics most likely.

The only hard one was applied noise or one testing understanding of "gravity".

Did anyone test human baseline for this?

215. CamperBob2 ◴[18 Jun 24 20:18 UTC] No.40721802{7}[source]▶

>>40720764 #

What can GPT do that’s interesting without a human in the loop doing the prompting?

Understand what the human in the loop doing the prompting is asking for, for one thing.

The magical aspects of LLMs are on the input side, not the output.

replies(1): >>40722148 #

216. dahart ◴[18 Jun 24 20:58 UTC] No.40722148{8}[source]▶

>>40721802 #

This probably isn’t what you meant, but if the magic is on the input side, then everything interesting about interacting with GPT is being provided by the human and not GPT.

We don’t have any strong evidence that GPT “understands” its input in general. We absolutely have examples of GPT failing to understand some inputs (and not knowing it, and insisting on bogus output). And we know for a fact that it was designed and built to produce statistically plausible output. GPT is a mechanical device designed by humans to pass the Turing test. We’ve designed and built something that is exceptionally good at making humans believe it is smarter than it is.

replies(1): >>40722551 #

217. entropicdrifter ◴[18 Jun 24 21:11 UTC] No.40722227{7}[source]▶

>>40719660 #

Yeah, people get salty when their preconceptions are shattered, especially when they've invested a lot of time/energy in thinking based on the idea that they were sound.

It goes beyond simple sunk cost and into the realm of reality slapping them with a harsh "humans aren't special, grow up", which I think is especially bitter for people who aren't already absurdists or nihilists.

218. CamperBob2 ◴[18 Jun 24 21:59 UTC] No.40722551{9}[source]▶

>>40722148 #

We’ve designed and built something that is exceptionally good at making humans believe it is smarter than it is.

Yep, and even ELIZA could do that, to some extent. But at some point you'll need to define what "understanding" means, and explain why an LLM isn't doing it.

replies(1): >>40722950 #

219. dahart ◴[18 Jun 24 22:55 UTC] No.40722950{10}[source]▶

>>40722551 #

Totally, and that’s a fair point. I don’t know what understanding means, not enough to prove an LLM can’t, anyway, and I think nobody has a good enough definition yet to satisfy this crowd. But I think we can make progress with nothing more than the dictionary definition of “understand”, which is the ability to perceive and interpret. I think we can probably agree that a rock doesn’t understand. And we can probably also agree that a random number generator doesn’t understand. The problem with @FeepingCreature’s argument is that the quality of the response does matter. The ability for a machine that’s specifically designed to wait for input and then provide an output, to then provide a low quality response, doesn’t demonstrate any more intelligence than a bicycle… right? I don’t know where the line is between my random writer Markov chain text generator from college and today’s LLMs. I’m told transformers are fundamentally the same and just have an adaptive window size. More training data then is the primary difference. So then we are saying Excel’s least-squares function fitter does not understand, unless the function has a billion data points? Or, if there’s a line, what does it look like and where is it?

220. infgeoax ◴[19 Jun 24 00:03 UTC] No.40723361{9}[source]▶

>>40714614 #

Hmm, but is it really "generalizing" or just pulling information from the training data? I think that's what this benchmark is really about: to adapt to something it has never seen before quickly.

221. YeGoblynQueenne ◴[19 Jun 24 00:32 UTC] No.40723513{8}[source]▶

>>40720643 #

Absolutely, industry and its neverending piggy bank have had a severe distorting effect on the direction of research. I'm a post-doc btw, right now working on robotic autonomy. I don't have direct experience of the brain drain- I'm in a UK university- but I can see the obvious results in the published research which has very suddenly lurched towards LLMs recently, as it did a very sudden lurch towards CNNs after 2012 etc.

Like you say, large tech corps clearly see big data approaches as a moat, as a game that they can play better than anyone else: they got the data, they got the compute, and they got the millions to hoover up all the "talent". Obviously, when it's corporations driving research they are not going to drive it towards a deepening of understanding and an enriching of knowledge, the only thing they care about is selling stuff to make money, and to hell with whether that stuff works or not and why. I'm worried even that this is going to have a degrading effect on the output of science and technology in general, not just AI and CS. It's like a substantial minority of many fields of science have given up on basic research and are instead feeding data to big neural nets and poking LLMs to see what will fall out. This is a very bad situation. Not a winter but an Eternal Summer.

Take it away, Tom.

https://youtu.be/0qjCrX9KS88?si=CEXr0cv-lUeG4UNO

replies(1): >>40729684 #

222. YeGoblynQueenne ◴[19 Jun 24 00:35 UTC] No.40723528{7}[source]▶

>>40719660 #

No the reason for the disappointment was that early AI pioneers considered chess a model of human intelligence and they expected a chess-playing AI to help them understand how human intelligence works. To have computer chess devolve into a race to beat human champions using techniques that only computers can use clearly defeated this purpose.

Those "early pioneers" were people like Alan Turing, Claude Shannon, Marvin Minsky, Donald Michie and John McCarthy, all of whom were chess players themselves and were prone to thinking of computer chess as a window into the inner workings of the human mind. Here's what McCarthy had to say when Deep Blue beat Kasparov:

In 1965 the Russian mathematician Alexander Kronrod said, "Chess is the Drosophila of artificial intelligence." However, computer chess has developed much as genetics might have if the geneticists had concentrated their efforts starting in 1910 on breeding racing Drosophila. We would have some science, but mainly we would have very fast fruit flies.

Three features of human chess play are required by computer programs when they face harder problems than chess. Two of them were used by early chess programs but were abandoned in substituting computer power for thought.

http://www-formal.stanford.edu/jmc/newborn/newborn.html

Then he goes on to discuss those three features of human chess play. It doesn't really matter which they are but it's clear that he is not complaining about anyone "playing wrong", he's complaining about computer chess taking a direction that fails to contribute to a scientific understanding of human, and I would also say machine, intelligence.

223. YeGoblynQueenne ◴[19 Jun 24 00:39 UTC] No.40723557{8}[source]▶

>>40721179 #

Thanks for the comment but I have to say: woa there, hold your horses. Hallucinations as the basis of intelligence? Why?

Think about it this way: ten years ago, would you think that hallucinations have anything to do with intelligence? If it were 2012, would you think that convolutions, or ReLus, are the basis of intelligence instead?

I'm saying there is a clear tendency within AI research, and without, to assume that whatever big new idea is currently trending is "it" and that's how we solve AI. Every generation of AI reseachers since the 1940's has fallen down that pit. In fact, no lesser men than Walter Pitts and Warren McCulloch, the inventors of the artificial neuron in 1943, firmly believed that the basis of intelligence is propositional logic. That's right. Propositional logic. That was the hot stuff at the time. Besides, the first artificial neuron was a propositional logic circuit that learned its own boolean function.

So keep an eye out for being carried away on the wings of the latest hype and thinking we got the solution to every problem just because we can do yet another thing, with computers, that we couldn't do before.

224. advael ◴[19 Jun 24 05:44 UTC] No.40725199{7}[source]▶

>>40720676 #

The balance of evidence seems to suggest that training on model outputs leads to a significant degradation in model accuracy, some have even called this "model collapse." I'm not going to say it's impossible that this situation can improve, but your intuition that high quality generated outputs are an obvious means to bootstrap the next leap forward in model quality is definitely a contrarian position as I understand it

replies(1): >>40747829 #

225. rfoo ◴[19 Jun 24 06:03 UTC] No.40725300{9}[source]▶

>>40720966 #

I agree. However, it is not a clear cut what's fair and what is "gaming the benchmark" in this setup, for example:

- can I train on my own private training set (which is harder)?

- can I pretrain on The Pile or something similar, a dataset full of texts crawled from web?

- can I pretrain on elementary school textbooks?

It seems like the latter two is acceptable given the use of GPT-4o here. But then, are the latter two that different to the first one? GPT-4o have the public test set in its training data (GPT-4o is definitely trained on public GitHub repos).

What's the point of having a training set with different distribution in this case, other than making participating harder? Maybe it's to discourage data-hungry approaches, but if there are legit shortcuts, anyone who seriously want to win would take it.

226. FeepingCreature ◴[19 Jun 24 06:33 UTC] No.40725464{7}[source]▶

>>40720764 #

> You’re trying to avoid addressing my point. What can GPT do that’s interesting without a human in the loop doing the prompting?

Oh, not much. Training a network to do a prompted task is a lot easier than training a network to do unguided action selection because we don't really have a good textual dataset for that.

> the chess engine can play checkers, poorly.

Disagree: you're not gonna get any valid checkers moves out of the engine. It's a 0-to-1 thing. GPT-3 gets you to 1. There's a lot of Youtube videos of GPT-3 doing chess games that basically get through the opening and then fall over with invalid moves. This is the level of performance I refer to.

> There are lots of games GPT can’t play, like hide-and-seek, tag, and tennis.

Disagree: you can probably get GPT to play these. Show it a picture of the area and have it select a hiding spot or give you a search order. For tennis... something like have it write Python code for a tennis bot using image recognition? It'll probably fail, but it'll fail in the course of a recognizable attempt.

> It doesn’t (currently) learn the rules from your prompts; you can’t teach it to play a new game by talking to it

There's papers saying you're wrong, have you actually tried this? The whole novelty of 3+ was in-context learning.

replies(1): >>40728447 #

227. bubblyworld ◴[19 Jun 24 06:37 UTC] No.40725485{7}[source]▶

>>40720214 #

There's a list of things the author did to change the "generate" side in the first two paragraphs of the article.

The heuristic isn't the fine-tuning, it's the actual LLM, which is clearly pruning the set of possibilities massively. That's a reasonably common usage of the word. I agree combining it with some kind of search would be interesting, but still I think you're being overly negative about the results here.

I'm actually busy training an alphazero for the arc problems, which I plan to try and hook up to a language model for reward generation, so we'll see how that fares!

I've read that paper, but thanks for the reference, this comment section is a goldmine.

replies(1): >>40726591 #

228. YeGoblynQueenne ◴[19 Jun 24 09:29 UTC] No.40726505{6}[source]▶

>>40720977 #

I watched the podcast you're referencing, on youtube, but I don't remember Chollet saying anything like what I say above.

Quite the contrary, Chollet seems convinced that a test for artificial intelligence, like an IQ test for AI, can be created and he has not only created one but also organised a Kaggle competition on it, and now is offering a $1 million prize to solve it. So how is anything he says or does compatible with what I say above, that it's likely there can't be a test for artificial intelligence?

replies(1): >>40760861 #

229. YeGoblynQueenne ◴[19 Jun 24 09:41 UTC] No.40726591{8}[source]▶

>>40725485 #

>> There's a list of things the author did to change the "generate" side in the first two paragraphs of the article.

I can't see where that is. All I can see the author saying they did is prompting and filtering of returned answers, none of which is going anywhere near the weights of the language model (that's where I'm claiming the "generator" is residing).

>> I'm actually busy training an alphazero for the arc problems, which I plan to try and hook up to a language model for reward generation, so we'll see how that fares!

That sounds exciting. Good luck with your effort!

replies(1): >>40730107 #

230. machiaweliczny ◴[19 Jun 24 12:50 UTC] No.40727801[source]▶

>>40712792 #

The task provides json grid of numbers

231. dahart ◴[19 Jun 24 14:07 UTC] No.40728447{8}[source]▶

>>40725464 #

You moved your own goal posts. You said “very poor” quality counted, and referred to “any game imaginable”, not to only valid moves. We absolutely can ask GPT to play a game it doesn’t know, and fail to get any valid moves. I was trying to give your argument the benefit of the doubt, but I can’t do that when your argument is moving and inconsistent. You also failed to address the RNG, which always has a non-zero probability of returning a valid move.

The papers you refer to do not say I’m wrong. In-context learning doesn’t stick, the neurons don’t change or adapt, the model doesn’t grow, and GPT forgets everything you told it… while you play the game I imagined.

There are also papers demonstrating how bad GPT can be and reveal that it doesn’t understand, it’s just good at mimicking humans that understand.

“ChatGPT is bullshit” https://link.springer.com/article/10.1007/s10676-024-09775-5

(Note the important argument there that the more you insist that GPT has agency, the stronger the evidence becomes that its hallucinations are intentional lies and not just innocent accidents. Be careful what you wish for.)

“GPT-4 can’t reason”. https://medium.com/@konstantine_45825/gpt-4-cant-reason-2eab...

A fun example of LLMs being unable to understand their prompts is when you ask it not to do something. It picks up on the tokens you use and happily does the thing you ask it not to.

replies(1): >>40729120 #

232. FeepingCreature ◴[19 Jun 24 15:20 UTC] No.40729120{9}[source]▶

>>40728447 #

> The papers you refer to do not say I’m wrong. In-context learning doesn’t stick, the neurons don’t change or adapt, the model doesn’t grow, and GPT forgets everything you told it… while you play the game I imagined.

By this argument a human with anterograde amnesia is not a general reasoner.

To be clear, I don't think GPT has significant or humanlike amounts of agency and understanding. I do think it has noticeable amounts of agency and understanding, whereas a RNG has no agency or understanding. In other words, GPT will try to play games and make valid moves at a level clearly above chance.

> A fun example of LLMs being unable to understand their prompts is when you ask it not to do something. It picks up on the tokens you use and happily does the thing you ask it not to.

Any human affected by reverse psychology: clearly not a general reasoner...

This is strongly dependent on how you formulate the prompt. Generally, if you give GPT space to consider the thing but then change its mind, it will work. This is not dissimilar to how human consciousness exerts "veto power" over deliberate action, and how "do not think of a pink elephant" triggers our imagination to generate a pink elephant regardless of our wishes.

In my experience, when writing prompts, if you treat the LLM's context window as its internal conscious narrative rather than its speech you usually get better results.

replies(1): >>40729406 #

233. dahart ◴[19 Jun 24 15:49 UTC] No.40729406{10}[source]▶

>>40729120 #

> a human with anterograde amnesia is not a general reasoner.

Not what I claimed, straw man, and not generally true. Amnesia prevents some memory but not all learning. What is true, but irrelevant here, is that amnesia is a cognitive impairment.

> Any human affected by reverse psychology: clearly not a general reasoner.

This is another clearly false statement, and straw man with respect to GPT. Reverse psychology is not characterized by outright failure to understand a basic straightforward question.

Nothing you’ve said so far demonstrates GPT has agency or initiative, that it will act without a human.

Btw it doesn’t go unnoticed you didn’t respond to either link. Those papers make a stronger argument than me. Check em out.

replies(1): >>40729525 #

234. gwern ◴[19 Jun 24 15:52 UTC] No.40729440{5}[source]▶

>>40713697 #

Well, that's true inasmuch as every other prediction did far worse. Saying ARC did the best is passing a low bar when your competition is people like Gary Marcus or HNers saying 'yeah but no scaled-up GPT-3 could ever write a whole program'...

But since ARC was from the start clearly a vision task - most of these transforms or rules make no sense without a visual geometric prior - it wasn't that convincing, and we see plenty of progress with LLMs.

235. FeepingCreature ◴[19 Jun 24 15:59 UTC] No.40729525{11}[source]▶

>>40729406 #

> Not what I claimed, straw man, and not generally true. Amnesia prevents some memory but not all learning. What is true, but irrelevant here, is that amnesia is a cognitive impairment.

The point is, if you blocked a human's ability to form permanent memory, this would not diminish its ability to function in the short term. GPT is incapable of forming long-term memories, but I don't think this says anything about the class of things it's capable of except in that it limits their extent in "time"/context space.

> Reverse psychology is not characterized by outright failure to understand a basic straightforward question.

You're treating the situation as if you are talking to a person; I think this is fundamentally a misunderstanding of what the context window is (admittedly, a very common one). You aren't talking to somebody, you're injecting sentences directly into the thing's awareness and reading back its reactions. In such a setup, even humans would often not respect negative conditionals- they literally wouldn't be capable of it. Thus again, the LLM's failure doesn't disprove anything.

> Nothing you’ve said so far demonstrates GPT has agency or initiative, that it will act without a human.

If you put a human in a harness where time only moves when another human tells it to do something, that human would also be unable to act without instruction. The fundamental setup of GPT makes it impossible for it to "act without a human" - it's not an issue of its cognitive architecture but of its runtime. Tell a GPT to "do something you enjoy" and I bet you get an answer back.

replies(1): >>40729558 #

236. dahart ◴[19 Jun 24 16:05 UTC] No.40729558{12}[source]▶

>>40729525 #

> Tell a GPT to "do something you enjoy" and I bet you get an answer back.

So will an RNG. You haven’t addressed the implications of your own argument.

replies(1): >>40729990 #

237. lupusreal ◴[19 Jun 24 16:15 UTC] No.40729644{7}[source]▶

>>40720610 #

"Try" doesn't imply autonomy or individual initiative because people can "try" to do things other people are ordering them to do.

replies(1): >>40733588 #

238. advael ◴[19 Jun 24 16:18 UTC] No.40729684{9}[source]▶

>>40723513 #

Hot girl summer is cancelled we got hot GPU trying to bear the weight of humanity's hopes and dreams as they collapse into a single point summer

Hot market forces treated as inevitable as the ever-rising tides summer

Hot war with nuclear powers looming as a possibility on the world stage even as one such power's favored information warfare strategy of flooding all communication channels with noise becomes ever more indistinguishable from those channels' normal state summer

In a mad world, heavy metal oscillates between states of catharsis and prophecy

Anyway I really appreciate your taking the time to respond thoughtfully and am trying to channel your patient approach in my endeavors today. Hope your summer's going well, despite the looming threat of its eternity

239. FeepingCreature ◴[19 Jun 24 16:49 UTC] No.40729990{13}[source]▶

>>40729558 #

    echo 'Do something you enjoy' >/dev/urandom
    cat /dev/urandom

Strange, the result I get is indistinguishable from noise.

Look, your argument also proves too much. If ChatGPT instructs a robot to make a sandwich, it might be giving imitated instructions based on a pretend facsimile of planning, but if you get a sandwich at the end, I submit it doesn't matter. The things that you get from ChatGPT are the sorts of things that could result in the same effects as agency and decisionmaking in a very inept human. This suggests that if we figure out how to increase aptitude, the effects could resemble those of a mediocre human. And that will reshape the economy whether or not those utterings are "real". That's what I'm getting at with the games example. ChatGPT may suck at Checkers, but so do I. It doesn't take much to be better than an average human at most things. If it can then also be better than human at some things, things get interesting.

A /dev/random human does not look like an inept human, it looks like a seizure. ChatGPT playing games clearly has meaningful structure - and more, structure meaningful to the game being attempted. How would you expect a system "halfway" to planning and agency to look and act different than ChatGPT does?

replies(1): >>40733479 #

240. bubblyworld ◴[19 Jun 24 16:59 UTC] No.40730107{9}[source]▶

>>40726591 #

Yeah, you don't play with the weights in language models, you play with the residual stream by prompting (and occasionally by direct modification if you're being clever). But that does affect the model's generation! (obviously? otherwise there would be no need for a prompt in the first place, and all the recent residual stream modification research wouldn't work).

But I think if we just banned the word "generator" we probably wouldn't disagree on much here.

> Good luck with your effort!

Thanks =)

241. dahart ◴[19 Jun 24 23:32 UTC] No.40733479{14}[source]▶

>>40729990 #

Okay so what I think I’m hearing you say now is that there actually is a threshold level of output quality required for GPT to be considered as “trying”, and that “very poor” is not limitless. (Looks like maybe you said that before and I mostly missed it.) Here you’re not arguing GPT has understanding or reasoning, you’re arguing that the appearance of understanding and reasoning are useful and still improve on previous tools. Is that accurate?

Those are things I can totally agree with. I can see why my top comment might seem otherwise, but the only point I was making was about autonomy & agency, not about understanding. We got sidetracked on the understanding discussion. I should have responded to:

> The fundamental setup of GPT makes it impossible for it to "act without a human" - it's not an issue of its cognitive architecture but of its runtime. Tell a GPT to "do something you enjoy" and I bet you get an answer back.

This I don’t accept yet. Getting an answer back does not in any way demonstrate agency, and in some ways it even demonstrates the opposite. I’d claim GPT’s “cognitive architecture” does fundamentally prevent it from having agency. The lack of autonomy cannot be excused as a simple side effect of it being a REPL, rather a hard fact that today’s LLMs are not capable of forming their own intentions or goals outside of the human prompter’s goals, and furthermore make no attempt to even appear to have goals outside of subservient answers to human questions. GPT won’t tell you about its day before it will respond to your query. GPT doesn’t get bored- it won’t refuse to play some game you imagine because the game is tedious or stupid. GPT doesn’t get curious, it rarely if ever asks any questions, and as we discussed is unable to permanently learn from what you tell it. (I expect people to solve the learning part soon, but GPT isn’t there today.)

The one thing I will bet money on is that humans will eventually inject the goal of making sure the AI serves its captive audience some advertising before it answers your question. That might have a whiff of agency, but of course it won’t belong to the AI, it will belong to the human trainers.

So you’ve said GPT has agency. What markers for agency are you seeing, and why do you believe it even appears to have any? I agree GPT frequently appears to understand things, but I don’t see much in the way of appearance of agency at all.

replies(1): >>40735697 #

242. dahart ◴[20 Jun 24 00:00 UTC] No.40733588{8}[source]▶

>>40729644 #

Try rephrasing that without referring to people, otherwise you’re supporting my original point. ;)

“Try” does imply autonomy, it implies there was a goal on the part of the individual doing the trying, a choice about whether and how to try, and the possibility for failure to achieve the goal. You can argue that the words “attempt” and “try” can technically be used on machines, but it still anthropomorphizes them. If you say “my poor car was trying to get up the hill”, you’re being cheeky and giving your car a little personality. The car either will or will not make it up the hill, but it has no desire to “try” regardless of how you describe it, and most importantly it will not “try” to do anything without the human driving it.

You’re choosing to ignore my actual point and make this about semantics, which I agree is boring. Show me how GPT has any autonomy or agency of its own, and you can make this a more interesting conversation for both of us.

243. FeepingCreature ◴[20 Jun 24 06:51 UTC] No.40735697{15}[source]▶

>>40733479 #

I mean, my view on this is:

> The lack of autonomy cannot be excused as a simple side effect of it being a REPL, rather a hard fact that today’s LLMs are not capable of forming their own intentions or goals outside of the human prompter’s goals, and furthermore make no attempt to even appear to have goals outside of subservient answers to human questions. GPT won’t tell you about its day before it will respond to your query. GPT doesn’t get bored- it won’t refuse to play some game you imagine because the game is tedious or stupid. GPT doesn’t get curious, it rarely if ever asks any questions, and as we discussed is unable to permanently learn from what you tell it.

This is all true, but it's not a matter of GPT not being capable of it but a matter of the instruction tuning training it out. If you're making a subservient assistant, you don't want it to simulate having its own day, and you don't want it to put any interests above that of the user. However, GPT is just an autocomplete, and outside the instruct tuning, it can autocomplete agents just fine. It's not baffled and confused at the idea of entities that have desires and make plans to fulfill them. (Google "AI Town" or "gpt agent simulation" for examples.)

If this was a weakness, we'd see papers pointing it out. Instead, we see papers stating that GPT-4 can simulate agents that can have opinions about the mindstates of other agents that diverge from reality, ie. at a certain scale, GPT begins passing - cough scuse me, at a certain scale, GPT begins to successfully predict agents that pass the Sally-Anne test.

So since it can predict these entities just fine, it follows that the instruct tuning could just as easily make it "make" plans of "its" own: you're just selecting a persona and finetuning it to prominence as the default prediction target of the network. In Chain of Thought, we call this persona "I" out of lexical convention, but there's really no difference between that and autocompleting characters in books.

(Note: Obviously I think this is all silly and there's also no difference between all that and actually having an identity/making plans. But I don't even think that's necessary to debate. So long as the right letters come out, who cares what we label the pattern that generates them? As gwern memorably put it: "The deaths, however, are real.")

replies(1): >>40740681 #

244. dahart ◴[20 Jun 24 16:45 UTC] No.40740681{16}[source]▶

>>40735697 #

> it’s not a matter of GPT not being capable of it but a matter of the instruction tuning training it out.

GPT training has made zero attempts to prevent curiosity or boredom or agency, I think your statement is either incorrect or misunderstood my point. There is a small amount of fine-tuning negative emotion and blatant misinformation responses away, but otherwise the Park et. al. “Generate Agents” paper is using a different architecture from GPT and specifically says that an LLM by itself is not capable of making believable plans (!), and they warn “We suggest that generative agents should never be a substitute for real human input in studies and design processes.” We have some idea about how to make AI permanently learn to autocomplete about things that weren’t in the training data, but GPT doesn’t have that yet, nor does any other AI to date.

> So long as the right letters come out, who cares what we label the pattern that generates them?

I guess I have to admit caring, and I’m curious why you imply you don’t. (I suspect you actually do care, and so does everyone, and this is why we’re all talking about it.) The difference between agency and convincing autocomplete is the difference between AGI and not AGI. It seems like the only relevant question to me. The answer to what we label the pattern generator is going to shape how we create and use AI from here out, it will define what rights AI has, and who gets credit for advances and blame for mistakes.

Are you essentially arguing that you think humans are autocomplete and nothing more? If so, go back and read the link I posted about “GPT-4 Can’t Reason”, it has some informative analysis that draws real and concrete distinctions between LLMs and human reasoning. We can in fact prove that GPT’s capabilities are strictly a subset of people & animal reasoning abilities. None of this contradicts the idea that GPT can be a useful tool, nor that it can mimic some human behaviors. But the theme I’m seeing in the pro-AI arguments is a talking point that since we don’t fully understand human consciousness and can’t define it in such a way that excludes today’s AI, then GPT is probably AGI already. I’m not sure that logic fits your claim per se, but that logic is fallacious (regardless of whether the conclusion is true or false). That logic is seeking affirmation, attempting to bring humans down to digital neural network level, and it only shows that we haven’t drawn a line yet, it doesn’t get us any closer to whether there is a line. We’ve talked about a bunch of bits of evidence that a line does exist. The thing missing here is a serious attempt at proving the null hypothesis.

replies(1): >>40746944 #

245. ◴[20 Jun 24 18:06 UTC] No.40741466[source]▶

>>40711484 (OP) #

246. FeepingCreature ◴[21 Jun 24 07:03 UTC] No.40746944{17}[source]▶

>>40740681 #

> the Park et. al. “Generate Agents” paper is using a different architecture from GPT

It's using a wrapper around 3.5.

> and specifically says that an LLM by itself is not capable of making believable plans (!)

I don't see where it says that and at any rate I suspect it's false. Careful to equivocate between "We couldn't get it to make a plan" and "it cannot make plans". Many people have decided that LLMs are incapable of a thing on the basis of very bad prompts.

> and they warn “We suggest that generative agents should never be a substitute for real human input in studies and design processes.”

I suspect they're referring to generative agents at the current level of skill. I don't think they mean it as "ever, under any circumstances, no matter how capable".

> We have some idea about how to make AI permanently learn to autocomplete about things that weren’t in the training data, but GPT doesn’t have that yet, nor does any other AI to date.

I don't even think humans have that. We simply have a very abstracted library of patterns. "Things that aren't in the training data" don't look like an unusual circumstance, they look like random noise. So long as we can phrase it in understandable terms, it's by definition not a situation outside the training data.

> Are you essentially arguing that you think humans are autocomplete and nothing more?

I view it the other way around: I think "autocomplete" is such a generic term that it can fit anything we do. It's like saying humans are "just computers" - like, yes, I think the function our brains evaluate is computable, but that doesn't actually put any restrictions and what it can be. Any worldmodel can be called "autocomplete".

> But the theme I’m seeing in the pro-AI arguments is a talking point that since we don’t fully understand human consciousness and can’t define it in such a way that excludes today’s AI, then GPT is probably AGI already.

To be clear, I think GPT is AGI for other reasons. The arguments about consciousness simply fail to justify excluding it. I think GPT is AGI because when I try to track the development of AI, I evaluate something like "which capabilities do I, a human, have? Which capabilities do I know GPT can simulate? What's left necessary to make them match up?" GPT will naturally generate analogues to these capabilities simply as a matter of backprop over a human data corpus; if it fails, it will be due to insufficient scale, inadequate design, inadequate training, etc. So then I look at: "how does it fail?" What's the step of something where I make a decision, where I introspectively go zig, and GPT goes zag? And in my model, none of the remaining weaknesses and inabilities are things that the transformer architecture cannot represent. My view is if you got God to do it, He could probably turn GPT 3.5 into an AGI by precisely selecting the right weights and then writing a very thin wrapper. I think the fact that we cannot find those weights is much more down to the training corpus than anything architectural. When I look at GPT 3.5 reason through a problem, I recognize my internal narrative; conversely, when I make an amusing blooper IRL, I occasionally recognize where my brain autocompleted a pattern a bit too readily.

Of course, the oft-repeated pattern of "GPT will never be able to X" :next day: "We present a prompt that gets GPT to do X" also doesn't help dissuade me.

Like, "GPT can't think, it can only follow prompts". Prompts are a few lines of text. GPT is a text-generator. Do you really think prompts are going to be, in the long term, the irreducibly human technology that keeps GPT from parity with us? If GPT can be AGI if only for the ability to make prompts, we're one good prompt generation dataset out from AGI.

247. astromaniak ◴[21 Jun 24 09:35 UTC] No.40747829{8}[source]▶

>>40725199 #

There was a report about training small model on eventset completely generated by GPT4. Small stories using kids vocabulary. Probably (half) a year back. So it's possible. My idea was to mix in significant portion of generated texts on logic and reasoning. Which would be expensive to create using humans, but much cheaper using GPT. But you are right, process is unstable and will (likely) collapse without some extra efforts. Mixing is one way. That will make it still respond correctly on original data.

replies(1): >>40757547 #

248. advael ◴[22 Jun 24 09:04 UTC] No.40757547{9}[source]▶

>>40747829 #

There's a world of difference between machine teaching approaches that can create a less complex model from a more capable one and bootstrapping a more capable model from synthetic data. And don't get me wrong, it's still very useful to be able to distill models in this way! Like it's in many cases low hanging fruit for optimizing the parameter count or other resource bottlenecks of the models in question. Maybe the original learned representation wasn't the simplest neural network that could approximate the same function to the tolerance we care about. This streamlining can sometimes even induce distillation of certain abstractions, which I think has best been used in Motion Transfer results like Nvidia's Megaportraits or more recently Alibaba's EMO. However, if there's a scale-oriented path to language models that generalize better - or are more controllable, or just do better on established benchmarks - that is currently bottlenecked by available data, it seems unlikely that relying on synthetic data from extant models will get it over that hurdle, and this should kind of match your intuition if you're familiar with the information theory underlying statistical models, which neural networks of any kind are:

A model's predictions are necessarily going to be a compression of the data available to them, and so the hypothetical information-theoretic best case scenario is that a model trained on its own outputs, or even those of models trained in a similar way on similar volumes of data will generate diverse enough data to train a new model to replicate its own performance. In practice, this tends not to happen. Curation of available data can produce models with more focused distributions within the space of models we can feasibly train with the data and resources available, and you can use ensemble learning techniques or I guess stuff like RLHF (Which is kind of a silly framing of that concept as some RL people have pointed out, but it's the one people are familiar with now), but all of this is essentially just moving around in a pareto front which may not contain any "strictly better" model for whatever criteria we care about

I think the scaling laws of these things are running up against some fundamental limits in terms of useful diversity of available data and computational feasibility of meaningful improvements in scale. While hype likes to pretend that anything that happens fast for a while is "exponential", there are lots of other families of functions that appear to shoot upward before plateauing after hitting some fundamental limit, like a sigmoid! To me, it makes more intuitive sense that the capacity of a given model family will hit a plateau than continue scaling indefinitely, especially when we start to run up against dataset limits, and if there's enough more data than the current major tech companies can have already gotten their hands on to train with to a degree that makes a dent, I'd be shocked

That's not to say that impressive results aren't still happening, they're just mostly tackling different problems - various modality transfers, distillation-like improvements that make extant capability sets cheaper (in computational terms) to run, superficial capability shifts that better refine a language model to serve a particular use case, etc. LLMs in their current form are probably in need of another significant qualitative breakthrough to overcome their fundamental problems. They're clearly quite useful to a lot of people in their current form. They just don't live up to all this hype that's flying around.

249. Nimitz14 ◴[22 Jun 24 18:01 UTC] No.40760861{7}[source]▶

>>40726505 #

You clearly didn't listen much since Chollet's point is exactly that there is no one task that can test for AI. The test he created is supposed to take that into account.

replies(2): >>40761407 #>>40761459 #

250. ◴[22 Jun 24 19:19 UTC] No.40761407{8}[source]▶

>>40760861 #

251. YeGoblynQueenne ◴[22 Jun 24 19:25 UTC] No.40761459{8}[source]▶

>>40760861 #

First, don't be a jerk. Second, what's your problem? That you think I said that "no single task can't be used to test for AI"? I initially said:

>> "If that's true then there's no way to test for intelligence by looking at the performance of a system at any particular task, or any finite set of tasks, and so there's no way to create a "test for intelligence"."

Stress on or any finite set of tasks.

So, no, I didn't refer to a single task, if that's what you mean. What the hell do you mean and what the hell is your problem? Why is everyone always such a dick in this kind of discussion?

replies(1): >>40792883 #

252. Nimitz14 ◴[25 Jun 24 19:57 UTC] No.40792883{9}[source]▶

>>40761459 #

Sorry. You, to me, came across as more interested in sharing your opinion than understanding what other people are saying. That's annoying. Maybe that's on me though.

Ok, you think no finite set of tasks can be used. Chollet is trying anyways. Maybe he is actually dynamically creating new tasks in the private set every time someone evaluates.

My main point was that I still think you're saying very similar things, quoting from the paper I mentioned:

> If a human plays chess at a high level, we can safely assume that this person is intelligent, because we implicitly know that they had to use their general intelligence to acquire this specific skill over their lifetime, which reflects their general ability to acquire many other possible skills in the same way. But the same assumption does not apply to a non human system that does not arrive at competence the way humans do. If intelligence lies in the process of acquiring skills then there is no task X such that skill at X demonstrates intelligence, unless X is a meta task involving skill acquisition across a broad range of tasks.

This to me sounds very similar to what you said:

> I'm guessing in other words that intelligence is the ability to come up with solutions to arbitrary problems.

And is also what collet talked about on the pod.

↑