
Getting 50% (SoTA) on ARC-AGI with GPT-4o

(redwoodresearch.substack.com)
394 points by tomduncalf | 10 comments
1. whiplash451 No.40715123
The article jumps to the conclusion "Given that current LLMs can perform decently well on ARC-AGI..." after using multiple hand-crafted tricks to get these results, including the admission that "I also did a small amount of iteration on a 100 problem subset of the public test set", which is buried in the middle of the article and not mentioned in the bullet list at the top.

Add to that the close-to-ad-hominem attack on Francois Chollet in the comic at the beginning (Francois never claimed to be a neuro-symbolic believer), and this work does a significant disservice to the community.

replies(4): >>40715887 >>40716039 >>40716432 >>40718813
2. z7 No.40715887
>Francois never claimed to be a neuro-symbolic believer

His response:

"This has been the most promising branch of approaches so far -- leveraging a LLM to help with discrete program search, by using the LLM as a way to sample programs or branching decisions. This is exactly what neurosymbolic AI is, for the record..."

"Deep learning-guided discrete search over program space is the approach I've been advocating, yes... there are many different flavors it could take though. This is one of them (perhaps the simplest one)."

https://x.com/fchollet/status/1802773156341641480
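
For a concrete sense of what "using the LLM as a way to sample programs" means, here is a minimal sketch (illustrative only, not the article's actual code; llm_sample_programs is a hypothetical placeholder for whatever model call you use):

    # Minimal sketch of LLM-guided discrete program search for ARC-style
    # tasks. Illustrative only: llm_sample_programs is a hypothetical
    # stand-in for a real LLM call.
    from typing import List, Optional, Tuple

    Grid = List[List[int]]
    Example = Tuple[Grid, Grid]  # (input grid, expected output grid)

    def llm_sample_programs(examples: List[Example], n: int) -> List[str]:
        """Ask an LLM for n candidate programs (Python source strings),
        each defining transform(grid) -> grid. Placeholder."""
        raise NotImplementedError("call your LLM of choice here")

    def run_program(source: str, grid: Grid) -> Grid:
        """Execute a candidate program; a real system would sandbox this."""
        namespace: dict = {}
        exec(source, namespace)  # candidate must define transform(grid)
        return namespace["transform"](grid)

    def search(examples: List[Example], samples: int = 128) -> Optional[str]:
        """Discrete search over program space with the LLM as the proposal
        distribution: return the first program consistent with all examples."""
        for source in llm_sample_programs(examples, samples):
            try:
                if all(run_program(source, x) == y for x, y in examples):
                    return source  # program explains every training example
            except Exception:
                continue  # skip ill-formed candidates
        return None

The neural part proposes, the discrete part verifies; that division of labor is the whole idea.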

replies(3): >>40715928 >>40718175 >>40718230
3. YeGoblynQueenne No.40715928
That kind of neuro-symbolic AI is a bit like British cuisine: place two different things next to each other on the same plate, like bangers and mash, and call it "a dish".

Nope. This is neurosymbolic AI:

Abductive Knowledge Induction From Raw Data

https://www.doc.ic.ac.uk/~shm/Papers/abdmetarawIJCAI.pdf

That's a symbolic learning engine trained in tandem with a neural net: the symbolic engine learns to label examples for the neural net, and the neural net learns to label examples for the symbolic engine. I call that cooking!

(Full disclosure: the authors of the paper are my thesis advisor and a dear colleague).
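
For a sense of the shape of that loop, a rough sketch (not the paper's actual algorithm; net and symbolic are hypothetical stand-ins for the perception model and the abductive learner, with made-up interfaces):

    # Rough sketch of the tandem loop described above -- not the paper's
    # actual algorithm. `net` (perception model) and `symbolic` (abductive
    # learner) are hypothetical stand-ins with made-up interfaces.
    def train_tandem(raw_inputs, final_labels, net, symbolic, rounds=10):
        """Each side supervises the other: the net labels raw data with
        symbols; the symbolic engine induces a theory, then abduces
        revised symbol labels that the net is retrained on."""
        for _ in range(rounds):
            # 1. Neural net maps raw data to candidate symbols.
            symbols = [net.predict(x) for x in raw_inputs]
            # 2. Symbolic engine induces rules consistent with the labels.
            theory = symbolic.induce(symbols, final_labels)
            # 3. Abduction: revise the symbol labels so the theory
            #    entails the final labels.
            revised = symbolic.abduce(theory, raw_inputs, final_labels)
            # 4. Retrain the net on the revised pseudo-labels.
            net.fit(raw_inputs, revised)
        return net, theory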

4. bogtog No.40716039
For what it's worth, the comic is based on a well-known meme, and the author must've wanted to stick to the format: https://media.licdn.com/dms/image/D4E10AQFryt0thryEeA/image-...
5. killerstorm No.40716432
I think this work is great.

A lot of top researchers claim that obvious deficiencies in LLM training are fundamental flaws of the transformer architecture, since they have an interest in doing new research.

This work shows that temporary issues are just that: temporary. E.g. the LLM is not trained on grid inputs, but it can figure things out after some preprocessing.
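
As a minimal illustration of that kind of preprocessing (assumed for the example; the blog post's actual grid representation differs in its details):

    # Minimal sketch of the kind of preprocessing meant here: render each
    # ARC grid as aligned rows of digits, which an LLM parses far more
    # reliably than raw nested lists. Illustrative only.
    from typing import List

    Grid = List[List[int]]

    def grid_to_text(grid: Grid) -> str:
        """One row per line, cells separated by spaces, so rows and
        columns stay visually aligned in the prompt."""
        return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

    example: Grid = [
        [0, 0, 1],
        [0, 1, 0],
        [1, 0, 0],
    ]
    print(grid_to_text(example))
    # 0 0 1
    # 0 1 0
    # 1 0 0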

replies(1): >>40718119
6. whiplash451 No.40718119
My claim is _not_ that this work is not useful. But however "great" your work is, misrepresenting the steps you took during your experiments and overselling your results is never a valid approach in research.
replies(1): >>40718747
7. No.40718175
8. whiplash451 No.40718230
Indeed. Francois Chollet himself said during his interview with Dwarkesh that he is not against LLMs, and in fact believes that the long-term solution will mix LLMs with something else that has not been discovered yet (his bet is on discrete program search, but he is open to anything else).

Pitting him against LLMs in such a binary fashion is deceptive and unfair.

9. killerstorm No.40718747
This is a blog post, sir. All the details are written down. He's very clear about his methods; it seems you're 1) biased and 2) holding blog posts to too high a standard.
10. kalkin No.40718813
The comic at the beginning paints the "stack more layers" LLM people as clowns, not the neurosymbolic people or, by proxy, Chollet. Yes, it suggests the "stack more layers" approach works anyway, but in a self-deprecating way...

If this article wanted to attack Chollet, it could have made more hay out of another thing that's "hidden in the middle of the article": the note that the solution actually gets 72% on the subset of problems on which humans get ~85%. The fact that the claimed human baseline for ARC-AGI as a whole is based on an easy subset is pretty suspect.