

Francois Chollet is leaving Google

(developers.googleblog.com)
377 points by xnx | 15 comments
1. max_ ◴[] No.42131308[source]
I wonder what he will be working on?

Maybe he figured out a model that beats ARC-AGI by 85%?

replies(1): >>42131784 #
2. trott ◴[] No.42131784[source]
> Maybe he figured out a model that beats ARC-AGI by 85%?

People have, I think.

One of the published approaches (BARC) uses GPT-4o to generate a lot more training data.

The approach is scaling really well so far [1], and whether you expect linear or exponential scaling [2], the 85% threshold can be reached, using the "transduction" model alone, after generating under 2 million tasks ($20K in OpenAI credits).

Perhaps for 2025, the organizers will redesign ARC-AGI to be more resistant to this sort of approach, somehow.

---

[1] https://www.kaggle.com/competitions/arc-prize-2024/discussio...

[2] If you are "throwing darts at a board", you get exponential scaling (the probability of not hitting bullseye at least once reduces exponentially with the number of throws). If you deliberately design your synthetic dataset to be non-redundant, you might get something akin to linear scaling (until you hit perfect accuracy, of course).
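To make [2] concrete, here is a minimal sketch of the two regimes (the per-skill hit probability p, the saturation point, and the task counts are made-up illustration values, not numbers from BARC or the leaderboard):

    # Coverage of some fixed set of skills as a function of how many
    # synthetic tasks you generate. All constants here are illustrative.
    def coverage_random(n_generated, p=1e-5):
        # "Throwing darts": each generated task covers a given skill
        # independently with probability p, so the chance the skill is hit
        # at least once approaches 1 exponentially in n_generated.
        return 1 - (1 - p) ** n_generated

    def coverage_curated(n_generated, n_needed=1_000_000):
        # Deliberately non-redundant generation: every new task covers
        # something new, so coverage grows roughly linearly until it saturates.
        return min(1.0, n_generated / n_needed)

    for n in (100_000, 400_000, 2_000_000):
        print(n, round(coverage_random(n), 3), round(coverage_curated(n), 3))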

replies(4): >>42131848 #>>42132132 #>>42132502 #>>42132655 #
3. thrw42A8N ◴[] No.42131848[source]
> If you are "throwing darts at a board", you get exponential scaling (the probability of not hitting bullseye reduces exponentially with the number of throws).

Honest question - is that so, and why? I thought you had to calculate the probability of each throw individually, as nothing fundamentally connects the throws together; it's only that, in the long run, there will be a normal distribution of randomness.

replies(1): >>42131877 #
4. trott ◴[] No.42131877{3}[source]
> The probability of not hitting bullseye at least once ...

I added a clarification.
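To spell it out with numbers (p here is just an illustrative per-throw hit probability): the throws are independent, but the probability that none of them hits still shrinks exponentially with the number of throws.

    # Independent throws, hit probability p each (illustrative value).
    # P(no bullseye in n throws) = (1 - p) ** n, which decays exponentially.
    p = 0.01
    for n in (10, 100, 500):
        print(n, (1 - p) ** n)   # ~0.904, ~0.366, ~0.007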

5. fastball ◴[] No.42132132[source]
I like the idea of ARC-AGI and think it was worth a shot. But if someone has already hit the human-level threshold, I think the entire idea can be thrown out.

If the ARC-AGI challenge did not actually follow their expected graph [1], I see no reason to believe that any benchmark can be designed in a way that cannot be gamed. Rather, it seems that the existing SOTA models just weren't well-optimized for that one task.

The only way to measure "AGI" is in however you define the "G". If your model can only do one thing, it is not AGI, and it doesn't really indicate you are any closer to it, even if you very carefully designed your challenge.

[1] https://static.supernotes.app/ai-benchmarks-2.png

replies(3): >>42132191 #>>42132203 #>>42132310 #
6. TheDudeMan ◴[] No.42132191{3}[source]
What you're calling "gamed" could actually be research and progress in general problem solving.
replies(1): >>42132596 #
7. nl ◴[] No.42132203{3}[source]
> The only way to measure "AGI" is in however you define the "G"

"I" isn't usefully defined either.

At least most people agree on "Artificial".

replies(1): >>42133124 #
8. trott ◴[] No.42132310{3}[source]
> But if someone has already hit the human-level threshold

There is some controversy over what the human-level threshold is. A recent and very extensive study measured just 60.2% using Amazon Mechanical Turkers, for the same setup [1].

But the Turkers had no prior experience with the dataset, and were only given 5 tasks each.

Regardless, I believe ARC-AGI should aim for a higher threshold than what average humans achieve, because the ultimate goal of AGI is to supplement or replace high-IQ experts (who tend to do very well on ARC).

---

[1] Table 1 in https://arxiv.org/abs/2409.01374 (2-shot, Evaluation Set)

replies(1): >>42137325 #
9. mxwsn ◴[] No.42132502[source]
My interest was piqued, but the extrapolation in [1] is, uh... not the most convincing. If there were more data points, then sure, maybe.
replies(1): >>42132594 #
10. trott ◴[] No.42132594{3}[source]
The plot was just showing where the solid lines were trending (see prior messages), and that happened to predict the performance at 400k samples (red dot) very well.

An exponential scaling curve would steer a bit more to the right, but it would still cross the 85% mark before 2000k.
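As a rough illustration of that kind of extrapolation (the sample counts and accuracies below are made-up stand-ins, not the actual points from the plot):

    # Fit a linear trend to hypothetical (tasks generated, accuracy) points
    # and see where it crosses 85%. Numbers are illustrative only.
    import numpy as np

    n_tasks  = np.array([100_000, 200_000, 400_000])  # hypothetical
    accuracy = np.array([0.30, 0.38, 0.54])           # hypothetical

    slope, intercept = np.polyfit(n_tasks, accuracy, 1)
    crossing = (0.85 - intercept) / slope
    print(f"linear trend crosses 85% near {crossing:,.0f} generated tasks")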

11. fastball ◴[] No.42132596{4}[source]
Almost by definition it is not. If you are "gaming" a specific benchmark, what you have is not progress in general intelligence. The entire premise of the ARC-AGI challenge was that general problem solving would be required. As noted by the GP, one of the top contenders is BARC which performs well by generating a huge amount of training data for this particular problem. That's not general intelligence, that's gaming.

There is no reason to believe that technique would not work for any particular problem. After all, this problem was the best attempt the (very intelligent) challenge designers could come up with, as evidenced by putting $1m on the line.

replies(1): >>42132696 #
12. TechDebtDevin ◴[] No.42132655[source]
I personally think ARC-AGI will be a forgotten, unimportant benchmark that doesn't indicate anything more than a model's ability to reason, which honestly is just a very small step on the path towards AGI.
13. trott ◴[] No.42132696{5}[source]
> That's not general intelligence, that's gaming.

In fairness, their approach is non-trivial. Simply asking GPT-4o to fantasize more examples wouldn't have worked very well. Instead, they have it fantasize inputs and programs, and then run the programs on the inputs to compute the outputs.

I think it's a great contribution (although I'm surprised they didn't try making an even bigger dataset -- perhaps they ran out of time or funding).
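A minimal sketch of that generate-then-execute idea (ask_llm is a hypothetical stand-in for a GPT-4o call; the real BARC pipeline differs in its prompts, seeding, and filtering):

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("hypothetical stand-in for a GPT-4o API call")

    def make_synthetic_task():
        # 1. Fantasize a transformation rule as runnable Python source.
        program_src = ask_llm("Write a function transform(grid) implementing "
                              "a new ARC-style rule.")
        namespace = {}
        exec(program_src, namespace)
        transform = namespace["transform"]

        # 2. Fantasize input grids for that rule.
        inputs = eval(ask_llm("Give 4 example input grids for that rule, as a "
                              "Python list of grids (lists of lists of ints)."))

        # 3. Compute outputs by executing the program, so the input/output
        #    pairs are consistent by construction rather than LLM-guessed.
        return [(grid, transform(grid)) for grid in inputs]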

14. echelon ◴[] No.42133124{4}[source]
That's the problem with intelligence vs the other things we're doing with deep learning.

Vision models, image models, video models, audio models? Solved. We've understood the physics of optics and audio for over half a century. We've had ray tracers for forever. It's all well understood, and now we're teaching models to understand it.

Intelligence? We can't even describe our own.

15. aithrowawaycomm ◴[] No.42137325{4}[source]
It is scientific malpractice to use Mechanical Turk to establish a human-level baseline for cognitively-demanding tasks, even if you ignore the issue of people outsourcing tasks to ChatGPT. The pay is abysmal and if it seems like the task is purely academic and hence part of a study, there is almost no incentive to put in effort: researchers won't deny payment for a bad answer. Since you get paid either way, there is a strong incentive to quickly give up thinking about a tricky ARC problem and simply guess a solution. (IQ tests in general have this problem: cynicism and laziness are indistinguishable from actual mistakes.)

Note that across all MTurk workers, 790/800 of evaluation tasks were successfully completed. I think 98% is actually a better number for human performance than 60%, as a proxy for "how well would a single human of above-average intelligence perform if they put maximal effort into each question?" It is an overestimate, but 60% is a vast underestimate.