Recent AI model progress feels mostly like bullshit

(www.lesswrong.com)

579 points paulpauper | 1 comments | 06 Apr 25 18:01 UTC

InkCanon ◴[06 Apr 25 20:03 UTC] No.43604503[source]▶

The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc. performance on IMO questions. This strongly suggests AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.

replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #

billforsternz ◴[07 Apr 25 03:12 UTC] No.43607255[source]▶

>>43604503 #

I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages:

1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres]
2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres]
3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000]
4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, you should have been reducing the number!]

So 1), 2) and 3) were out by 1, 1 and 3 orders of magnitude respectively (the errors partially cancelled out) and 4) was nonsensical.

This little experiment made me skeptical about the state of the art of AI. I have seen much AI output that is extraordinary, so it's funny how one serious fail can impact my point of view so dramatically.
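For anyone who wants to redo the estimate above, here is a minimal Python sketch. The cabin and golf ball volumes are the rough figures from the comment; the packing fraction (0.64, roughly random close packing of spheres) and the share of the cabin left free by seats and fixtures (0.7) are assumptions added for illustration, not numbers from the original answer. The helper names are hypothetical.

```python
import math

# Rough figures from the comment (order-of-magnitude guesses).
CABIN_VOLUME_M3 = 300.0        # about 4 x 2 x 40 metres
GOLF_BALL_VOLUME_M3 = 40e-6    # about 40 cc

# Assumptions added for this sketch, not taken from the comment.
PACKING_FRACTION = 0.64        # roughly random close packing of spheres
USABLE_FRACTION = 0.7          # share of cabin volume not taken by seats etc.


def estimate_golf_balls(cabin_m3, ball_m3, packing=1.0, usable=1.0):
    """Fermi estimate: usable, packable volume divided by one ball's volume."""
    return cabin_m3 * usable * packing / ball_m3


def orders_of_magnitude_off(claimed, reference):
    """How many powers of ten separate a claimed value from a reference."""
    return abs(round(math.log10(claimed / reference)))


naive = estimate_golf_balls(CABIN_VOLUME_M3, GOLF_BALL_VOLUME_M3)
adjusted = estimate_golf_balls(
    CABIN_VOLUME_M3, GOLF_BALL_VOLUME_M3, PACKING_FRACTION, USABLE_FRACTION
)
print(f"naive upper bound: {naive:,.0f}")
print(f"after adjustments: {adjusted:,.0f}")

# The three errors called out in the comment.
print(orders_of_magnitude_off(3000, 300))                # cabin volume
print(orders_of_magnitude_off(0.000004, 0.00004))        # golf ball volume
print(orders_of_magnitude_off(750_000, 750_000_000))     # the division
```

Both adjustment factors are below 1, so the adjusted figure can only be smaller than the naive one, which is the point of the objection to stage 4.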

replies(10): >>43607836 #>>43607857 #>>43607910 #>>43608930 #>>43610117 #>>43610390 #>>43611692 #>>43612201 #>>43612324 #>>43612398 #

CivBase ◴[07 Apr 25 14:51 UTC] No.43612201[source]▶

>>43607255 #

I just asked my company-approved AI chatbot the same question.

It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.

Its final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?

It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowledge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.

When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.

Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).

I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.
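The divisions and the last multiplication above are easy to re-check. A minimal sketch using Python's `decimal` module, so that binary floating point doesn't muddy the comparison; the helper names are hypothetical, and the "chatbot" values are the ones quoted in this comment:

```python
from decimal import Decimal

# Golf ball volume quoted by the chatbot, in cubic metres.
BALL_VOLUME_M3 = Decimal("0.00004068")


def exact_quotient(cabin_volume_m3):
    """Cabin volume divided by ball volume, rounded to a whole ball."""
    return round(Decimal(cabin_volume_m3) / BALL_VOLUME_M3)


def exact_product(count, packing_density):
    """Ball count scaled by a packing density given as a decimal string."""
    return round(Decimal(count) * Decimal(packing_density))


# (label, exact result, value the chatbot reported)
checks = [
    ("1000 / 0.00004068", exact_quotient(1000), 24_582_115),
    ("700 / 0.00004068", exact_quotient(700), 17_201_480),
    ("17,201,480 x 0.74", exact_product(17_201_480, "0.74"), 12_728_096),
]

for label, exact, reported in checks:
    diff = reported - exact
    print(f"{label}: exact {exact:,}, chatbot {reported:,}, difference {diff:+,}")
```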

Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times despite the calculations being simple multiplication and division. Even if it might not matter in the context of filling an airplane cabin with golf balls, it does not inspire trust for more serious questions.

replies(1): >>43617479 #

billforsternz ◴[08 Apr 25 01:17 UTC] No.43617479[source]▶

>>43612201 #

> Its final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?

1000 ÷ 0.00004068 = 25,000,000. I think this is an important point that's increasingly widely misunderstood. All those extra digits you show are just meaningless noise and should be ruthlessly eliminated. If 1000 cubic metres in this context really meant 1000.000 cubic metres, then by all means show maybe the four digits of precision you get from the golf ball (but I am more inclined to think 1000 cubic metres is actually the roughest of rough approximations, with just one digit of precision).

In other words, I don't fault the AI for mismatching one set of meaninglessly precise digits for another, but I do fault it for using meaninglessly precise digits in the first place.
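The rounding I mean can be sketched in a few lines of Python. `round_sig` is a hypothetical helper written for this comment, not a standard library function, and two significant figures is simply the precision that reproduces the 25,000,000 above:

```python
import math


def round_sig(x, sig):
    """Round x to `sig` significant figures."""
    if x == 0:
        return 0.0
    # Position of the leading digit, e.g. 7 for 24,582,104.
    magnitude = math.floor(math.log10(abs(x)))
    return round(x, sig - 1 - magnitude)


raw = 1000 / 0.00004068
print(f"{raw:,.0f}")                 # all the digits, mostly noise
print(f"{round_sig(raw, 2):,.0f}")   # what the inputs actually justify
```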

replies(1): >>43617644 #

CivBase ◴[08 Apr 25 01:50 UTC] No.43617644{3}[source]▶

>>43617479 #

I agree those digits are not significant in the context of the question asked. But if the AI is going to use that level of precision in the answer, I expect it to be correct.

replies(1): >>43639574 #

billforsternz ◴[10 Apr 25 00:52 UTC] No.43639574{4}[source]▶

>>43617644 #

Fair enough, I agree: simple arithmetic calculations shouldn't generate mysterious answers.
