
579 points paulpauper | 2 comments
InkCanon No.43604503
The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60%, etc. on IMO questions. This strongly suggests AI models simply memorize past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.
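For reference, the standard published mitigation is n-gram overlap decontamination; here's a minimal sketch (the 13-token window is typical of what labs have described, not a claim about any specific pipeline):

    # Flag any training document sharing a long token n-gram with a benchmark problem.
    def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def is_contaminated(train_doc: str, test_problems: list[str], n: int = 13) -> bool:
        doc_grams = ngrams(train_doc, n)
        return any(doc_grams & ngrams(problem, n) for problem in test_problems)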
billforsternz No.43607255
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages: 1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres] 2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres] 3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000] 4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, the adjustment should have reduced the number!]

So 1), 2) and 3) were out by 1, 1 and 3 orders of magnitude respectively (the errors partially cancelled out), and 4) was nonsensical.

This little experiment made me skeptical about the state of the art of AI. I have seen much AI output that is extraordinary; it's funny how one serious failure can shift my point of view so dramatically.
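For comparison, the back-of-envelope version with sane inputs (the cabin dimensions and fudge factors below are my rough guesses):

    import math

    # Treat the 737 cabin as a ~40 m x 4 m x 2 m box (rough guess) -> 320 m^3
    cabin_volume = 40 * 4 * 2

    # Golf ball diameter is 42.67 mm -> radius ~0.0213 m -> ~4.07e-5 m^3 (~40 cc)
    ball_volume = (4 / 3) * math.pi * 0.021335 ** 3

    packing_density = 0.64   # random close packing of spheres
    usable_fraction = 0.7    # hand-wavy allowance for seats, galleys, etc.

    balls = cabin_volume * usable_fraction * packing_density / ball_volume
    print(f"{balls:,.0f}")   # ~3,500,000 -- note both adjustments reduce the raw count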

greenmartian No.43607910
Weird thing is, in Google AI Studio all their models—from the state-of-the-art Gemini 2.5 Pro to the lightweight Gemma 2—gave a roughly correct answer. Most even recognised the packing efficiency of spheres.

But Google Search gave me the exact same slop you mentioned. So whatever Search is using, it must be their crappiest, cheapest model. It's nowhere near state of the art.

aurareturn No.43608013
Makes sense that Search has a small, fast, dumb model designed to summarize, not to solve problems. There are nearly 14 billion Google searches per day; a bigger model would need way too much compute.
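Rough arithmetic on why (every number below is an assumption, not a known Google figure):

    # Hand-wavy compute sketch; all inputs are assumptions.
    searches_per_day = 14e9
    ai_answer_fraction = 0.5     # suppose half of queries trigger an AI summary
    tokens_per_answer = 500

    tokens_per_day = searches_per_day * ai_answer_fraction * tokens_per_answer
    print(f"{tokens_per_day:.1e} tokens/day")        # 3.5e+12

    # At a hypothetical $1 per million tokens for a frontier model:
    print(f"${tokens_per_day / 1e6:,.0f} per day")   # $3,500,000 per day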
fire_lake No.43608376
Massive search overlap though - and some questions (like the golf ball puzzle) can be cached for a long time.
summerlight No.43609132
AFAIK about 15% of each day's queries have never been seen before, so it might not be simple to design an effective cache layer for that. Semantic-aware clustering of natural-language queries and projecting them into a cacheable low-rank space is a non-trivial problem. Of course, an LLM can effectively solve that, but then what's the point of using a cache when you need an LLM to cluster the queries...
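A toy sketch of that kind of cache layer (the embed() below is a throwaway trigram hash standing in for a real encoder, and the 0.9 threshold is invented):

    import numpy as np

    def embed(query: str, dim: int = 64) -> np.ndarray:
        # Toy stand-in for a learned embedding model: hash character
        # trigrams into a fixed-size unit vector.
        v = np.zeros(dim)
        q = query.lower()
        for i in range(len(q) - 2):
            v[hash(q[i:i + 3]) % dim] += 1.0
        norm = np.linalg.norm(v)
        return v / norm if norm else v

    class SemanticCache:
        def __init__(self, threshold: float = 0.9):
            self.threshold = threshold           # invented cutoff for "same question"
            self.keys: list[np.ndarray] = []
            self.answers: list[str] = []

        def get(self, query: str) -> str | None:
            if not self.keys:
                return None
            sims = np.stack(self.keys) @ embed(query)   # cosine sim of unit vectors
            best = int(np.argmax(sims))
            return self.answers[best] if sims[best] >= self.threshold else None

        def put(self, query: str, answer: str) -> None:
            self.keys.append(embed(query))
            self.answers.append(answer)

    cache = SemanticCache()
    cache.put("how many golf balls fit in a 737 cabin", "roughly 3 million")
    print(cache.get("How many golf balls fit in a 737 cabin?"))   # near-duplicate hits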
fire_lake No.43619704
Not a search engineer, but wouldn't a cache lookup of a previous LLM result be faster than a conventional free-text search over the indexed websites? Seems like this could save money whilst delivering better results?
summerlight No.43625227
Yes, that's what Google is doing for AI Overviews, IIUC. From what I've seen, it works okay and is improving over time, but it's not close to perfect. The results are stale for developing stories, some bad results stick around for a long time, effectively identical queries return different cached answers, etc.
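The staleness part is essentially a TTL-tuning problem; a minimal sketch (all policy numbers invented):

    import time

    class CachedAnswer:
        def __init__(self, answer: str, ttl_seconds: float):
            self.answer = answer
            self.expires_at = time.time() + ttl_seconds

        def fresh(self) -> bool:
            return time.time() < self.expires_at

    # Invented policy: evergreen trivia lives for a week; anything that looks
    # like a developing story expires in minutes. Misclassify the query type
    # and you get exactly the staleness described above.
    TTL_EVERGREEN = 7 * 24 * 3600
    TTL_NEWSY = 15 * 60

    entry = CachedAnswer("roughly 3 million golf balls", TTL_EVERGREEN)
    print(entry.fresh())   # True -> serve from cache, skip the LLM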