Recent AI model progress feels mostly like bullshit

(www.lesswrong.com)

579 points paulpauper | 1 comments | 06 Apr 25 18:01 UTC | HN request time: 0.336s | source

Show context

InkCanon ◴[06 Apr 25 20:03 UTC] No.43604503[source]▶

The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc performance on IMO questions. This massively suggests AI models simply remember the past results, instead of actually solving these questions. I'm incredibly surprised no one mentions this, but it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc) from train data.

replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #

AIPedant ◴[06 Apr 25 20:52 UTC] No.43604865[source]▶

>>43604503 #

Yes, here's the link: https://arxiv.org/abs/2503.21934v1

Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:

1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)

2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)

I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...

I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!

replies(4): >>43608074 #>>43609801 #>>43610413 #>>43611877 #

otabdeveloper4 ◴[07 Apr 25 10:36 UTC] No.43609801[source]▶

>>43604865 #

Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)

This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general, these things are extremely lame to the proverbial "kids these days".)

replies(2): >>43609850 #>>43628785 #

bambax ◴[07 Apr 25 10:47 UTC] No.43609850[source]▶

>>43609801 #

How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.

replies(2): >>43609943 #>>43612276 #

hyperbovine ◴[07 Apr 25 14:58 UTC] No.43612276[source]▶

>>43609850 #

By giving pen and paper exams and telling your students that the only viable preparation strategy is doing the hw assignments themselves :)

replies(3): >>43613586 #>>43614742 #>>43620870 #

1. econ ◴[07 Apr 25 19:03 UTC] No.43614742[source]▶

>>43612276 #

Or you simply account for it and provide equally challenging tasks adjusted for the tools of the time. Give them access to the best LLMs money can buy.

After all, they will grow up next to these things. They will do the homework today, by the time they graduate the LLM will take their job. There might be human large langage model managers for a while, soon to be replaced by the age of idea men.

↑