Recent AI model progress feels mostly like bullshit

(www.lesswrong.com)

579 points paulpauper | 1 comments | 06 Apr 25 18:01 UTC | HN request time: 0.203s | source

Show context

InkCanon ◴[06 Apr 25 20:03 UTC] No.43604503[source]▶

The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc performance on IMO questions. This massively suggests AI models simply remember the past results, instead of actually solving these questions. I'm incredibly surprised no one mentions this, but it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc) from train data.

replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #

AIPedant ◴[06 Apr 25 20:52 UTC] No.43604865[source]▶

>>43604503 #

Yes, here's the link: https://arxiv.org/abs/2503.21934v1

Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:

1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)

2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)

I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...

I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!

replies(4): >>43608074 #>>43609801 #>>43610413 #>>43611877 #

otabdeveloper4 ◴[07 Apr 25 10:36 UTC] No.43609801[source]▶

>>43604865 #

Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)

This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general, these things are extremely lame to the proverbial "kids these days".)

replies(2): >>43609850 #>>43628785 #

bambax ◴[07 Apr 25 10:47 UTC] No.43609850[source]▶

>>43609801 #

How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.

replies(2): >>43609943 #>>43612276 #

otabdeveloper4 ◴[07 Apr 25 11:04 UTC] No.43609943[source]▶

>>43609850 #

You just (lol) need to give non-standard problems and demand students to provide reasoning and explanations along with the answer. Yeah, LLMs can "reason" too, but it's obvious when the output comes from an LLM here.

(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)

replies(1): >>43610395 #

itchyjunk ◴[07 Apr 25 12:12 UTC] No.43610395[source]▶

>>43609943 #

Can you provide sample questions that are "LLM proof" ?

replies(3): >>43610624 #>>43611868 #>>43611976 #

jerf ◴[07 Apr 25 14:31 UTC] No.43611976[source]▶

>>43610395 #

The models have moved on past this working reliably, but an example that I found in the early days of LLMs is asking it "Which is heavier, two pounds of iron or a pound of feathers?" You could very easily trick it into giving the answer about how they're both the same, because of the number of training instances of the well-known question about a pound of each that it encountered.

You can still do this to the current models, though it takes more creativity; you can bait it into giving wrong answers if you ask a question that is "close" to a well-known one but is different in an important way that does not manifest as a terribly large English change (or, more precisely, a very large change in the model's vector space).

The downside is that the frontier between what fools the LLMs and what would fool a great deal of the humans in the class too shrinks all the time. Humans do not infinitely carefully parse their input either... as any teacher could tell you! Ye Olde "Read this entire problem before proceeding, {a couple of paragraphs of complicated instruction that will take 45 minutes to perform}, disregard all the previous and simply write 'flower' in the answer space" is an old chestnut that has been fooling humans for a long time, for instance. Given how jailbreaks work on LLMs, LLMs are probably much better at that than humans are, which I suppose shows you can construct problems in the other direction too.

(BRB... off to found a new CAPTCHA company for detecting LLMs based on LLMs being too much better than humans at certain tasks...)

replies(1): >>43612267 #

immibis ◴[07 Apr 25 14:57 UTC] No.43612267[source]▶

>>43611976 #

"Draw a wine glass filled to the brim with wine" worked recently on image generators. They only knew about half-full wine glasses.

If you asked a multimodal system questions about the image it just generated, it would tell you the wine was almost overflowing out of the top of the glass.

But any trick prompt like this is going to start giving expected results once it gets well-known enough.

Late edit: Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.

replies(3): >>43613006 #>>43615618 #>>43618160 #

1. int_19h ◴[08 Apr 25 03:40 UTC] No.43618160[source]▶

>>43612267 #

> Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.

This is still the case. Very few non-reasoning models can solve such variations correctly, even SOTA models. Worse yet, not only they confidently give wrong responses, but they often do so even when specifically told to use CoT, and they continue giving wrong answers in a loop even if you specifically point out where they are wrong.

Reasoning models do much better, though. E.g. QwQ-32b can solve it pretty reliably, although it takes a lot of tokens for it to explore the possibilities. But at least it can fairly consistently tell when it's doing something wrong and then backtrack.

One other example that befuddles even the reasoning models is frying-cubes-in-a-pan and equivalents, e.g. this version from Simple Bench:

> Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute? Pick the most realistic answer option. A) 5 B) 11 C) 0 D) 20

↑