Recent AI model progress feels mostly like bullshit

(www.lesswrong.com)

579 points paulpauper | 1 comments | 06 Apr 25 18:01 UTC


InkCanon ◴[06 Apr 25 20:03 UTC] No.43604503[source]▶

The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60%, etc. on IMO questions. This strongly suggests AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.

replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #

AIPedant ◴[06 Apr 25 20:52 UTC] No.43604865[source]▶

>>43604503 #

Yes, here's the link: https://arxiv.org/abs/2503.21934v1

Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:

1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)

2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)

I dread undergraduate STEM majors using this thing - I gave it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I hadn't known the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...

I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!

replies(4): >>43608074 #>>43609801 #>>43610413 #>>43611877 #

otabdeveloper4 ◴[07 Apr 25 10:36 UTC] No.43609801[source]▶

>>43604865 #

Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)

This effectively makes LLMs useless for education. (It also sours the next generation on LLMs in general; these things are extremely lame to the proverbial "kids these days".)

replies(2): >>43609850 #>>43628785 #

bambax ◴[07 Apr 25 10:47 UTC] No.43609850[source]▶

>>43609801 #

How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.

replies(2): >>43609943 #>>43612276 #

otabdeveloper4 ◴[07 Apr 25 11:04 UTC] No.43609943[source]▶

>>43609850 #

You just (lol) need to give non-standard problems and require students to provide reasoning and explanations along with the answer. Yeah, LLMs can "reason" too, but it's obvious when the output comes from an LLM here.

(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)

replies(1): >>43610395 #

itchyjunk ◴[07 Apr 25 12:12 UTC] No.43610395[source]▶

>>43609943 #

Can you provide sample questions that are "LLM-proof"?

replies(3): >>43610624 #>>43611868 #>>43611976 #

jerf ◴[07 Apr 25 14:31 UTC] No.43611976[source]▶

>>43610395 #

The models have moved on past this working reliably, but an example that I found in the early days of LLMs is asking it "Which is heavier, two pounds of iron or a pound of feathers?" You could very easily trick it into giving the answer about how they're both the same, because of the number of training instances of the well-known question about a pound of each that it encountered.

You can still do this to the current models, though it takes more creativity; you can bait it into giving wrong answers if you ask a question that is "close" to a well-known one but is different in an important way that does not manifest as a terribly large English change (or, more precisely, a very large change in the model's vector space).

The downside is that the frontier between what fools the LLMs and what would fool a great deal of the humans in the class too shrinks all the time. Humans do not infinitely carefully parse their input either... as any teacher could tell you! Ye Olde "Read this entire problem before proceeding, {a couple of paragraphs of complicated instruction that will take 45 minutes to perform}, disregard all the previous and simply write 'flower' in the answer space" is an old chestnut that has been fooling humans for a long time, for instance. Given how jailbreaks work on LLMs, LLMs are probably much better at that than humans are, which I suppose shows you can construct problems in the other direction too.

(BRB... off to found a new CAPTCHA company for detecting LLMs based on LLMs being too much better than humans at certain tasks...)

replies(1): >>43612267 #

immibis ◴[07 Apr 25 14:57 UTC] No.43612267[source]▶

>>43611976 #

"Draw a wine glass filled to the brim with wine" worked recently on image generators. They only knew about half-full wine glasses.

If you asked a multimodal system questions about the image it just generated, it would tell you the wine was almost overflowing out of the top of the glass.

But any trick prompt like this is going to start giving expected results once it gets well-known enough.

Late edit: Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.
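
To make the point concrete, here is a minimal sketch (my own illustration, not anything the models or a benchmark actually run) of a breadth-first search over the river puzzle. The function name `min_crossings` and its `capacity` and `eats` parameters are hypothetical. It shows that these small wording changes really do change the answer, so reciting the classic solution is wrong for the modified versions.

```python
from collections import deque
from itertools import combinations

ITEMS = frozenset({"fox", "chicken", "cabbage"})
CLASSIC_EATS = (("fox", "chicken"), ("chicken", "cabbage"))


def min_crossings(capacity=1, eats=CLASSIC_EATS):
    """Return the fewest boat trips needed, or None if the puzzle is unsolvable.

    capacity: how many items the farmer may carry per trip (assumed parameter).
    eats: pairs that must not share a bank while the farmer is away.
    """
    def safe(bank):
        # A bank without the farmer is safe if no conflicting pair is on it.
        return not any(a in bank and b in bank for a, b in eats)

    # State: (items on the starting bank, is the farmer on the starting bank?)
    start = (ITEMS, True)
    goal = (frozenset(), False)
    queue = deque([(start, 0)])
    seen = {start}

    while queue:
        (left, farmer_left), trips = queue.popleft()
        if (left, farmer_left) == goal:
            return trips
        here = left if farmer_left else ITEMS - left
        # Try every cargo of 0..capacity items taken from the farmer's bank.
        for k in range(capacity + 1):
            for cargo in combinations(sorted(here), k):
                cargo = frozenset(cargo)
                new_left = left - cargo if farmer_left else left | cargo
                # The bank the farmer just left is now unattended.
                unattended = new_left if farmer_left else ITEMS - new_left
                if not safe(unattended):
                    continue
                state = (new_left, not farmer_left)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, trips + 1))
    return None


print(min_crossings())            # classic puzzle
print(min_crossings(capacity=3))  # farmer can bring three items per trip
print(min_crossings(eats=CLASSIC_EATS + (("cabbage", "fox"),)))  # cabbage also eats the fox
```

Under these assumptions the classic puzzle needs 7 crossings, the three-items-per-trip version needs 1, and if the cabbage also eats the fox (on top of the classic rules, one boat seat) there is no solution at all.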

replies(3): >>43613006 #>>43615618 #>>43618160 #

1. jerf ◴[07 Apr 25 16:06 UTC] No.43613006[source]▶

>>43612267 #

"But any trick prompt like this is going to start giving expected results once it gets well-known enough."

Which makes it difficult to fairly evaluate whether the models have actually gotten better at the feather/iron problem or whether they just saw enough samples of trick questions that they learned better, either naturally from the internet or fed in as part of the training data. I am fairly certain the training data has had "trick questions" like this added to it, because, I mean, why wouldn't it?

I have noticed in my playing with image AIs that they seem more prone than the LLMs to getting dragged into local maxima in cases where a human would understand the prompt. Perhaps it's all the additional data in an image that reveals it.
