Recent AI model progress feels mostly like bullshit

(www.lesswrong.com)

InkCanon ◴[06 Apr 25 20:03 UTC] No.43604503[source]▶

>>43603453 (OP) #

The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc. performance on IMO questions. This strongly suggests that AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.

replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #

1. AIPedant ◴[06 Apr 25 20:52 UTC] No.43604865[source]▶

>>43604503 #

Yes, here's the link: https://arxiv.org/abs/2503.21934v1

Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:

1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)

2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)

I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...

I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!

replies(4): >>43608074 #>>43609801 #>>43610413 #>>43611877 #

2. JohnKemeny ◴[07 Apr 25 05:30 UTC] No.43608074[source]▶

>>43604865 (TP) #

Discussed here: https://news.ycombinator.com/item?id=43540985 (Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, 4 points, 2 comments).

3. otabdeveloper4 ◴[07 Apr 25 10:36 UTC] No.43609801[source]▶

>>43604865 (TP) #

Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)

This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general; these things are extremely lame to the proverbial "kids these days".)

replies(2): >>43609850 #>>43628785 #

4. bambax ◴[07 Apr 25 10:47 UTC] No.43609850[source]▶

>>43609801 #

How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.

replies(2): >>43609943 #>>43612276 #

5. otabdeveloper4 ◴[07 Apr 25 11:04 UTC] No.43609943{3}[source]▶

>>43609850 #

You just (lol) need to give non-standard problems and require students to provide reasoning and explanations along with the answer. Yeah, LLMs can "reason" too, but it's obvious when the output comes from an LLM here.

(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)

replies(1): >>43610395 #

6. itchyjunk ◴[07 Apr 25 12:12 UTC] No.43610395{4}[source]▶

>>43609943 #

Can you provide sample questions that are "LLM proof" ?

replies(3): >>43610624 #>>43611868 #>>43611976 #

7. larodi ◴[07 Apr 25 12:14 UTC] No.43610413[source]▶

>>43604865 (TP) #

This is a paper by INSAIT researchers - a very young institute which hired most of its PhD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waving their BG-GPT around on national TV in the country as a major breakthrough, while it was basically a fine-tuned Mistral model, which was never released to the public, nor was its training set.

Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on LinkedIn by these guys, literally celebrating every PhD they get, which is at minimum very weird. I'd take anything they release with a grain of salt, if not outright caution.

replies(3): >>43610872 #>>43614143 #>>43617257 #

8. otabdeveloper4 ◴[07 Apr 25 12:42 UTC] No.43610624{5}[source]▶

>>43610395 #

It's not about being "LLM-proof", it's about teacher involvement in making up novel questions and grading attentively. There's no magic trick.

9. xeromal ◴[07 Apr 25 14:22 UTC] No.43611868{5}[source]▶

>>43610395 #

Part of the proof is knowing your students and forcing an answer that will rat out whether they used an LLM. There is no universal question and it requires personal knowledge of each student. You're looking for something that doesn't exist.

10. apercu ◴[07 Apr 25 14:22 UTC] No.43611877[source]▶

>>43604865 (TP) #

In my experience LLMs can't get basic western music theory right, there's no way I would use an LLM for something harder than that.

replies(3): >>43613409 #>>43622057 #>>43628772 #

11. jerf ◴[07 Apr 25 14:31 UTC] No.43611976{5}[source]▶

>>43610395 #

The models have moved on past this working reliably, but an example that I found in the early days of LLMs is asking it "Which is heavier, two pounds of iron or a pound of feathers?" You could very easily trick it into giving the answer about how they're both the same, because of the number of training instances of the well-known question about a pound of each that it encountered.

You can still do this to the current models, though it takes more creativity; you can bait it into giving wrong answers if you ask a question that is "close" to a well-known one but is different in an important way that does not manifest as a terribly large English change (or, more precisely, a very large change in the model's vector space).

The downside is that the frontier between what fools the LLMs and what would fool a great deal of the humans in the class too shrinks all the time. Humans do not infinitely carefully parse their input either... as any teacher could tell you! Ye Olde "Read this entire problem before proceeding, {a couple of paragraphs of complicated instruction that will take 45 minutes to perform}, disregard all the previous and simply write 'flower' in the answer space" is an old chestnut that has been fooling humans for a long time, for instance. Given how jailbreaks work on LLMs, LLMs are probably much better at that than humans are, which I suppose shows you can construct problems in the other direction too.

(BRB... off to found a new CAPTCHA company for detecting LLMs based on LLMs being too much better than humans at certain tasks...)

replies(1): >>43612267 #

12. immibis ◴[07 Apr 25 14:57 UTC] No.43612267{6}[source]▶

>>43611976 #

"Draw a wine glass filled to the brim with wine" worked recently on image generators. They only knew about half-full wine glasses.

If you asked a multimodal system questions about the image it just generated, it would tell you the wine was almost overflowing out of the top of the glass.

But any trick prompt like this is going to start giving expected results once it gets well-known enough.

Late edit: Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.
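For reference, the modified river-crossing variants can be checked mechanically. The following is a minimal sketch, not anything from the thread: a breadth-first search over bank states, where the function name `solve`, the `eats` pair set, and the `capacity` parameter are all illustrative assumptions. It also assumes that "the cabbage will eat the fox" replaces the original rules rather than adding to them.

```python
from collections import deque
from itertools import combinations

ITEMS = frozenset({"fox", "chicken", "cabbage"})

def solve(eats, capacity):
    """Breadth-first search over (items on left bank, farmer side) states.

    eats: set of (eater, eaten) pairs that must not share an unattended bank.
    capacity: how many items the farmer may carry per trip (hypothetical knob).
    Returns the shortest list of trips, each a tuple of carried items, or None.
    """
    def safe(bank):
        return not any(a in bank and b in bank for a, b in eats)

    start = (ITEMS, "left")
    goal = (frozenset(), "right")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, side), path = queue.popleft()
        if (left, side) == goal:
            return path
        here = left if side == "left" else ITEMS - left
        for k in range(capacity + 1):
            for cargo in combinations(sorted(here), k):
                moved = frozenset(cargo)
                new_left = left - moved if side == "left" else left | moved
                new_side = "right" if side == "left" else "left"
                # The bank the farmer just left is now unattended.
                unattended = new_left if new_side == "right" else ITEMS - new_left
                if not safe(unattended):
                    continue
                state = (new_left, new_side)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, path + [cargo]))
    return None

original_rules = {("fox", "chicken"), ("chicken", "cabbage")}
classic = solve(original_rules, capacity=1)
three_per_trip = solve(original_rules, capacity=3)
cabbage_eats_fox = solve({("cabbage", "fox")}, capacity=1)

for name, plan in [("classic", classic),
                   ("three per trip", three_per_trip),
                   ("cabbage eats fox", cabbage_eats_fox)]:
    print(name, len(plan), plan)
```

The point of the sketch is that each modification changes the shortest plan, so an answer that recites the classic seven-crossing solution is wrong for the modified problem.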

replies(3): >>43613006 #>>43615618 #>>43618160 #

13. hyperbovine ◴[07 Apr 25 14:58 UTC] No.43612276{3}[source]▶

>>43609850 #

By giving pen and paper exams and telling your students that the only viable preparation strategy is doing the hw assignments themselves :)

replies(3): >>43613586 #>>43614742 #>>43620870 #

14. jerf ◴[07 Apr 25 16:06 UTC] No.43613006{7}[source]▶

>>43612267 #

"But any trick prompt like this is going to start giving expected results once it gets well-known enough."

Which makes it difficult to fairly evaluate whether the models have actually gotten better at the feather/iron problem or whether they just got enough samples of trick questions that they learned better, either naturally from the internet, or fed as part of the training data. I am fairly certain the training data has had "trick questions" like this added to it, because, I mean, why wouldn't it?

I have noticed in my playing with image AIs that they seem more prone than the LLMs to getting dragged into local maxima in cases where a human would understand the prompt. Perhaps it's all the additional data in an image that reveals it.

15. waffletower ◴[07 Apr 25 16:48 UTC] No.43613409[source]▶

>>43611877 #

I may be mistaken, but I don't believe that LLMs are trained on a large corpus of machine-readable music representations, which would arguably be crucial to strong performance in common practice music theory. I would also surmise that most music-theory-related datasets largely arrive without musical representations altogether. A similar problem exists for many other fields, particularly mathematics, but it is much more profitable to invest the effort to span such representation gaps for them. I would not gauge LLM generality on music theory performance, when its niche representations are likely unavailable in training and it is widely perceived as having minuscule economic value.

16. bambax ◴[07 Apr 25 17:02 UTC] No.43613586{4}[source]▶

>>43612276 #

You wish. I used to think that too. But it turns out that nowadays every single in-person exam is taken with a phone hidden somewhere, with varying effectiveness, and you can't really strip-search students before they enter the room.

Some teachers try to collect the phones beforehand, but then students simply give out older phones and keep their active ones with them.

You could try to verify that the phones they're giving out are working by calling them, but that would take an enormous amount of time and it's impractical for simple exams.

We really have no idea how much AI is ruining education right now.

replies(1): >>43615673 #

17. ◴[07 Apr 25 17:58 UTC] No.43614143[source]▶

>>43610413 #

18. econ ◴[07 Apr 25 19:03 UTC] No.43614742{4}[source]▶

>>43612276 #

Or you simply account for it and provide equally challenging tasks adjusted for the tools of the time. Give them access to the best LLMs money can buy.

After all, they will grow up next to these things. They will do the homework today; by the time they graduate the LLM will take their job. There might be human large language model managers for a while, soon to be replaced by the age of idea men.

19. achierius ◴[07 Apr 25 20:35 UTC] No.43615673{5}[source]▶

>>43613586 #

Unlike the hard problem of "making an exam difficult to take when you have access to an LLM", "making sure students don't have devices on them when they take one" is very tractable, even if teachers are going to need some time to catch up with the curve.

Any of the following could work, though the specific tradeoffs & implementation details do vary:

- have <n> teachers walking around the room to watch for cheaters

- mount a few cameras to various points in the room and give the teacher a dashboard so that they can watch from all angles

- record from above and use AI to flag potential cheaters for manual review

- disable Wi-Fi + activate cell jammers during exam time (with a land-line in the room in case of emergencies?)

- build dedicated examination rooms lined with metal mesh to disrupt cell reception

So unlike "beating LLMs" (where it's an open question as to whether it's even possible, and a moving target to boot), barring serious advances in wearable technology this just seems like a question of funding and therefore political will.

replies(2): >>43615891 #>>43617261 #

20. atiedebee ◴[07 Apr 25 21:00 UTC] No.43615891{6}[source]▶

>>43615673 #

Cell jammers sound like they could be a security risk. In the context of high school, it is generally very easy to see when someone is on their phone.

21. sealeck ◴[08 Apr 25 00:28 UTC] No.43617257[source]▶

>>43610413 #

Half the researchers are at ETH Zurich (INSAIT is a partnership between EPFL, ETH and Sofia) - hardly an unreliable institution.

22. sealeck ◴[08 Apr 25 00:29 UTC] No.43617261{6}[source]▶

>>43615673 #

Infrared camera should do the trick.

23. int_19h ◴[08 Apr 25 03:40 UTC] No.43618160{7}[source]▶

>>43612267 #

> Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.

This is still the case. Very few non-reasoning models can solve such variations correctly, even SOTA models. Worse yet, not only do they confidently give wrong responses, but they often do so even when specifically told to use CoT, and they continue giving wrong answers in a loop even if you specifically point out where they are wrong.

Reasoning models do much better, though. E.g. QwQ-32b can solve it pretty reliably, although it takes a lot of tokens for it to explore the possibilities. But at least it can fairly consistently tell when it's doing something wrong and then backtrack.

One other example that befuddles even the reasoning models is frying-cubes-in-a-pan and equivalents, e.g. this version from Simple Bench:

> Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute? Pick the most realistic answer option. A) 5 B) 11 C) 0 D) 20
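To make the trap in that puzzle explicit, here is a small sketch of the arithmetic a pattern-matching solver is tempted to do. The numbers come from the puzzle text; the variable names are illustrative, and the reading that the "most realistic" option is C) 0 (ice does not stay whole in a pan hot enough to fry an egg) is an assumption about the benchmark's intent, not something stated in the quote.

```python
# Numbers taken from the puzzle text; variable names are illustrative.
minutes = 4
average_per_minute = 5
placed = {1: 4, 2: 5, 4: 0}  # cubes placed per minute; minute 3 is unknown

total_placed = average_per_minute * minutes          # average times minutes
third_minute = total_placed - sum(placed.values())   # total minus known minutes
cubes_added_by_end_of_minute_3 = placed[1] + placed[2] + third_minute

print(third_minute)                    # matches distractor option B
print(cubes_added_by_end_of_minute_3)  # matches distractor option D
```

Both computed values appear among the answer options, which is presumably what makes the question effective: the algebra is easy and leads straight to a distractor, while the realistic answer requires ignoring the algebra altogether.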

24. billy99k ◴[08 Apr 25 12:23 UTC] No.43620870{4}[source]▶

>>43612276 #

Making in-person tests the only thing that counts toward your grade seems to be a step in the right direction. If students use AI to do their homework, it will only hurt them in the long run.

25. code_for_monkey ◴[08 Apr 25 14:15 UTC] No.43622057[source]▶

>>43611877 #

music theory is a really good test because in my experience the AI is extremely bad at it

26. motorest ◴[09 Apr 25 04:00 UTC] No.43628772[source]▶

>>43611877 #

> In my experience LLMs can't get basic western music theory right, there's no way I would use an LLM for something harder than that.

This take is completely oblivious, and frankly sounds like a desperate jab. There are a myriad of activities whose core requirement is a) derive info from a complex context which happens to be supported by a deep and plentiful corpus, b) employ glorified template and rule engines.

LLMs excel at what might be described as interpolating context following input and output in natural language. As in a chatbot that is extensively trained in domain-specific tasks, which can also parse and generate content. There are absolutely zero lines of intellectual work that do not benefit extensively from this sort of tool. Zero.

replies(1): >>43642753 #

27. motorest ◴[09 Apr 25 04:03 UTC] No.43628785[source]▶

>>43609801 #

> This effectively makes LLMs useless for education.

No. You're only arguing LLMs are useless at regurgitating homework assignments to allow students to avoid doing it.

The point of education is not mindlessly doing homework.

28. apercu ◴[10 Apr 25 11:17 UTC] No.43642753{3}[source]▶

>>43628772 #

A desperate jab? But I _want_ LLMs to be able to do basic, deterministic things accurately. Seems like I touched a nerve? Lol.
