579 points by paulpauper | 5 comments
InkCanon
The biggest story in AI broke a few weeks ago but got little attention: on the recent USAMO, SOTA models scored around 5% on average (IIRC, it was some abysmal number). This is despite supposedly having gotten 50%, 60%, etc. on IMO questions. That strongly suggests these models simply memorize past results instead of actually solving the problems. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from the training data.
bglazer
Yeah, I'm a computational biology researcher. I'm working on a novel machine learning approach to inferring cellular behavior, and I'm currently stumped as to why my algorithm won't converge.

So, I described the mathematics to ChatGPT (o3-mini-high) to try to help reason about what's going on. It was almost completely useless: blog-slop "intro to ML" solutions and ideas. It ignored all the mathematical context, zeroed in on "doesn't converge", and suggested that I lower the learning rate. Like, no shit, I tried that three weeks ago. No amount of cajoling could get it to meaningfully "reason" about the problem, because it hasn't seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.
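
For the record, the sum total of its advice amounts to something like this (a hypothetical sketch in PyTorch; the model and data here are stand-ins, not my actual code):

    # Hypothetical stand-ins: not my model, not my data.
    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                        # placeholder architecture
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy batch

    # The entirety of the suggested "fix": drop Adam's learning rate 10x and retry.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # was 1e-3

    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()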

I can't stress enough how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry's eyes) grad student looking at this, but I can't seem to elicit that. Instead I get a low-tier ML blogspam author.

**PS**: if anyone read this far (doubtful), knows about density estimation, and wants to help, my email is bglazer1@gmail.com

I promise it's a fun mathematical puzzle, and the biology is pretty wild too.

1. airstrike
I tend to prefer Claude over all things ChatGPT, so maybe give the latest model a try -- although in some ways I feel like 3.7 is a step down from the prior 3.5 model.
2. pdimitar
What do you find inferior in 3.7 compared to 3.5, btw? I only recently started using Claude, so I don't have a point of reference.
3. airstrike
It's hard to say; it's super subjective. It's just wrong more often, and sometimes it goes off on tangents w.r.t. what I asked. Also, I might ask a simple question and it starts coding an entire React project. Every once in a while it will literally max out its response tokens because it can't stop writing code.

Just feels less "stable" or "tight" overall.

4. pdimitar
I see. I have a similar feeling: as if they made it this way to force you to pay quickly (in my case, by quickly maxing out a single conversation). I'm quite cynical and paranoid in this regard, and I try hard not to be ruled by those two traits... but I can't shake the feeling that they're right this time.
5. airstrike
I hear you, but FWIW I don't think it's on purpose: it feels like an inferior product to me even as a paid user.