
579 points paulpauper | 17 comments
InkCanon ◴[] No.43604503[source]
The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60%, etc. on IMO questions. This strongly suggests AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.
replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #
1. bglazer ◴[] No.43605451[source]
Yeah, I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior, and I’m currently stumped as to why my algorithm won’t converge.

So I described the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless: blog-slop “intro to ML” solutions and ideas. It ignored all the mathematical context, zeroed in on “doesn’t converge”, and suggested that I lower the learning rate. Like, no shit, I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.

I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get a low-tier ML blogspam author.

**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help, my email is bglazer1@gmail.com

I promise it's a fun mathematical puzzle, and the biology is pretty wild too.

replies(8): >>43605845 #>>43607258 #>>43607653 #>>43608731 #>>43609218 #>>43609908 #>>43615581 #>>43617498 #
2. root_axis ◴[] No.43605845[source]
It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.

Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice: it cycles through possible solutions without any kind of deductive analysis that might bring it gradually closer to one. If I keep asking, it eventually just starts cycling through variants of previous answers, with solutions that contradict the established logic of the error/output feedback up to that point.

Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.

replies(2): >>43605981 #>>43608793 #
3. jwrallie ◴[] No.43605981[source]
True. There’s a small bonus in that trying to explain the issue to the LLM is sometimes essentially rubber-ducking, and that can lead to insights. I feel that most of the time the LLM gives erroneous output that might still trigger some thinking in a different direction, and sometimes I’m inclined to think it’s helping me more than it actually is.
4. kristianp ◴[] No.43607258[source]
Have you tried Gemini 2.5? It's one of the best reasoning models, and it's available free in Google AI Studio.
5. airstrike ◴[] No.43607653[source]
I tend to prefer Claude over all things ChatGPT, so maybe give the latest model a try -- although in some ways I feel like 3.7 is a step down from the prior 3.5 model.
replies(1): >>43620198 #
6. torginus ◴[] No.43608731[source]
When I was an undergrad EE student a decade ago, I had to tangle a lot with complex math in my Signals & Systems and Electricity and Magnetism classes: stuff like Fourier transforms, hairy integrals, partial differential equations, etc.

Math packages of the time, like Mathematica and MATLAB, helped me immensely: once you could get the problem accurately described in the correct form, they could walk through the steps, solve systems of equations, and integrate tricky functions, even though AI was nowhere to be found back then.

I feel like ChatGPT is doing something similar when it does math with its chain-of-thought method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
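For readers who never used those packages, here is a rough Python/SymPy sketch of the kind of symbolic work described above (SymPy standing in for Mathematica/MATLAB, which aren't shown). The specific integrals and ODE system are made-up examples, not anything from the original coursework.

```python
# A small SymPy sketch of the kind of symbolic work described above:
# once the problem is stated precisely, the package does the mechanical math.
import sympy as sp

t, w = sp.symbols('t omega', real=True)
x, y = sp.symbols('x y', cls=sp.Function)

# A "hairy" integral: a Gaussian times a cosine, over the whole real line.
gauss_cos = sp.integrate(sp.exp(-t**2) * sp.cos(w * t), (t, -sp.oo, sp.oo))
print(gauss_cos)  # expected: sqrt(pi)*exp(-omega**2/4)

# Fourier transform of a two-sided decaying exponential.
F = sp.fourier_transform(sp.exp(-sp.Abs(t)), t, w)
print(sp.simplify(F))

# A small linear ODE system solved in closed form.
system = [sp.Eq(x(t).diff(t), -y(t)), sp.Eq(y(t).diff(t), x(t))]
print(sp.dsolve(system))
```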

7. worldsayshi ◴[] No.43608793[source]
> they're more like calculators of language than agents that reason

This might be homing in on both the issue and the actual value of LLMs. I think there's a lot of value in a "language calculator", but if it's continuously being sold as something it's not, we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there, but it's different from how we think about it.

8. MoonGhost ◴[] No.43609218[source]
I was working some time ago on an image-processing model using a GAN architecture: one model produces output and tries to fool the second, and both are trained together. Simple, but it requires a lot of extra effort to make it work. It's unstable and falls apart (blows up to an unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, and changing the models' architectures and sizes, as well as adjusting some coefficients through training to gradually rebalance the loss functions' influence.

The same may work with your problem. If it's unstable, try introducing extra 'brakes' that theoretically aren't required, maybe even incorrect ones, whatever that looks like in your domain. Another thing to check is the optimizer: try several, and check the default parameters. I've heard Adam's defaults can lead to instability later in training.
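Not the commenter's actual code, but a minimal PyTorch sketch of the kind of knobs being described: a toy GAN loop where the loss weighting and the Adam betas are explicit so they can be rebalanced during training. The data, model sizes, coefficients, and schedule below are placeholders.

```python
# Toy PyTorch GAN loop illustrating the "extra brakes" idea: explicit loss
# weights and non-default Adam betas that can be rebalanced during training.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))

# Adam's defaults (betas=(0.9, 0.999)) are often blamed for late-training GAN
# instability; (0.5, 0.999) is a commonly used alternative.
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Stand-in for real data: points on a noisy ring.
    theta = torch.rand(n, 1) * 6.2832
    return torch.cat([theta.cos(), theta.sin()], dim=1) + 0.05 * torch.randn(n, 2)

for step in range(1, 2001):
    # A gradually decaying weight on an auxiliary penalty (the "brake").
    aux_weight = max(0.0, 1.0 - step / 1000)

    # Discriminator update.
    real = real_batch()
    fake = G(torch.randn(real.size(0), latent_dim)).detach()
    d_loss = bce(D(real), torch.ones(real.size(0), 1)) \
           + bce(D(fake), torch.zeros(fake.size(0), 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator update, with an extra penalty keeping outputs bounded.
    fake = G(torch.randn(64, latent_dim))
    g_loss = bce(D(fake), torch.ones(64, 1)) + aux_weight * fake.pow(2).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```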

PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.

PPS: the fact that they can do known tasks with minor variations is already a huge time saver.

replies(1): >>43612717 #
9. ◴[] No.43609908[source]
10. bglazer ◴[] No.43612717[source]
Yes, I suspect that engineering the loss and hyperparams could eventually get this to work. However, I was hoping the model would help me get to a more fundamental insight into why the training falls into bad minima. For example, the Wasserstein GAN is a principled change to the GAN that improves stability, not just fiddling around with Adam’s beta parameters.
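For readers unfamiliar with why the Wasserstein GAN counts as a principled change rather than hyperparameter fiddling, here is a hedged sketch of its core modifications (unbounded critic scores, losses that are plain score differences, and weight clipping to keep the critic roughly Lipschitz, per Arjovsky et al., 2017). The models and constants below are placeholders, not the poster's setup.

```python
# Sketch of the core WGAN changes (Arjovsky et al., 2017): a critic with
# unbounded scores, losses that are plain score differences, and weight
# clipping to keep the critic roughly Lipschitz. Placeholders throughout.
import torch
import torch.nn as nn

latent_dim, data_dim, clip_value = 16, 2, 0.01

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
C = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # critic: no sigmoid

# The original paper recommends RMSprop for the critic rather than a
# momentum-heavy optimizer like Adam.
opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_C = torch.optim.RMSprop(C.parameters(), lr=5e-5)

def critic_step(real):
    fake = G(torch.randn(real.size(0), latent_dim)).detach()
    # Maximize E[C(real)] - E[C(fake)], i.e. minimize its negation.
    loss = C(fake).mean() - C(real).mean()
    opt_C.zero_grad(); loss.backward(); opt_C.step()
    # Weight clipping: a crude but principled way to bound the critic's Lipschitz constant.
    with torch.no_grad():
        for p in C.parameters():
            p.clamp_(-clip_value, clip_value)

def generator_step(batch_size=64):
    fake = G(torch.randn(batch_size, latent_dim))
    loss = -C(fake).mean()  # push the critic's score on fakes upward
    opt_G.zero_grad(); loss.backward(); opt_G.step()
```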

The reason I expected better mathematical reasoning is because the companies making them are very loudly proclaiming that these models are capable of high level mathematical reasoning.

And yes, the fact that I don’t have to look at matplotlib documentation anymore makes these models extremely useful already, but that’s qualitatively different from having Putnam-prize-winning reasoning ability.

replies(1): >>43617488 #
11. ◴[] No.43615581[source]
12. MoonGhost ◴[] No.43617488{3}[source]
One thing I forgot: your solution may never converge. In my case with the GAN, after training for a while the models start wobbling around some point, trying to outsmart each other, and then they _always_ explode. So I saved the weights periodically and took the best intermediate ones.
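A minimal sketch of the workaround being described (periodic checkpoints plus keeping the best intermediate weights), assuming hypothetical train_one_epoch and evaluate callables; none of these names come from the original post.

```python
# Checkpoint-and-keep-best pattern for unstable training, assuming hypothetical
# train_one_epoch(model, optimizer) and evaluate(model) callables where a
# higher evaluation score is better.
import copy
import torch

def train_with_best_snapshot(model, optimizer, train_one_epoch, evaluate,
                             epochs=100, ckpt_every=5, path="ckpt.pt"):
    best_score = float("-inf")
    best_state = copy.deepcopy(model.state_dict())

    for epoch in range(1, epochs + 1):
        train_one_epoch(model, optimizer)
        score = evaluate(model)

        # Keep the best intermediate weights, since later epochs may blow up.
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(model.state_dict())

        # Periodic on-disk checkpoint so a diverged run can still be salvaged.
        if epoch % ckpt_every == 0:
            torch.save({"epoch": epoch, "model": model.state_dict(),
                        "best": best_state, "best_score": best_score}, path)

    model.load_state_dict(best_state)  # hand back the best weights, not the last
    return model, best_score
```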
13. melagonster ◴[] No.43617498[source]
I doubt this is because his explanation is better. When I asked questions from Calculus I, ChatGPT just repeated content from textbooks. It is useful, but people should keep in mind where the limitations are.
14. pdimitar ◴[] No.43620198[source]
What do you find inferior in 3.7 compared to 3.5 btw? I only recently started using Claude so I don't have a point of reference.
replies(1): >>43621964 #
15. airstrike ◴[] No.43621964{3}[source]
It's hard to say; it's super subjective. It's just wrong more often, and sometimes it goes off on tangents relative to what I asked. Also, I might ask a question and it starts coding an entire React project. Every once in a while it will literally max out its response tokens because it can't stop writing code.

Just feels less "stable" or "tight" overall.

replies(1): >>43622207 #
16. pdimitar ◴[] No.43622207{4}[source]
I see. I have a similar feeling, as if they made it that way to force you to pay quickly (in my case, by quickly maxing out one conversation). I'm quite cynical and paranoid in this regard and I try hard not to be ruled by those two... but I can't shake the feeling that they're right this time.
replies(1): >>43622453 #
17. airstrike ◴[] No.43622453{5}[source]
I hear you, but FWIW I don't think it's on purpose, as it feels like an inferior product to me as a paid user.