
1503 points | participant3 | 4 comments
MgB2 ◴[] No.43574927[source]
Idk, the models generating what are basically 1:1 copies of the training data from pretty generic descriptions feels like a severe case of overfitting to me. What use is a generative model that just regurgitates its training data?

I feel like the less advanced generations, maybe even because of their size limitations, were better at coming up with something that at least feels new.

In the end, other than for copyright-washing, why wouldn't I just use the original movie still/photo in the first place?

replies(13): >>43575052 #>>43575080 #>>43575231 #>>43576085 #>>43576153 #>>43577026 #>>43577350 #>>43578381 #>>43578512 #>>43578581 #>>43579012 #>>43579408 #>>43582494 #
vjerancrnjak ◴[] No.43578512[source]
If it overfits on the whole internet, then it's like a search engine that returns really relevant results, with some lossy side effects.

A recent benchmark on the unseen 2025 Math Olympiad showed that none of the models can actually problem-solve. They all, accidentally or on purpose, had prior solutions in the training set.

replies(1): >>43578536 #
1. jks ◴[] No.43578536[source]
You probably mean the USAMO 2025 paper. They updated their comparison with Gemini 2.5 Pro, which did get a nontrivial score. That Gemini version was released five days after USAMO, so while it's not entirely impossible for the data to be in its training set, it would seem kind of unlikely.

https://x.com/mbalunovic/status/1907436704790651166

replies(3): >>43578694 #>>43578876 #>>43581515 #
2. jsemrau ◴[] No.43578694[source]
That timing is actually suspicious. And it would not be the first time something like this has happened.
3. iamacyborg ◴[] No.43578876[source]
I was noodling with Gemini 2.5 Pro a couple of days ago, and it was convinced Donald Trump didn't win the 2024 election and that he conceded to Kamala Harris, so I'm not entirely sure how much weight I'd put behind it.
4. MatthiasPortzel ◴[] No.43581515[source]
The claim is that these models are training on data which includes the problems and explanations. The fact that the first model trained after the public release of the questions (and crowdsourced answers) performs best is not a counterexample; it is exactly what the claim predicts.