
579 points by paulpauper | 1 comment
lukev No.43604244
This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven.

I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.

But some of us are going to end up right and some of us are going to end up wrong, and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.

lherron No.43604396
Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal evidence for some time to come.

I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.

I am hopeful that the coming wave of vertically integrated applications, with guardrails and grounding built in, will move us away from having to hop between models every few weeks.

InkCanon No.43604540
Frankly, the overarching story about evals (one that receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claims of silver/gold-medal-level performance on the IMO. And with ARC-AGI, one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC-AGI questions and then train on those.
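
To make that concrete, here is a toy sketch of such a generator (the grid sizes and rule set are invented for illustration, not taken from any lab's actual pipeline): sample random grids, apply a simple hand-coded transformation, and emit each input/output pair as a "fresh" training example.

    import random

    # Toy sketch of gaming ARC-AGI with synthetic data: sample random
    # grids, apply a simple hand-coded rule, and emit input/output pairs.
    # The rules and sizes here are invented for illustration.

    def random_grid(h, w, colors=10):
        # ARC grids are small matrices of integer "colors" 0-9
        return [[random.randrange(colors) for _ in range(w)] for _ in range(h)]

    def mirror_horizontal(grid):
        return [row[::-1] for row in grid]

    def recolor(grid, mapping):
        return [[mapping.get(c, c) for c in row] for row in grid]

    RULES = [
        ("mirror", mirror_horizontal),
        ("recolor", lambda g: recolor(g, {i: (i + 1) % 10 for i in range(10)})),
    ]

    def synth_example():
        grid = random_grid(random.randint(3, 10), random.randint(3, 10))
        name, rule = random.choice(RULES)
        return {"rule": name, "input": grid, "output": rule(grid)}

    # Mass-produce examples that share the benchmark's surface structure
    dataset = [synth_example() for _ in range(100_000)]

A model trained on enough of these learns the generator's rule vocabulary rather than any general abstraction ability, so the benchmark score inflates without the capability it was meant to measure.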