
483 points mraniki | 1 comment
MrScruff ◴[] No.43534880[source]
The evidence given really doesn't justify the conclusion. Maybe it suggests 2.5 Pro might be better if you're asking it to build JavaScript apps from scratch, but that hardly equates to "It's better at coding". Feels like a lot of LLM articles follow this pattern: someone runs their own toy benchmarks and confidently extrapolates broad conclusions from a handful of data points. The SWE-Bench result carries a bit more weight, but even that should be taken with a pinch of salt.
replies(2): >>43535050 #>>43535652 #
1. throwaway0123_5 ◴[] No.43535050[source]
> The SWE-Bench result carries a bit more weight

Although I have issues with it (few benchmarks are perfect), I tend to agree. Gemini's 63.8 versus Sonnet's 62.3 isn't a huge jump, though. To Gemini's credit, it solved a bug in my PyTorch code yesterday that o1 (through the web app) couldn't (or at least didn't with my prompts).
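
For a rough sense of scale, here is a minimal Python sketch of that comparison. It assumes a 500-task benchmark (the size of SWE-bench Verified; the comment doesn't say which variant the scores come from) and treats the two scores as independent pass rates, which ignores that both models run on the same tasks. `score_gap` is a hypothetical helper name, not part of any library.

```python
import math


def score_gap(a_pct, b_pct, n_tasks):
    """Return (gap, standard error of the gap), both in percentage points.

    Assumption: each score is a pass rate over n_tasks independent tasks,
    and the two runs are independent of each other (a simplification).
    """
    pa, pb = a_pct / 100, b_pct / 100
    # Binomial variance of each pass rate, summed for the difference.
    variance = pa * (1 - pa) / n_tasks + pb * (1 - pb) / n_tasks
    return a_pct - b_pct, math.sqrt(variance) * 100


# Scores quoted in the comment; n_tasks=500 is an assumption, see above.
gap, se = score_gap(63.8, 62.3, n_tasks=500)
print(f"gap = {gap:.1f} points, standard error ~ {se:.1f} points")
```

Under those assumptions the 1.5-point gap is about half of one standard error (roughly 3 points), which fits with "isn't a huge jump".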