(composio.dev)

483 points mraniki | 1 comments | 31 Mar 25 12:09 UTC

MrScruff [31 Mar 25 13:34 UTC] No.43534880

The evidence given really doesn't justify the conclusion. Maybe it suggests 2.5 Pro might be better if you're asking it to build JavaScript apps from scratch, but that hardly equates to "It's better at coding". It feels like a lot of LLM articles follow this pattern: someone runs their own toy benchmarks and confidently extrapolates broad conclusions from a handful of data points. The SWE-Bench result carries a bit more weight, but even that should be taken with a pinch of salt.

replies(2): >>43535050 #>>43535652 #

1. throwaway0123_5 [31 Mar 25 13:49 UTC] No.43535050

>>43534880 #

> The SWE-Bench result carries a bit more weight

Although I have issues with it (few benchmarks are perfect), I tend to agree. Gemini's 63.8 versus Sonnet's 62.3 isn't a huge jump, though. To Gemini's credit, it solved a bug in my PyTorch code yesterday that o1 (through the web app) couldn't (or at least didn't with my prompts).

Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison