←back to thread

555 points maheshrijal | 1 comments | | HN request time: 0.209s | source
Show context
georgewsinger ◴[] No.43707951[source]
Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient Claude models have been for best-in-coding class.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

replies(7): >>43708008 #>>43708068 #>>43708249 #>>43708545 #>>43709203 #>>43713202 #>>43716307 #
jjani ◴[] No.43708068[source]
Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but they don't have an SWE-bench score. Shows that looking at one such benchmark isn't very telling. Main advantage over Sonnet being that it's better at using a large amount of context, which is enormously helpful during coding tasks.

Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.

replies(6): >>43708198 #>>43709336 #>>43710444 #>>43712513 #>>43714843 #>>43720979 #
unsupp0rted ◴[] No.43708198[source]
Main advantage over Sonnet is Gemini 2.5 doesn't try to make a bunch of unrelated changes like it's rewriting my project from scratch.
replies(6): >>43708296 #>>43708338 #>>43708390 #>>43708580 #>>43708811 #>>43709225 #
erikw ◴[] No.43709225[source]
What language / framework are you using? I ask because in a Node / Typescript / React project I experience the opposite- Claude 3.7 usually solves my query on the first try, and seems to understand the project's context, ie the file structure, packages, coding guidelines, tests, etc, while Gemini 2.5 seems to install packages willy-nilly, duplicate existing tests, create duplicate components, etc.
replies(1): >>43715141 #
1. unsupp0rted ◴[] No.43715141[source]
Node / Vue