
555 points maheshrijal | 2 comments
georgewsinger ◴[] No.43707951[source]
Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient Claude models have been for best-in-coding class.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

jjani ◴[] No.43708068[source]
Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but it doesn't have an SWE-bench score. That shows that looking at one such benchmark isn't very telling. Its main advantage over Sonnet is that it's better at using a large amount of context, which is enormously helpful during coding tasks.

Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.

unsupp0rted ◴[] No.43708198[source]
Main advantage over Sonnet is Gemini 2.5 doesn't try to make a bunch of unrelated changes like it's rewriting my project from scratch.
itsmevictor ◴[] No.43708296[source]
I find Gemini 2.5 truly remarkable and overall better than Claude, which I was a big fan of.
enraged_camel ◴[] No.43708611[source]
Still doesn't work well in Cursor unfortunately.
1. plantain ◴[] No.43710997[source]
Working fine here. What problems do you see?
2. michaelbarton ◴[] No.43711498[source]
Not the OP but believe they could be referring to the fact it’s not supported in edit mode yet, only agent mode.

So far for me that’s not been too much of a roadblock, though I still find that Gemini overall struggles with more obscure issues, such as SQL errors in dbt.