
2127 points bakugo | 1 comments
anotherpaulg ◴[] No.43164684[source]
Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

replies(18): >>43164827 #>>43165382 #>>43165504 #>>43165555 #>>43165786 #>>43166186 #>>43166253 #>>43166387 #>>43166478 #>>43166688 #>>43166754 #>>43166976 #>>43167970 #>>43170020 #>>43172076 #>>43173004 #>>43173088 #>>43176914 #
doctoboggan ◴[] No.43167970[source]
Have you tried DeepSeek R1 as the architect with Claude 3.7 as the editor? Seeing as "DeepSeek R1 + claude-3-5-sonnet-20241022" is in second place, "DeepSeek R1 + claude-3-7" would hopefully be the highest-ranking option so far.
replies(1): >>43168426 #
SparkyMcUnicorn ◴[] No.43168426[source]
It looks like Sonnet 3.7 (extended thinking) would be a better architect than R1.

I'll be trying out Sonnet 3.7 extended thinking + Sonnet 3.5 or Flash 2.0, which I assume would be at the top of the leaderboard.

replies(1): >>43178388 #
attentive ◴[] No.43178388[source]
Given that 3.5 and 3.7 cost the same, it doesn't make sense to use 3.5 here.

I'd like to see that benchmark, but R1 + 3.7 should be cheaper than 3.7 thinking + 3.7.

replies(1): >>43178578 #
1. SparkyMcUnicorn ◴[] No.43178578[source]
The reason 3.5 (as the editor) makes more sense to me is the edit format success rate (99.6% vs 3.7's 93.3%).

Flash 2.0 got 100% on the edit format, and it's extremely cheap, so I'm pretty curious how that would score.
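
To put the edit-format numbers above in perspective, here is a rough sketch of the arithmetic. It only uses the three success rates quoted in this comment; the dictionary and the `malformed_rate` helper are made up for illustration and are not part of aider. It says nothing about how well any pairing would actually score on the benchmark.

```python
# Edit-format success rates quoted in the comment above, in percent.
# These are the only data points used; nothing here is measured.
edit_format_success = {
    "Sonnet 3.5": 99.6,
    "Sonnet 3.7": 93.3,
    "Flash 2.0": 100.0,
}


def malformed_rate(success_pct: float) -> float:
    """Percent of responses that did NOT follow the edit format.

    Hypothetical helper for illustration; rounded to one decimal
    place to hide floating-point noise.
    """
    return round(100.0 - success_pct, 1)


malformed = {
    name: malformed_rate(pct) for name, pct in edit_format_success.items()
}

# List candidate editors from fewest to most malformed edits.
for name, rate in sorted(malformed.items(), key=lambda item: item[1]):
    print(f"{name}: {rate}% malformed edits")
```

Seen this way, the gap between 99.6% and 93.3% is the difference between a malformed edit in roughly 1 of 250 responses and roughly 1 of 15, which is why the editor choice seems worth testing separately from the architect.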