
2127 points | bakugo
anotherpaulg No.43164684
Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

replies(18): >>43164827 #>>43165382 #>>43165504 #>>43165555 #>>43165786 #>>43166186 #>>43166253 #>>43166387 #>>43166478 #>>43166688 #>>43166754 #>>43166976 #>>43167970 #>>43170020 #>>43172076 #>>43173004 #>>43173088 #>>43176914 #
anotherpaulg No.43166754
Using up to 32k thinking tokens, Sonnet 3.7 set SOTA with a 64.9% score.

  65% Sonnet 3.7, 32k thinking
  64% R1+Sonnet 3.5
  62% o1 high
  60% Sonnet 3.7, no thinking
  60% o3-mini high
  57% R1
  52% Sonnet 3.5
replies(4): >>43167134 #>>43168719 #>>43168852 #>>43169016 #
pclmulqdq No.43167134
Also for $36.83, compared to o1's $186.50.
replies(1): >>43168302 #
pzo No.43168302
But that $36.83 also compares to $13.29 for DeepSeek R1 + claude-3-5, and the latter's "Percent using correct edit format" is 100% vs 97.8% for 3.7.

edit: it would be interesting to see how the combo DeepSeek R1 + claude-3-7 performs.

replies(1): >>43168469 #
tw1984 No.43168469
Is there any public info on why such a DeepSeek R1 + claude-3-5 combo worked better than using a single model?
replies(3): >>43168727 #>>43168884 #>>43169721 #
alienthrowaway No.43168884
Sonnet 3.5 is the best non-Chain-of-Thought code-authoring model. When paired with R1's CoT output, Sonnet 3.5 performs even better - outperforming vanilla R1 (and everything else), which suggests Sonnet is better than R1 at utilizing R1's CoT.

It's a scenario where the result is greater than the sum of its parts.
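The pairing works roughly like aider's architect/editor split: one model produces the reasoning, a second model turns that reasoning into the actual edit. A minimal Python sketch with both model calls stubbed out (the function names and prompt shapes are illustrative assumptions, not aider's actual API):

```python
# Hypothetical two-stage "reasoner + editor" pipeline.
# Stage 1 (R1's role): produce a chain-of-thought plan for the task.
# Stage 2 (Sonnet's role): consume that plan and author the concrete edit.
# Real use would replace these stubs with calls to two LLM APIs.

def reasoning_model(task: str) -> str:
    """Stand-in for the CoT model: returns a plan, not code."""
    return f"Plan: 1) locate the relevant code, 2) {task}, 3) update tests"

def editor_model(task: str, plan: str) -> str:
    """Stand-in for the code-authoring model: turns the plan into an edit."""
    return f"EDIT based on [{plan}] for task: {task}"

def architect_editor(task: str) -> str:
    # The reasoning model's own code output (if any) is discarded;
    # only its chain of thought is forwarded to the editor model.
    plan = reasoning_model(task)
    return editor_model(task, plan)

print(architect_editor("rename parse() to parse_config()"))
```

The point of the split is that the editor model never has to reason from scratch; it only has to faithfully translate an already-worked-out plan into well-formed edits, which is the part Sonnet is strongest at.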