anotherpaulg No.43164684
Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

anotherpaulg No.43166754
Using up to 32k thinking tokens, Sonnet 3.7 set SOTA with a 64.9% score.

  65% Sonnet 3.7, 32k thinking
  64% R1+Sonnet 3.5
  62% o1 high
  60% Sonnet 3.7, no thinking
  60% o3-mini high
  57% R1
  52% Sonnet 3.5
pclmulqdq No.43167134
Also for $36.83, compared to o1's $186.50 - roughly 5x cheaper.
pzo No.43168302
But that $36.83 also compares against $13.29 for DeepSeek R1 + claude-3-5, and the latter's "Percent using correct edit format" is 100% vs 97.8% for 3.7.

edit: it would be interesting to see how the DeepSeek R1 + claude-3-7 combo performs.

tw1984 No.43168469
Is there any public info on why the DeepSeek R1 + claude-3-5 combo worked better than using a single model?
Ballas No.43168727
In my experiments with the DeepSeek Qwen-32b distill model, the model did not follow the edit instructions - the output format was wrong. I know the distill models are not at all the same as the full model, but that could provide a clue. Combine that with the scores and you have a reasonable hypothesis: R1 is good at deciding what to change but unreliable at emitting well-formed edits, which Sonnet handles easily.
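For reference, aider's diff edit format wants SEARCH/REPLACE blocks like the one below, and "Percent using correct edit format" measures whether those markers parse. A rough sketch of what compliance means - the example reply is made up and the regex is my own approximation, not aider's actual parser:

  import re

  # One aider-style SEARCH/REPLACE block; the file path and code
  # are illustrative. The distill model kept mangling these markers.
  reply = "\n".join([
      "greeting.py",
      "<<<<<<< SEARCH",
      'print("helo")',
      "=======",
      'print("hello")',
      ">>>>>>> REPLACE",
  ])

  # Crude well-formedness check: the three markers must appear in order.
  BLOCK = re.compile(
      r"^<{7} SEARCH\n.*?^={7}\n.*?^>{7} REPLACE$",
      re.DOTALL | re.MULTILINE,
  )

  def follows_edit_format(text: str) -> bool:
      return bool(BLOCK.search(text))

  print(follows_edit_format(reply))  # True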
alienthrowaway No.43168884
Sonnet 3.5 is the best non-chain-of-thought code-authoring model. When paired with R1's CoT output, it performs even better - outperforming vanilla R1 (and everything else) - which suggests Sonnet is better than R1 at acting on R1's own CoT.

It's a scenario where the result is greater than the sum of its parts.
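Mechanically the combo is just a two-stage pipeline: ask R1 for a plan (CoT included), then hand that plan to Sonnet to write the actual edits. A minimal sketch using the two vendors' SDKs - the prompts, helper names, and glue are illustrative, not aider's actual internals:

  import anthropic
  from openai import OpenAI

  deepseek = OpenAI(base_url="https://api.deepseek.com", api_key="...")
  claude = anthropic.Anthropic(api_key="...")

  def plan_with_r1(task: str) -> str:
      resp = deepseek.chat.completions.create(
          model="deepseek-reasoner",
          messages=[{"role": "user", "content": task}],
      )
      msg = resp.choices[0].message
      # deepseek-reasoner returns its CoT separately from the final answer
      return (getattr(msg, "reasoning_content", "") or "") + "\n" + msg.content

  def edit_with_sonnet(task: str, plan: str) -> str:
      resp = claude.messages.create(
          model="claude-3-5-sonnet-20241022",
          max_tokens=4096,
          messages=[{
              "role": "user",
              "content": f"Task: {task}\n\nA reviewer proposed this plan:\n"
                         f"{plan}\n\nImplement it. Reply only with code edits.",
          }],
      )
      return resp.content[0].text

  task = "Fix the off-by-one error in paginate() in utils.py"
  print(edit_with_sonnet(task, plan_with_r1(task)))

If I remember right, this is what aider's architect mode automates: --model picks the reasoning model and --editor-model picks the one that writes the edits.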

re-thc No.43169268
> I know the distill models are not at all the same as the full model

It's far worse than that. It's not the (DeepSeek) model at all - it's a Qwen base model fine-tuned on DeepSeek R1 outputs. So it's still Qwen.
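The recipe (as I understand the R1 report) is plain supervised fine-tuning of a Qwen base model on R1-generated reasoning traces - reportedly ~800k curated samples. A toy sketch with HF transformers; the dataset and hyperparameters are obviously illustrative, and the real distills start from Qwen2.5-32B rather than the small model used here:

  import torch
  from torch.utils.data import Dataset
  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            Trainer, TrainingArguments)

  tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
  model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

  class TraceDataset(Dataset):
      """Prompt + R1 reasoning trace, packed as plain causal-LM text."""
      def __init__(self, pairs):
          self.items = [tok(p + "\n" + t, truncation=True, max_length=1024)
                        for p, t in pairs]
      def __len__(self):
          return len(self.items)
      def __getitem__(self, i):
          ids = torch.tensor(self.items[i]["input_ids"])
          return {"input_ids": ids, "labels": ids.clone()}

  pairs = [("Why is the sky blue?",
            "<think>...an R1 chain of thought...</think> Rayleigh scattering.")]
  trainer = Trainer(
      model=model,
      args=TrainingArguments(output_dir="qwen-r1-distill", num_train_epochs=1),
      train_dataset=TraceDataset(pairs),
  )
  trainer.train()  # the result is still Qwen, just steered toward R1's style

So the distill shares R1's training data, not its architecture or scale.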

WiSaGaN No.43169721
My personal experience is that R1 is smarter than 3.5 Sonnet, but 3.5 Sonnet is a better coder. So it may be best to let R1 tackle the problem and let 3.5 Sonnet implement the solution.
pythonaut_16 No.43171397
Specialization of AI models is cool - just as some people are better planners and others are better at raw coding.