anotherpaulg No.43164684
Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

anotherpaulg No.43166754
Using up to 32k thinking tokens, Sonnet 3.7 set SOTA with a 64.9% score.

  65% Sonnet 3.7, 32k thinking
  64% R1+Sonnet 3.5
  62% o1 high
  60% Sonnet 3.7, no thinking
  60% o3-mini high
  57% R1
  52% Sonnet 3.5
pclmulqdq No.43167134
Also for $36.83, compared to o1's $186.50 - roughly 5x cheaper.
pzo No.43168302
But that $36.83 also compares against $13.29 for DeepSeek R1 + claude-3-5, and the latter's "Percent using correct edit format" is 100% vs 97.8% for 3.7.

edit: it would be interesting to see how the DeepSeek R1 + claude-3-7 combo performs.

tw1984 No.43168469
Is there any public info on why the DeepSeek R1 + claude-3-5 combo worked better than using a single model?
Ballas No.43168727
In my experiments with the DeepSeek Qwen-32b distill model, the model did not follow the edit instructions - the output format was wrong. I know the distill models are not at all the same as the full model, but that could provide a clue. Combine that with the scores and you have a reasonable hypothesis: R1 is good at deciding what to change but unreliable at emitting well-formed edits, which Sonnet handles easily.
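For reference, aider's diff edit format wants SEARCH/REPLACE blocks like the one below, and "Percent using correct edit format" measures whether those markers parse. A rough sketch of what compliance means - the example reply is made up and the regex is my own approximation, not aider's actual parser:

  import re

  # One aider-style SEARCH/REPLACE block; the file path and code
  # are illustrative. The distill model kept mangling these markers.
  reply = "\n".join([
      "greeting.py",
      "<<<<<<< SEARCH",
      'print("helo")',
      "=======",
      'print("hello")',
      ">>>>>>> REPLACE",
  ])

  # Crude well-formedness check: the three markers must appear in order.
  BLOCK = re.compile(
      r"^<{7} SEARCH\n.*?^={7}\n.*?^>{7} REPLACE$",
      re.DOTALL | re.MULTILINE,
  )

  def follows_edit_format(text: str) -> bool:
      return bool(BLOCK.search(text))

  print(follows_edit_format(reply))  # True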
alienthrowaway No.43168884
Sonnet 3.5 is the best non-chain-of-thought code-authoring model. When paired with R1's CoT output, it performs even better - outperforming vanilla R1 (and everything else) - which suggests Sonnet is better than R1 at acting on R1's own CoT.

It's a scenario where the result is greater than the sum of its parts.
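Mechanically the combo is just a two-stage pipeline: ask R1 for a plan (CoT included), then hand that plan to Sonnet to write the actual edits. A minimal sketch using the two vendors' SDKs - the prompts, helper names, and glue are illustrative, not aider's actual internals:

  import anthropic
  from openai import OpenAI

  deepseek = OpenAI(base_url="https://api.deepseek.com", api_key="...")
  claude = anthropic.Anthropic(api_key="...")

  def plan_with_r1(task: str) -> str:
      resp = deepseek.chat.completions.create(
          model="deepseek-reasoner",
          messages=[{"role": "user", "content": task}],
      )
      msg = resp.choices[0].message
      # deepseek-reasoner returns its CoT separately from the final answer
      return (getattr(msg, "reasoning_content", "") or "") + "\n" + msg.content

  def edit_with_sonnet(task: str, plan: str) -> str:
      resp = claude.messages.create(
          model="claude-3-5-sonnet-20241022",
          max_tokens=4096,
          messages=[{
              "role": "user",
              "content": f"Task: {task}\n\nA reviewer proposed this plan:\n"
                         f"{plan}\n\nImplement it. Reply only with code edits.",
          }],
      )
      return resp.content[0].text

  task = "Fix the off-by-one error in paginate() in utils.py"
  print(edit_with_sonnet(task, plan_with_r1(task)))

If I remember right, this is what aider's architect mode automates: --model picks the reasoning model and --editor-model picks the one that writes the edits.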

re-thc No.43169268
> I know the distill models are not at all the same as the full model

It's far worse than that. It's not the (DeepSeek) model at all - it's a Qwen base model fine-tuned on DeepSeek R1 outputs. So it's still Qwen.
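The recipe (as I understand the R1 report) is plain supervised fine-tuning of a Qwen base model on R1-generated reasoning traces - reportedly ~800k curated samples. A toy sketch with HF transformers; the dataset and hyperparameters are obviously illustrative, and the real distills start from Qwen2.5-32B rather than the small model used here:

  import torch
  from torch.utils.data import Dataset
  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            Trainer, TrainingArguments)

  tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
  model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

  class TraceDataset(Dataset):
      """Prompt + R1 reasoning trace, packed as plain causal-LM text."""
      def __init__(self, pairs):
          self.items = [tok(p + "\n" + t, truncation=True, max_length=1024)
                        for p, t in pairs]
      def __len__(self):
          return len(self.items)
      def __getitem__(self, i):
          ids = torch.tensor(self.items[i]["input_ids"])
          return {"input_ids": ids, "labels": ids.clone()}

  pairs = [("Why is the sky blue?",
            "<think>...an R1 chain of thought...</think> Rayleigh scattering.")]
  trainer = Trainer(
      model=model,
      args=TrainingArguments(output_dir="qwen-r1-distill", num_train_epochs=1),
      train_dataset=TraceDataset(pairs),
  )
  trainer.train()  # the result is still Qwen, just steered toward R1's style

So the distill shares R1's training data, not its architecture or scale.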

WiSaGaN No.43169721
My personal experience is that R1 is smarter than 3.5 Sonnet, but 3.5 Sonnet is a better coder. So it may be best to let R1 tackle the problem and let 3.5 Sonnet implement the solution.
pythonaut_16 No.43171397
Specialization of AI models is cool - just as some people are better planners and others are better at raw coding.