Claude 3.7 Sonnet and Claude Code

(www.anthropic.com)

Show context

anotherpaulg ◴[24 Feb 25 20:40 UTC] No.43164684[source]▶

>>43163011 (OP) #

Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

replies(18): >>43164827 #>>43165382 #>>43165504 #>>43165555 #>>43165786 #>>43166186 #>>43166253 #>>43166387 #>>43166478 #>>43166688 #>>43166754 #>>43166976 #>>43167970 #>>43170020 #>>43172076 #>>43173004 #>>43173088 #>>43176914 #

nightpool ◴[24 Feb 25 23:55 UTC] No.43166387[source]▶

>>43164684 #

> 225 coding exercises from Exercism

Has there been any effort taken to reduce data leakage of this test set? Sounds like these exercises were available on the internet pre-2023, so they'll probably be included in the training data for any modern model, no?

replies(3): >>43168220 #>>43169155 #>>43169765 #

1. anotherpaulg ◴[25 Feb 25 04:34 UTC] No.43168220[source]▶

>>43166387 #

I try not to let perfect be the enemy of good. All benchmarks have limitations.

The Exercism problems have proven to be very effective at measuring an LLM's ability to modify existing code. I receive a lot of feedback that the aider benchmarks correlate strongly with people's "vibes" on model coding skill. I agree. The scores have felt quite aligned with my hands-on experience coding with most of the top models over the last 18+ months.

To be clear, the purpose of the benchmark is to help me quantitatively assess and improve aider and make it more effective. But it's also turned out to be a great way to measure the coding skill of LLMs.

replies(5): >>43169051 #>>43169163 #>>43169274 #>>43169586 #>>43170152 #

2. jrflowers ◴[25 Feb 25 07:24 UTC] No.43169163[source]▶

>>43168220 (TP) #

>I try not to let perfect be the enemy of good. All benchmarks have limitations.

Overfitting is one of the fundamental issues to contend with when trying to figure out if any type of model at all is useful. If your leaderboard corresponds to vibes and that is your target, you could just have a vibes leaderboard

3. Marazan ◴[25 Feb 25 07:44 UTC] No.43169274[source]▶

>>43168220 (TP) #

Having the verbatim answer to the test is not a "limitation" it is an invalidation.

replies(1): >>43169786 #

4. rodrigodlu ◴[25 Feb 25 08:46 UTC] No.43169586[source]▶

>>43168220 (TP) #

That's my perception as well. Most of the time, most of the devs I know, including myself, are not really creating novelty with the code itself, but with the product. (Sometimes even the product is not novel, just a similar enhanced version of existing products)

If the resulting code is not trying to be excessively clever or creative this is actually a good thing in my book.

The novelty and creativity should come from the product itself, especially from the users/customers perspective. Some people are too attached to LLM leaderboards being about novelty. I want reliable results whenever I give the instructions, either be the code, or the specs built into a spec file after throwing some ideas into prompts.

5. ◴[25 Feb 25 09:16 UTC] No.43169786[source]▶

>>43169274 #

6. guccihat ◴[25 Feb 25 10:23 UTC] No.43170152[source]▶

>>43168220 (TP) #

> The Exercism problems have proven to be very effective at measuring an LLM's ability to modify existing code

The Aider Polyglot website also states that the benchmark " ...asks the LLM to edit source files to complete 225 coding exercises".

However, when looking at the actual tests [0], it is not about editing code bases, it's rather just solving simple programming exercies? What am I missing?

[0] https://github.com/Aider-AI/polyglot-benchmark

↑