
2127 points | bakugo | 2 comments
anotherpaulg No.43164684
Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

replies(18): >>43164827, >>43165382, >>43165504, >>43165555, >>43165786, >>43166186, >>43166253, >>43166387, >>43166478, >>43166688, >>43166754, >>43166976, >>43167970, >>43170020, >>43172076, >>43173004, >>43173088, >>43176914
nightpool No.43166387
> 225 coding exercises from Exercism

Has any effort been made to reduce data leakage of this test set? These exercises sound like they were available on the internet pre-2023, so they'll probably be included in the training data for any modern model, no?
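One rough way to probe for this kind of leakage is a memorization check: show the model the first half of a public exercise and measure how much of the held-out second half it reproduces verbatim. Below is a minimal sketch of the overlap metric; the model call itself (some `complete(prompt)` function) is hypothetical and not part of any real API.

```python
def ngram_overlap(candidate: str, reference: str, n: int = 5) -> float:
    """Fraction of the reference's word n-grams that also appear in the candidate."""
    def ngrams(text: str) -> set:
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    ref = ngrams(reference)
    if not ref:
        return 0.0
    return len(ref & ngrams(candidate)) / len(ref)


def looks_memorized(model_output: str, held_out_half: str,
                    threshold: float = 0.5) -> bool:
    """Flag an exercise as likely seen in training if the model's
    continuation reproduces a large share of the held-out text."""
    return ngram_overlap(model_output, held_out_half) >= threshold
```

High overlap is only suggestive, not proof, since idiomatic solutions to small exercises converge naturally; the threshold would need tuning against exercises known to be unseen.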

replies(3): >>43168220, >>43169155, >>43169765
1. chvid No.43169155
They leak the second they are used on a model behind an API, don't they?
replies(1): >>43169851
2. chvid No.43169851
As far as I can tell, the only way to compare two models that cannot be easily gamed is to have both in open-weights form and then run them against a benchmark that was created after both models were trained.
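The protocol described above can be sketched in a few lines, under the assumption that each benchmark task carries a creation date and each model a (claimed) training cutoff; the names here are illustrative, not from any real harness:

```python
from datetime import date


def eligible_tasks(tasks, cutoff_a: date, cutoff_b: date) -> list:
    """Select benchmark tasks that neither model could have seen in training.

    tasks: iterable of (task_id, created) pairs, where `created` is a date.
    Only tasks created after the *later* of the two training cutoffs qualify.
    """
    latest_cutoff = max(cutoff_a, cutoff_b)
    return [task_id for task_id, created in tasks if created > latest_cutoff]
```

In practice the claimed cutoff dates are themselves hard to verify for closed models, which is the point of the open-weights requirement: with fixed weights, at least the model cannot be silently retrained on the benchmark after the fact.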