Claude 3.7 Sonnet and Claude Code

(www.anthropic.com)

2127 points bakugo | 1 comments | 24 Feb 25 18:28 UTC

anotherpaulg ◴[24 Feb 25 20:40 UTC] No.43164684[source]▶

>>43163011 (OP) #

Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

replies(18): >>43164827 #>>43165382 #>>43165504 #>>43165555 #>>43165786 #>>43166186 #>>43166253 #>>43166387 #>>43166478 #>>43166688 #>>43166754 #>>43166976 #>>43167970 #>>43170020 #>>43172076 #>>43173004 #>>43173088 #>>43176914 #

nightpool ◴[24 Feb 25 23:55 UTC] No.43166387[source]▶

>>43164684 #

> 225 coding exercises from Exercism

Has there been any effort taken to reduce data leakage of this test set? Sounds like these exercises were available on the internet pre-2023, so they'll probably be included in the training data for any modern model, no?

replies(3): >>43168220 #>>43169155 #>>43169765 #

jonplackett ◴[25 Feb 25 09:14 UTC] No.43169765[source]▶

>>43166387 #

I like to make up my own tests; that way you know it is actually thinking.

Tests that require thinking about the physical world are the most revealing.

My new favourite is:

You have 2 minutes to cool down a cup of coffee to the lowest temp you can.

You have two options:

1. Add cold milk immediately, then let it sit for 2 mins.

2. Let it sit for 2 mins, then add cold milk.

Which one cools the coffee to the lowest temperature and why?

Phrased this way, without any help, all but the thinking models get it wrong.
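For anyone who would rather check an answer numerically than argue about it, here is a minimal sketch of the two options. It is a toy model, not a careful physics simulation. It assumes Newton's law of cooling with a single rate constant that stays the same after the milk goes in, milk that stays at fridge temperature until it is added, and milk and coffee with the same density and specific heat. Every number in PARAMS is made up for illustration.

```python
import math

# All values are hypothetical, chosen only for illustration.
PARAMS = dict(
    coffee_c=90.0,    # starting coffee temperature, deg C
    milk_c=5.0,       # milk temperature (assumed to stay in the fridge), deg C
    ambient_c=20.0,   # room temperature, deg C
    k_per_min=0.05,   # assumed cooling rate constant, 1/min
    minutes=2.0,      # total time allowed
    coffee_ml=200.0,  # coffee volume
    milk_ml=50.0,     # milk volume
)


def cool(temp_c, ambient_c, k_per_min, minutes):
    """Newton's law of cooling in closed form: exponential decay toward ambient."""
    return ambient_c + (temp_c - ambient_c) * math.exp(-k_per_min * minutes)


def mix(coffee_c, coffee_ml, milk_c, milk_ml):
    """Volume-weighted average, assuming equal density and specific heat."""
    return (coffee_c * coffee_ml + milk_c * milk_ml) / (coffee_ml + milk_ml)


def milk_first(coffee_c, milk_c, ambient_c, k_per_min, minutes, coffee_ml, milk_ml):
    """Option 1: add the milk immediately, then let the mixture sit."""
    mixed = mix(coffee_c, coffee_ml, milk_c, milk_ml)
    return cool(mixed, ambient_c, k_per_min, minutes)


def milk_last(coffee_c, milk_c, ambient_c, k_per_min, minutes, coffee_ml, milk_ml):
    """Option 2: let the black coffee sit, then add the milk at the end."""
    cooled = cool(coffee_c, ambient_c, k_per_min, minutes)
    return mix(cooled, coffee_ml, milk_c, milk_ml)


if __name__ == "__main__":
    print(f"option 1 (milk first): {milk_first(**PARAMS):.2f} C")
    print(f"option 2 (milk last):  {milk_last(**PARAMS):.2f} C")
```

Try changing ambient_c and milk_c relative to each other and see when the ordering flips. The model ignores evaporation and the change in thermal mass after mixing, so treat the output as a sanity check on the reasoning rather than a measurement.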

replies(12): >>43169841 #>>43169877 #>>43169987 #>>43170077 #>>43170102 #>>43171170 #>>43171376 #>>43173074 #>>43174715 #>>43177608 #>>43182847 #>>43186666 #

ur-whale ◴[25 Feb 25 10:15 UTC] No.43170102[source]▶

>>43169765 #

> I like to make up my own tests

You just ruined your own test by publishing it on the internets

replies(1): >>43173323 #

matt-attack ◴[25 Feb 25 15:41 UTC] No.43173323[source]▶

>>43170102 #

Yeah, but he didn’t post the answer.
