2127 points bakugo | 3 comments
anotherpaulg No.43164684
Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

nightpool No.43166387
> 225 coding exercises from Exercism

Has there been any effort taken to reduce data leakage of this test set? Sounds like these exercises were available on the internet pre-2023, so they'll probably be included in the training data for any modern model, no?

jonplackett No.43169765
I like to make up my own tests; that way you know the model is actually thinking.

Tests that require thinking about the physical world are the most revealing.

My new favourite is:

You have 2 minutes to cool down a cup of coffee to the lowest temp you can.

You have two options:

1. Add cold milk immediately, then let it sit for 2 mins.

2. Let it sit for 2 mins, then add cold milk.

Which one cools the coffee to the lowest temperature and why?

Phrased this way, without any help, all but the thinking models get it wrong.
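
For anyone who wants to check the physics, here is a minimal sketch of the two options under Newton's law of cooling. Every number in it (temperatures, masses, the cooling constant) is an illustrative assumption, and it further assumes the milk stays at fridge temperature until poured, that coffee and milk have the same specific heat, and that adding milk does not change the cooling rate. Evaporation is ignored.

```python
import math

# All values below are illustrative assumptions, not measurements.
ROOM_C = 20.0      # ambient temperature
COFFEE_C = 90.0    # starting coffee temperature
MILK_C = 5.0       # milk stays at fridge temperature until it is poured
COFFEE_G = 200.0   # mass of coffee
MILK_G = 50.0      # mass of milk
K_PER_MIN = 0.05   # cooling constant, assumed the same with or without milk


def cool(temp_c: float, minutes: float) -> float:
    """Newton's law of cooling: the gap to room temperature decays exponentially."""
    return ROOM_C + (temp_c - ROOM_C) * math.exp(-K_PER_MIN * minutes)


def add_milk(coffee_c: float) -> float:
    """Mass-weighted average, assuming coffee and milk have equal specific heat."""
    return (COFFEE_G * coffee_c + MILK_G * MILK_C) / (COFFEE_G + MILK_G)


milk_first = cool(add_milk(COFFEE_C), 2.0)  # option 1: add milk, then wait
milk_last = add_milk(cool(COFFEE_C, 2.0))   # option 2: wait, then add milk

print(f"milk first: {milk_first:.2f} C, milk last: {milk_last:.2f} C")
```

In this simplified model, option 2 ends up colder whenever the milk is colder than the room: black coffee is further from room temperature, so it sheds heat faster during the wait, and the milk then removes the same amount of heat regardless of when it goes in.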

1. vintermann No.43174715
I have another easy one which thinking models get wrong:

"Ahnentafel numbers start with you as 1. To find the Ahnentafel number of someone's father, double it. To find the Ahnentafel number of someone's mother, double it and add one.

Men pass on X chromosome DNA to their daughters, but none to their sons. Women pass on X chromosome DNA to both their sons and daughters.

List the Ahnentafel numbers of the closest 20 ancestors a man may have inherited X DNA from."

For smaller models, it's probably fair to change the question to something like: "Could you have inherited X chromosome DNA from your ancestor with Ahnentafel number 33? Does the answer to that question depend on whether you are a man or a woman?" They still fail.
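
The smaller question can be checked mechanically. Below is a rough sketch, using only the two rules from the prompt: halving an Ahnentafel number gives the child, even numbers are fathers, odd numbers above 1 are mothers, and X DNA is blocked on any father-to-son step. The function name `could_inherit_x` is made up for illustration.

```python
def could_inherit_x(ancestor: int, start_is_male: bool) -> bool:
    """Return True if person 1 could carry X DNA from the given ancestor.

    Walks down the line from the ancestor to person 1 and fails as soon
    as a father-to-son step appears, since fathers pass no X to sons.
    """
    parent = ancestor
    while parent > 1:
        child = parent // 2
        parent_is_male = parent % 2 == 0
        # Person 1's sex is given; everyone else's follows from parity.
        child_is_male = start_is_male if child == 1 else child % 2 == 0
        if parent_is_male and child_is_male:
            return False
        parent = child
    return True


for sex, is_male in (("man", True), ("woman", False)):
    print(sex, could_inherit_x(33, is_male))
```

The line from 33 down to 1 runs 33, 16, 8, 4, 2, 1. Number 16 is the father of number 8, who is also a father, so the chain is already broken there and the sex of person 1 never comes into it.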

2. audiodude No.43175107
Yeah I wouldn't call this easy...
3. vintermann No.43180975
You can just do it generation by generation. The only hard part is that you need to combine two concepts, both of which are explained in the prompt. A model which aces math Olympiad problems shouldn't have any trouble with this whatsoever, unless it's overfitting on them somehow.
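
To make "generation by generation" concrete, here is a small sketch. It assumes "closest" means fewest generations back, with ties inside a generation broken by ascending Ahnentafel number (the prompt does not say how to break ties). The function name `x_ancestors` is hypothetical.

```python
def x_ancestors(count: int, start_is_male: bool = True) -> list[int]:
    """List Ahnentafel numbers of possible X-DNA ancestors, nearest first."""
    found: list[int] = []
    generation = [1]  # start with "you"
    while len(found) < count:
        parents: list[int] = []
        for n in generation:
            # Person 1's sex is given; otherwise even = father, odd = mother.
            is_male = start_is_male if n == 1 else n % 2 == 0
            if not is_male:
                parents.append(2 * n)   # a woman may carry X from her father
            parents.append(2 * n + 1)   # anyone may carry X from their mother
        found.extend(parents)
        generation = parents
    return found[:count]


print(x_ancestors(20))
```

For a man, the generations contribute 1, 2, 3, 5 and 8 candidates, which is 19 in total, so the 20th has to come from the following generation of 13. Which one of those counts as "closest" is a matter of the tie-break assumed above.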