2127 points bakugo | 2 comments
anotherpaulg ◴[] No.43164684[source]
Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

replies(18): >>43164827 #>>43165382 #>>43165504 #>>43165555 #>>43165786 #>>43166186 #>>43166253 #>>43166387 #>>43166478 #>>43166688 #>>43166754 #>>43166976 #>>43167970 #>>43170020 #>>43172076 #>>43173004 #>>43173088 #>>43176914 #
nightpool ◴[] No.43166387[source]
> 225 coding exercises from Exercism

Has there been any effort taken to reduce data leakage of this test set? Sounds like these exercises were available on the internet pre-2023, so they'll probably be included in the training data for any modern model, no?

replies(3): >>43168220 #>>43169155 #>>43169765 #
jonplackett ◴[] No.43169765[source]
I like to make up my own tests, that way you know it is actually thinking.

Tests that require thinking about the physical world are the most revealing.

My new favourite is:

You have 2 minutes to cool down a cup of coffee to the lowest temp you can.

You have two options: 1. Add cold milk immediately, then let it sit for 2 mins.

2. Let it sit for 2 mins, then add cold milk.

Which one cools the coffee to the lowest temperature and why?

Phrased this way, without any help, all but the thinking models get it wrong.
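
For anyone who wants to check the two options numerically, here is a minimal sketch. It assumes Newton's law of cooling with the same cooling constant before and after the milk goes in, equal specific heats for coffee and milk, and milk that stays cold (e.g. in the fridge) until it is added. All temperatures, the milk fraction, and the constant `k_per_min` are hypothetical values chosen for illustration, not measurements.

```python
import math


def cool(temp_c, ambient_c, k_per_min, minutes):
    """Newton's law of cooling: the gap to ambient decays exponentially."""
    return ambient_c + (temp_c - ambient_c) * math.exp(-k_per_min * minutes)


def mix(coffee_c, milk_c, milk_fraction):
    """Mass-weighted average temperature, assuming equal specific heats."""
    return (1 - milk_fraction) * coffee_c + milk_fraction * milk_c


def milk_first(coffee_c, milk_c, ambient_c, milk_fraction, k_per_min, minutes):
    """Option 1: add cold milk immediately, then let the mixture sit."""
    mixed = mix(coffee_c, milk_c, milk_fraction)
    return cool(mixed, ambient_c, k_per_min, minutes)


def milk_last(coffee_c, milk_c, ambient_c, milk_fraction, k_per_min, minutes):
    """Option 2: let the black coffee sit, then add cold milk."""
    cooled = cool(coffee_c, ambient_c, k_per_min, minutes)
    return mix(cooled, milk_c, milk_fraction)


# Hypothetical numbers, for illustration only.
PARAMS = dict(
    coffee_c=90.0,      # starting coffee temperature
    milk_c=5.0,         # fridge-cold milk
    ambient_c=20.0,     # room temperature
    milk_fraction=0.2,  # milk as a fraction of the final drink
    k_per_min=0.1,      # assumed cooling constant
    minutes=2.0,
)

if __name__ == "__main__":
    print("milk first:", round(milk_first(**PARAMS), 2))
    print("milk last: ", round(milk_last(**PARAMS), 2))
```

Under these assumptions the difference (option 1 minus option 2) works out to `milk_fraction * (ambient_c - milk_c) * (1 - exp(-k_per_min * minutes))`, so waiting and then adding the milk ends up colder whenever the milk is colder than the room. Hotter coffee loses heat faster, so it pays to keep it hot during the two minutes.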

replies(12): >>43169841 #>>43169877 #>>43169987 #>>43170077 #>>43170102 #>>43171170 #>>43171376 #>>43173074 #>>43174715 #>>43177608 #>>43182847 #>>43186666 #
danbruc ◴[] No.43169877[source]
No need for thinking, that question can be found discussed and explained many times online and has almost certainly been part of the training data.
replies(1): >>43182786 #
1. jonplackett ◴[] No.43182786[source]
The fact that all the models I’ve tried except the thinking ones get it wrong suggests not.

They get caught up in the idea that adding milk first cools it fastest and can’t escape from that.

replies(1): >>43199161 #
2. cristiancavalli ◴[] No.43199161[source]
First page of Google search results from 7 years ago: https://www.quora.com/You-have-2-cups-of-coffee-50-degrees-w...

People making up their own benchmarks for these things has confirmed one thing for me: people are strongly biased toward believing that their thoughts are mostly original. I find that if I have a “good” idea, someone has probably already thought of it and maybe even written about it. Only about 0.01% of the time do I have an idea that one might consider novel, and even that is probably my own bias and overstated. This example just confirms that these models don’t really seem to reason and have a really hard time doing the basic generalization they can with fewer examples.