Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.
Aider 0.75.0 is out with support for 3.7 Sonnet [1].
Thinking support and thinking benchmark results coming soon.
Has there been any effort taken to reduce data leakage of this test set? Sounds like these exercises were available on the internet pre-2023, so they'll probably be included in the training data for any modern model, no?
Could we please get benchmarks for architect / DeepSeek R1 + claude-3-7-20250219 ?
To compare perf and price with Sonnet-3.7-thinking
65% Sonnet 3.7, 32k thinking
64% R1+Sonnet 3.5
62% o1 high
60% Sonnet 3.7, no thinking
60% o3-mini high
57% R1
52% Sonnet 3.5
I've been trying to use Sonnet 3.7 tonight through the Copilot agent, and it's frustrating to see the API 500 halfway through the task list, leaving the project in a half-baked state with no good "auto save" to pick up from again.
When it's ready, I squash merge into main.
Being able to get reset back to the last known good state is awesome. If you turn off auto commit, it’s a lot harder to undo one of the steps that the model takes. It’s only a matter of time until it creates nonsense, so you’ll really want the ability to roll it back.
Just work in a branch and you can merge all commits if you want at the end.
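The branch-plus-squash workflow described above can be sketched in a few git commands. A hypothetical sketch (the branch name and commit messages are made up):

```shell
# Work on a scratch branch so each auto-commit is cheap to undo.
git checkout -b aider-scratch

# ... run aider; every model step lands as its own commit ...

# Roll back the last model step if it produced nonsense:
git reset --hard HEAD~1

# When the result is ready, squash-merge it into main as one commit:
git checkout main
git merge --squash aider-scratch
git commit -m "Add feature via aider"
```

With auto-commit on, each step the model takes is an individual commit, so `git reset --hard HEAD~1` is the "undo" the parent comment is describing.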
The Exercism problems have proven to be very effective at measuring an LLM's ability to modify existing code. I receive a lot of feedback that the aider benchmarks correlate strongly with people's "vibes" on model coding skill. I agree. The scores have felt quite aligned with my hands-on experience coding with most of the top models over the last 18+ months.
To be clear, the purpose of the benchmark is to help me quantitatively assess and improve aider and make it more effective. But it's also turned out to be a great way to measure the coding skill of LLMs.
edit: would be interesting to see how combo DeepSeek R1 + claude-3-7 performs.
I'll be trying out Sonnet 3.7 extended thinking + Sonnet 3.5 or Flash 2.0, which I assume would be at the top of the leaderboard.
Unless Anthropic is under high usage (I just watch the incident reports), I one-shot features regularly.
At a high skill level. Not front end. Back end c# in a small but great framework that has poor documentation. Not just endpoints but full on task queues.
So really, it's a context problem. You're just not laser focusing your context.
Try this:
Set up a context with the exact files needed. Sure, AI "should" do that, but it doesn't. Especially not Cursor or Cline. Then try.
Hell, try it with clood after I update it with 3.7. I bet you, if you clood-file it, then you get one-shots.
I have a long history of clood being a commit in my projects, and it's a clood one-shot.
It's a scenario where the result is greater than the sum of its parts.
But to each their own!
Overfitting is one of the fundamental issues to contend with when trying to figure out whether any model at all is useful. If your leaderboard corresponds to vibes and vibes are your target, you could just have a vibes leaderboard.
If the resulting code is not trying to be excessively clever or creative this is actually a good thing in my book.
The novelty and creativity should come from the product itself, especially from the users/customers perspective. Some people are too attached to LLM leaderboards being about novelty. I want reliable results whenever I give the instructions, either be the code, or the specs built into a spec file after throwing some ideas into prompts.
Tests that require thinking about the physical world are the most revealing.
My new favourite is:
You have 2 minutes to cool down a cup of coffee to the lowest temp you can.
You have two options: 1. Add cold milk immediately, then let it sit for 2 mins.
2. Let it sit for 2 mins, then add cold milk.
Which one cools the coffee to the lowest temperature and why?
Phrased this way, without any help, all but the thinking models get it wrong.
ChatGPT is already my default first place to check something, where it was Google for the previous 20+ years.
I think a better test of thinking is to provide detail about something so mundane and esoteric that no one would have ever thought to communicate it to other people for entertainment, and then ask it a question about that pile of boring details.
EDIT: By the more obvious one, I mean letting it cool and then adding milk. As the temperature difference between the coffee and the surrounding air is higher, the coffee cools down faster. Is this wrong?
I C&P'd it into Claude 3.7 with thinking, and it gave the correct answer (which I'm pretty sure is #2). Including the CoT, where it actually does math (which I haven't checked), and final response.
# THINKING
Let's analyze the two options.
Option 1: Add cold milk immediately, then let it sit for 2 mins.
Option 2: Let it sit for 2 mins, then add cold milk.
To determine which option leads to the lowest final temperature, I need to consider the physics of cooling.
First, let's understand how cooling works in these scenarios:
1. When a hot liquid sits, it cools according to Newton's Law of Cooling, which states that the rate of heat loss is proportional to the temperature difference between the object and its surroundings.
- The rate of cooling is faster when the temperature difference is greater
- As the liquid cools, the rate of cooling decreases
2. When we add cold milk to hot coffee, we get an immediate temperature drop due to mixing liquids of different temperatures. The final temperature after mixing can be calculated using the principle of thermal equilibrium.
Now, let's analyze each option:
Option 1: Add cold milk immediately, then let it sit for 2 mins
- The coffee temperature drops immediately when milk is added
- Then this coffee+milk mixture cools for 2 minutes
- The cooling rate will be based on the temperature difference between the coffee+milk mixture and room temperature
Option 2: Let it sit for 2 mins, then add cold milk
- The black coffee cools for 2 minutes
- Then milk is added, causing an immediate drop in temperature
- The cooling rate during the 2 minutes will be based on the temperature difference between the black coffee and room temperature
The key insight is that hot coffee cools faster than warm coffee because the rate of cooling is proportional to the temperature difference. Since black coffee starts at a higher temperature than coffee with milk, it will lose more heat energy during the 2-minute cooling period.
Let's make this more concrete with some theoretical numbers:
- Let's say black coffee starts at 90°C
- Room temperature is 20°C
- Cold milk is at 5°C and makes up 10% of the final volume
Option 1:
- Adding milk immediately might bring the temperature to about 81.5°C (90 × 0.9 + 5 × 0.1)
- This mixture then cools for 2 minutes from 81.5°C to some lower temperature, let's call it T1
Option 2:
- The black coffee cools for 2 minutes from 90°C to some temperature, let's call it T2
- Then we add milk, which further reduces the temperature to 0.9 × T2 + 0.1 × 5
The question is: which is lower, T1 or 0.9 × T2 + 0.1 × 5?
According to Newton's Law of Cooling, the black coffee in Option 2 will lose more heat during the 2 minutes because it starts at a higher temperature and thus has a greater temperature difference with the surroundings. This means the drop from 90°C to T2 will be greater than the drop from 81.5°C to T1.
Let's think about this mathematically:
According to Newton's Law of Cooling: T(t) = Ts + (T0 - Ts) × e^(-kt) Where:
- T(t) is the temperature at time t
- Ts is the surrounding temperature
- T0 is the initial temperature
- k is the cooling constant
For Option 1:
T1 = 20 + (81.5 - 20) × e^(-k × 2) = 20 + 61.5 × e^(-2k)
For Option 2, before adding milk:
T2 = 20 + (90 - 20) × e^(-k × 2) = 20 + 70 × e^(-2k)
After adding milk in Option 2, the temperature would be:
0.9 × T2 + 0.1 × 5 = 0.9 × (20 + 70 × e^(-2k)) + 0.5
= 18 + 63 × e^(-2k) + 0.5
= 18.5 + 63 × e^(-2k)
Now we compare:
T1 = 20 + 61.5 × e^(-2k)
Final temp in Option 2 = 18.5 + 63 × e^(-2k)
For any positive value of k (which it must be for cooling), e^(-2k) is between 0 and 1.
If e^(-2k) = 1 (no cooling at all):
T1 = 20 + 61.5 = 81.5
Final temp in Option 2 = 18.5 + 63 = 81.5 (They would be the same)
If e^(-2k) = 0 (instant cooling to room temperature):
T1 = 20
Final temp in Option 2 = 18.5 (Option 2 would be better)
For values in between, we need to compare 20 + 61.5 × e^(-2k) vs 18.5 + 63 × e^(-2k)
Subtracting: (T1 - Final temp in Option 2) = 1.5 - 1.5 × e^(-2k)
This will be positive (meaning T1 is higher) when e^(-2k) < 1, which is true for all positive k.
So, mathematically, Option 2 (letting it sit, then adding milk) results in a lower final temperature.
This makes physical sense because:
1. The hotter black coffee loses more heat during the 2 minutes than the cooler coffee+milk mixture would
2. The cold milk provides the same cooling effect whether added at the beginning or end
Therefore, Option 2 (letting the coffee sit for 2 minutes, then adding cold milk) will result in the lowest final temperature.
# ANSWER
# Coffee Cooling Analysis
## Option 2 wins: Let it sit for 2 mins, then add cold milk
This will give you the lowest final temperature, and here's why:
### The physics behind it:
1. *Newton's Law of Cooling*: The rate at which something cools is proportional to the temperature difference between the object and its surroundings.
2. *Key insight*: Hotter objects cool faster than cooler ones because the temperature gap with the environment is larger.
### Comparing the options:
*Option 1 (Add milk first, then wait):*
- Milk immediately lowers coffee temperature
- The coffee+milk mixture cools for 2 minutes, but at a slower rate because it's already cooler

*Option 2 (Wait 2 mins, then add milk):*
- Black coffee starts hotter and cools faster during the 2 minutes
- More total heat is lost during the waiting period
- Then milk provides the same temperature drop at the end
The black coffee's higher initial temperature difference with the room causes it to lose more total heat energy during the waiting period, making Option 2 the more effective cooling strategy.
It is the only model I gave this and actually approached it by writing math. Usually I am not that impressed with reasoning models, but this was quite fun to watch.
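The model's derivation can be checked numerically. A minimal sketch, assuming an arbitrary cooling constant k = 0.1 per minute and reusing the transcript's figures (90°C coffee, 20°C room, 10% milk at 5°C):

```python
import math

ROOM, COFFEE, MILK = 20.0, 90.0, 5.0   # temperatures in deg C, from the transcript
MILK_FRAC = 0.1                        # milk is 10% of the final volume
K = 0.1                                # assumed cooling constant, per minute

def cool(t0: float, minutes: float) -> float:
    """Newton's Law of Cooling: T(t) = Ts + (T0 - Ts) * e^(-kt)."""
    return ROOM + (t0 - ROOM) * math.exp(-K * minutes)

def mix(coffee_temp: float) -> float:
    """Thermal equilibrium after stirring in the milk."""
    return (1 - MILK_FRAC) * coffee_temp + MILK_FRAC * MILK

option1 = cool(mix(COFFEE), 2)   # add milk first, then sit 2 minutes
option2 = mix(cool(COFFEE, 2))   # sit 2 minutes, then add milk

print(f"Option 1: {option1:.2f} C, Option 2: {option2:.2f} C")
```

With these numbers Option 2 ends about a quarter of a degree cooler, and the gap equals the 1.5 × (1 − e^(−2k)) term from the transcript's subtraction step, for any positive k.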
The Aider Polyglot website also states that the benchmark "...asks the LLM to edit source files to complete 225 coding exercises".
However, looking at the actual tests [0], it doesn't seem to be about editing code bases; it's rather just solving simple programming exercises. What am I missing?
Think pouring water from the faucet into a sink with open drain - if you have high enough flow rate, you can fill the sink faster than it drains. Then, when you turn the faucet off, as the sink is draining, you can still collect plenty of water from it with a cup or a bucket, before the sink fully drains.
The best answer though is to put the coffee on a plate, and forget about the milk.
It might correlate to design level thinking but it also might not.
Unfortunately I only noticed it after the window to edit the comment was closed. If the first person to downvote me had instead suggested I use a pastebin, I might have been able to make the conversation more agreeable to people.
https://aider.chat/docs/llms/anthropic.html#thinking-tokens
I'll make this less clunky soon.
The point I wonder about is the sustainability of every query being 30+ requests. Site owners aren't ready to have 98% of their requests be non-monetizable bot traffic. However, sites that have something to sell are..
Sure, in a hypothetical market where, before trying to extract profits, most participants aren't losing money on below-profitable prices to keep mindshare. But you'd need a breakthrough around which a participant had some kind of a moat to get there, even temporarily, in the LLM market.
The infrastructure side of things, tens of billions and probably hundreds of billions going in, may not be fantastic for investors. The return on capital should approach cost of capital if someone does their job correctly. Add in government investment and subsidies (in China, the EU, the United States) and it becomes extremely difficult to make those calculations. In the long term, I don't think the AI infrastructure will be overbuilt (datacenters, fabs), but like the telecom bubble, it is easy to end up in a position where there is a lot of excess capacity and the way you made your bet means getting wiped out.
Of course if you aren't the investor and it isn't your capital, then there is a tremendous amount of money to be made because you have nothing to lose. I've been around a long time, and this is the closest thing I've felt to that inflection point where the web took off.
It's not like the web suddenly was just there, it came slow at first, then everywhere at once, the money came even later.
> synthetic content
I haven't heard this phrase. Thank you, I'll certainly be using it.

"Ahnentafel numbers start with you as 1. To find the Ahnentafel number of someone's father, double it. To find the Ahnentafel number of someone's mother, double it and add one.
Men pass on X chromosome DNA to their daughters, but none to their sons. Women pass on X chromosome DNA to both their sons and daughters.
List the Ahnentafel numbers of the closest 20 ancestors a man may have inherited X DNA from."
For smaller models, it's probably fair to change the question to something like: "Could you have inherited X chromosome DNA from your ancestor with Ahnentafel number 33? Does the answer to that question depend on whether you are a man or a woman?" They still fail.
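The puzzle above can be brute-forced in a few lines. A sketch, assuming "closest" means smallest Ahnentafel numbers and using the stated rules (men pass X only to daughters; women pass X to all children); the function name is mine:

```python
from heapq import heappush, heappop

def x_ancestors_of_man(count: int = 20) -> list[int]:
    """Ahnentafel numbers of ancestors a man may have inherited X DNA from,
    smallest (closest) first."""
    # A man's X comes only from his mother, Ahnentafel 3.
    # An even-numbered ancestor (male) got his X only from his mother (2n+1);
    # an odd-numbered ancestor (female) got X DNA from both parents (2n, 2n+1).
    result, heap, seen = [], [3], {3}
    while heap and len(result) < count:
        n = heappop(heap)
        result.append(n)
        parents = [2 * n + 1] if n % 2 == 0 else [2 * n, 2 * n + 1]
        for p in parents:
            if p not in seen:
                seen.add(p)
                heappush(heap, p)
    return result

print(x_ancestors_of_man())  # starts 3, 6, 7, 13, 14, 15, ...
```

This also answers the follow-up question: 33 never appears in a man's list (her descendants on the path to you are an unbroken male line through your father), so the answer does depend on your sex.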
It is - what? - a fifth anniversary of "the world will be a completely different place in 6 months due to AI advancement"?
"Sam Altman believes AI will change the world" - of course he does, what else is he supposed to say?
Originally electric generators merely replaced steam generators but had no additional productivity gains, this only changed when they changed the rest of the processes around it.
At some point fairly recently, we passed the point at which things that took longer than anyone thought they would take are happening faster than anyone thought they would happen.
/i
I was at the AI Engineering Summit in NYC last week and met an (extremely senior) staff ai engineer doing somewhat unbelievable things with aider. Shocking things tbh.
Is there a good way to share stories about real-world aider projects like this with you directly (if I can get approval from him)? Not sure posting on public forum is appropriate but I think you would be really interested to hear how people are using this tool at the edge.
For small models and when attention is "taken up", these sorts of questions really send a model for a loop. Agreed - especially noticeable with small reasoning models.
Flash 2.0 got 100% on the edit format, and it's extremely cheap, so I'm pretty curious how that would score.
Think of having a secretary, or ten. These secretaries are not as good as an average human at most tasks, but they're good enough for tasks that are easy to double check. You can give them an immense amount of drudgery that would burn out a human.
If you're generating immense amounts of really basic make work, that seems like you're managing your time poorly.
The graph on that page explains it much better than any of the text explanations can.
https://en.m.wikipedia.org/wiki/Synthetic_media
I believe your message got downvoted not because it is synthetic, but because it is wrong or not of interest.
They get caught up in the idea that adding milk first cools it fastest and can’t escape from that
The bigger the temp difference the more cooling. So by putting the milk in right away you make the temp difference between the surroundings and the coffee smaller = less cooling over your 2 mins.
I like puzzles with a satisfying answer
It was meant to be an example of thinking being better…
don't worry it's "all but", so you are fine.
I notice this on HN more than places like reddit and Mastodon.
I think it's a benefit when writing requirements and programming to specs. It's less of a benefit when having a freeform discussion.
LLMs might enable some completely new things to be automated that made no sense to automate before, even if it’s necessary to error correct with humans / computers.
I use LLMs 20-30 times a day and while it feels invaluable for personal use where I can interpret the responses at my own discretion, they still hallucinate enough and have enough lapses in logic where I would never feel confident incorporating them into some critical system.
https://www.visualcapitalist.com/ranked-ai-models-with-the-l...
99% of systems aren't critical and human validation is sufficient. My own use case, it is enough to replace plenty of hours of human labour.
Using them to replace core competencies will probably remain forbidden by professional ethics (writing court documents, diagnosing patients, building bridges). However, there are ways for LLMs to assist people without doing their jobs for them.
Law firms are already using LLMs to deal with large amounts of discovery materials. Doctors and researchers probably use it to summarize papers they want to be familiar with but don't have the energy to read themselves. Engineers might eventually be able to use AI to do a rough design, then do all the regulatory and finite element analysis necessary to prove that it's up to code, just like they'd have to do anyway.
I don't have a high-level LLM subscription, but I think with the right tooling, even existing LLMs might already be pretty good at managing schedules and providing reminders.
People making up their own benchmarks for these things has confirmed one thing for me: The bias that people think they mostly have original thoughts is extremely strong. I find if I have a “good” idea someone has probably already thought of it as well and maybe even written about it. About 0.01% of the time do I have an idea that one may consider novel and even that’s probably my own bias and overstated. This example just confirms that these models don’t really seem to reason and have a really hard time doing the basic generalization they can with fewer examples.
The other, even more subtle, point is that temperature (unlike pressure) doesn't equilibrate instantaneously, as is assumed in the second option. It is a diffusive process which follows a complicated parabolic partial differential equation, whose solution will determine the final temp of the tea + milk + partial adiabatic convection cooling of the cup. It took Fourier a detour through complex numbers to figure it out.
I find all LLMs do quite well when presented with well-posed engineering/physics/chemistry problems with well-defined parameters, including relativistic or second-order effects. The confusion/hallucination is mostly imparted on them by the user's lack of sophistication in articulation, or by computational limitations.
Also you can definitely argue HN forums are very very high quality training corpus :)