
2127 points bakugo | 2 comments
anotherpaulg (No.43164684)
Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

anotherpaulg (No.43166754)
Using up to 32k thinking tokens, Sonnet 3.7 set SOTA with a 64.9% score.

  65% Sonnet 3.7, 32k thinking
  64% R1+Sonnet 3.5
  62% o1 high
  60% Sonnet 3.7, no thinking
  60% o3-mini high
  57% R1
  52% Sonnet 3.5
mikae1 (No.43168852)
It's clear that progress is incremental at this point. At the same time, Anthropic and OpenAI are bleeding money.

It's unclear to me how they'll shift to making money while providing almost no added value over the previous generation.

khafra (No.43168989)
Yudkowsky just mentioned that even if LLM progress stopped right here, right now, there are enough fundamental economic changes to provide us a really weird decade. Even with no moat, if the labs are in any way placed to capture a little of the value they've created, they could make high multiples of their investors' money.
weatherlite (No.43171064)
Like what economic changes? You can make a case that people are 10% more productive in very specific fields (programming, perhaps consultancy, etc.). That's not really an earthquake; the internet/web was probably way more significant.
harshreality (No.43180029)
It's a force multiplier.

Think of having a secretary, or ten. These secretaries are not as good as an average human at most tasks, but they're good enough for tasks that are easy to double-check. You can give them an immense amount of drudgery that would burn out a human.

habinero (No.43181852)
What drudgery, though? Secretaries don't do a lot of drudgery. And a good one will see tasks that need doing that you didn't specify.

If you're generating immense amounts of really basic make-work, that seems like you're managing your time poorly.

harshreality (No.43197683)
As one example, LLMs are great at summarizing, and at writing or brainstorming outlines. They won't display world-class creativity, but as long as they're not hallucinating, their output is quite usable.

Using them to replace core competencies will probably remain forbidden by professional ethics (writing court documents, diagnosing patients, building bridges). However, there are ways for LLMs to assist people without doing their jobs for them.

Law firms are already using LLMs to deal with large amounts of discovery materials. Doctors and researchers probably use them to summarize papers they want to be familiar with but don't have the energy to read themselves. Engineers might eventually be able to use AI to do a rough design, then do all the regulatory and finite element analysis necessary to prove that it's up to code, just like they'd have to do anyway.
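The summarization workflows above usually boil down to a chunk-then-summarize ("map-reduce") loop, since long documents exceed a model's context window. A minimal sketch in Python, with a hypothetical `llm_summarize` placeholder standing in for a real model API call (no particular provider is implied by the thread):

```python
# Sketch of a map-reduce summarization pipeline: split a long document
# into chunks, summarize each, then summarize the concatenated summaries.
# llm_summarize is a hypothetical stand-in; a real system would call a
# model API here instead of truncating.

def chunk(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into roughly max_chars-sized pieces on paragraph breaks."""
    paras = text.split("\n\n")
    chunks, current = [], ""
    for p in paras:
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = ""
        current += p + "\n\n"
    if current:
        chunks.append(current)
    return chunks

def llm_summarize(text: str) -> str:
    # Placeholder: a real implementation would send `text` to a model
    # with a "summarize this" prompt and return its completion.
    return text[:100]

def summarize_document(text: str) -> str:
    partial = [llm_summarize(c) for c in chunk(text)]
    return llm_summarize("\n".join(partial))
```

For very large discovery corpora the same reduce step is applied recursively until the combined summaries fit in one context window.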

I don't have a high-level LLM subscription, but I think with the right tooling, even existing LLMs might already be pretty good at managing schedules and providing reminders.
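The "right tooling" for schedules and reminders is typically just a set of plain functions exposed to the model via tool use: the LLM translates "remind me to file the brief Friday at 9" into a structured call, and the host app does the bookkeeping. A minimal, hypothetical reminder store (names and shape are illustrative, not any vendor's API) might look like:

```python
from datetime import datetime

# Hypothetical reminder store an LLM could drive through tool calls.
# The model would emit add_reminder(...) calls with ISO-8601 timestamps;
# the host app would poll due_reminders(...) on a timer and notify the user.

reminders: list[dict] = []

def add_reminder(text: str, when_iso: str) -> dict:
    """Register a reminder; `when_iso` is an ISO-8601 timestamp."""
    r = {"text": text, "when": datetime.fromisoformat(when_iso)}
    reminders.append(r)
    return r

def due_reminders(now_iso: str) -> list[str]:
    """Return the text of every reminder due at or before `now_iso`."""
    now = datetime.fromisoformat(now_iso)
    return [r["text"] for r in reminders if r["when"] <= now]
```

The hard part the LLM contributes is the natural-language-to-timestamp translation; the storage and polling are deliberately boring.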