(www.anthropic.com)

2127 points bakugo | 1 comments | 24 Feb 25 18:28 UTC | HN request time: 0.278s | source

Show context

anotherpaulg ◴[24 Feb 25 20:40 UTC] No.43164684[source]▶

>>43163011 (OP) #

Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

replies(18): >>43164827 #>>43165382 #>>43165504 #>>43165555 #>>43165786 #>>43166186 #>>43166253 #>>43166387 #>>43166478 #>>43166688 #>>43166754 #>>43166976 #>>43167970 #>>43170020 #>>43172076 #>>43173004 #>>43173088 #>>43176914 #

anotherpaulg ◴[25 Feb 25 00:46 UTC] No.43166754[source]▶

>>43164684 #

Using up to 32k thinking tokens, Sonnet 3.7 set SOTA with a 64.9% score.

  65% Sonnet 3.7, 32k thinking
  64% R1+Sonnet 3.5
  62% o1 high
  60% Sonnet 3.7, no thinking
  60% o3-mini high
  57% R1
  52% Sonnet 3.5

replies(4): >>43167134 #>>43168719 #>>43168852 #>>43169016 #

mikae1 ◴[25 Feb 25 06:31 UTC] No.43168852[source]▶

>>43166754 #

It's clear that progress is incremental at this point. At the same time Anthropic and OpenAI are bleeding money.

It's unclear to me how they'll shift to making money while providing almost no enhanced value.

replies(1): >>43168989 #

khafra ◴[25 Feb 25 06:52 UTC] No.43168989[source]▶

>>43168852 #

Yudkowsky just mentioned that even if LLM progress stopped right here, right now, there are enough fundamental economic changes to provide us a really weird decade. Even with no moat, if the labs are in any way placed to capture a little of the value they've created, they could make high multiples of their investors' money.

replies(5): >>43169795 #>>43169803 #>>43170002 #>>43171064 #>>43175528 #

weatherlite ◴[25 Feb 25 12:33 UTC] No.43171064[source]▶

>>43168989 #

Like what economic changes? You can make a case people are 10% more productive in very specific fields (programming, perhaps consultancy etc). That's not really an earthquake, the internet/web was probably way more significant.

replies(3): >>43173649 #>>43173863 #>>43180029 #

Seanambers ◴[25 Feb 25 16:22 UTC] No.43173863[source]▶

>>43171064 #

LLMs are fundamentally a new paradigm, it just isn't distributed yet.

It's not like the web suddenly was just there, it came slow at first, then everywhere at once, the money came even later.

replies(2): >>43174832 #>>43187422 #

weatherlite ◴[25 Feb 25 17:30 UTC] No.43174832[source]▶

>>43173863 #

The LLMs are quite widely distributed already, they're just not that impactful. My wife is an accountant at a big 4 and they're all using them (everyone on Microsoft Office is probably using them, which is a lot of people). It's just not the earth shattering tech change CEOS make it to be , at least not yet. We need order of mangitude improvements in things like reliability, factuality and memory for the real economic efficiencies to come and its unclear to me when that's gonna happen.

replies(1): >>43175714 #

KoolKat23 ◴[25 Feb 25 18:47 UTC] No.43175714[source]▶

>>43174832 #

Not necessarily, workflows just need to be adapted to work with it rather than it working in existing workflows. It's something that happens during each industrial revolution.

Originally electric generators merely replaced steam generators but had no additional productivity gains, this only changed when they changed the rest of the processes around it.

replies(1): >>43181666 #

weatherlite ◴[26 Feb 25 07:53 UTC] No.43181666[source]▶

>>43175714 #

I don't get this. What workflow can have occasional catastrophic lapses of reasoning, non factuality, no memory and hallucinations etc? Even in things like customer support this is a no go imo. As long as these very major problems aren't improved (by a lot) the tools will remain very limited.

replies(3): >>43183442 #>>43189359 #>>43189402 #

1. andreasmetsala ◴[26 Feb 25 23:12 UTC] No.43189359[source]▶

>>43181666 #

> What workflow can have occasional catastrophic lapses of reasoning, non factuality, no memory and hallucinations etc?

LLMs might enable some completely new things to be automated that made no sense to automate before, even if it’s necessary to error correct with humans / computers.

↑

Claude 3.7 Sonnet and Claude Code