
2127 points bakugo | 10 comments
1. azinman2 ◴[] No.43163378[source]
To me the biggest surprise was seeing Grok dominate in all of their published benchmarks. I haven't seen any independent benchmarks of it yet (and I take the published ones with a giant heap of salt), but it's still interesting nevertheless.

I’m rooting for Anthropic.

replies(4): >>43163397 #>>43163430 #>>43163485 #>>43163938 #
2. pertymcpert ◴[] No.43163397[source]
Indeed. I wonder what the architecture for Claude and Grok 3 is. If they're still dense models, the MoE excitement around R1 was a tad premature...
3. phillipcarter ◴[] No.43163430[source]
Neither a statement for or against Grok or Anthropic:

I've now just taken to seeing benchmarks as pretty lines or bars on a chart that are in no way reflective of actual ability for my use cases. Claude has consistently scored lower on some benchmarks, but when I use it in a real-world codebase, it's consistently been the only one that doesn't veer off course or "feel wrong". The others do. I can't quantify it, but that's how it goes.

replies(1): >>43163491 #
4. viccis ◴[] No.43163485[source]
Yeah, putting it on the opposite side of that comparison chart was a sleazy but likely effective move.
5. vessenes ◴[] No.43163491[source]
o1 pro is excellent at figuring out complex stuff that Claude misses. It's my go-to mid-level debug assistant when Claude spins.
replies(3): >>43167331 #>>43169432 #>>43173437 #
6. koakuma-chan ◴[] No.43163938[source]
Grok does the most thinking out of all models I tried (it can think for 2+ minutes), and that's why it is so good, though I haven't tried Claude 3.7 yet.
7. maeil ◴[] No.43167331{3}[source]
I've found the same, but find o3-mini just as good for that. Sonnet is far better as a general model, but when it's an open-ended technical question that isn't just about code, o3-mini figures it out while Sonnet sometimes doesn't. In those cases o3-mini is less inclined to go with the most "obvious" answer when it's wrong.
8. OsrsNeedsf2P ◴[] No.43169432{3}[source]
I have never, in frontend, backend, or Android work, had o1 pro solve a problem Claude 3.5 could not, and I've probably tried it close to 20 times now.
replies(1): >>43173100 #
9. mrcwinn ◴[] No.43173100{4}[source]
What's really the value of a bunch of random anecdotes on HN? In any case, I've absolutely had the experience of 3.5 falling on its face when handling a very complex coding task, and o1 pro nailing it perfectly.

Excited to try 3.7 with reasoning more, but so far it seems like a modest, welcome upgrade, not any sort of leapfrog past o1 pro.

10. airstrike ◴[] No.43173437{3}[source]
I've never had o1 figure out something that Claude 3.5 Sonnet couldn't. I can only imagine the gap has widened with 3.7.