2127 points bakugo | 1 comment | | HN request time: 0.209s | source
azinman2 ◴[] No.43163378[source]
To me the biggest surprise was seeing Grok dominate in all of their published benchmarks. I haven't seen any independent benchmarks of it yet (and I take self-published ones with a giant heap of salt), but it's still interesting nevertheless.

I’m rooting for Anthropic.

replies(4): >>43163397 #>>43163430 #>>43163485 #>>43163938 #
phillipcarter ◴[] No.43163430[source]
Neither a statement for nor against Grok or Anthropic:

I've now just taken to seeing benchmarks as pretty lines or bars on a chart that are in no way reflective of actual ability for my use cases. Claude has consistently scored lower on some benchmarks, but when I use it in a real-world codebase, it's consistently been the only one that doesn't veer off course or "feel wrong". The others do. I can't quantify it, but that's how it goes.

replies(1): >>43163491 #
vessenes ◴[] No.43163491[source]
o1 pro is excellent at figuring out complex stuff that Claude misses. It's my go-to mid-level debug assistant when Claude spins.
replies(3): >>43167331 #>>43169432 #>>43173437 #
airstrike ◴[] No.43173437[source]
I've never had o1 figure something out that Claude 3.5 Sonnet couldn't. I can only imagine the gap has widened with 3.7.