2127 points bakugo | 2 comments
azinman2 ◴[] No.43163378[source]
To me the biggest surprise was seeing Grok dominate in all of their published benchmarks. I haven’t seen any benchmarks of it yet (which I take with a giant heap of salt), but it’s still interesting nevertheless.

I’m rooting for Anthropic.

replies(4): >>43163397 #>>43163430 #>>43163485 #>>43163938 #
phillipcarter ◴[] No.43163430[source]
Neither a statement for or against Grok or Anthropic:

I've now just taken to seeing benchmarks as pretty lines or bars on a chart that are in no way reflective of actual ability for my use cases. Claude has consistently scored lower on some benchmarks for me, but when I use it in a real-world codebase, it's consistently been the only one that doesn't veer off course or "feel wrong". The others do. I can't quantify it, but that's how it goes.

replies(1): >>43163491 #
vessenes ◴[] No.43163491[source]
O1 pro is excellent at figuring out complex stuff that Claude misses. It’s my go-to mid-level debug assistant when Claude spins.
replies(3): >>43167331 #>>43169432 #>>43173437 #
1. OsrsNeedsf2P ◴[] No.43169432[source]
I have never, in frontend, backend, or Android, had o1 pro solve a problem Claude 3.5 could not. I've probably tried it close to 20 times now as well.
replies(1): >>43173100 #
2. mrcwinn ◴[] No.43173100[source]
What's really the value of a bunch of random anecdotes on HN? But in any case, I've absolutely had the experience of 3.5 falling on its face when handling a very complex coding task, and o1 pro nailing it perfectly.

Excited to try 3.7 with reasoning more, but so far it seems like a modest, welcome upgrade rather than any sort of leapfrog past o1 pro.