
2127 points bakugo | 3 comments
jumploops No.43163548
> "[..] in developing our reasoning models, we’ve optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect how businesses actually use LLMs.”

This is good news. OpenAI seems to be aiming for "the smartest model," but in practice, LLMs are used primarily as learning aids, data transformers, and code writers.

Balancing "intelligence" with "get shit done" seems to be the sweet spot, and afaict one of the reasons the current crop of developer tools (Cursor, Windsurf, etc.) prefer Claude 3.5 Sonnet over 4o.

replies(4): >>43163694 #>>43164052 #>>43164203 #>>43164889 #
eschluntz No.43164203
Thanks! We all dogfood Claude every day to do our own work here, and solving our own pain points is more exciting to us than abstract benchmarks.

Getting things done requires a lot of book smarts, but also a lot of "street smarts": knowing when to answer quickly, when to double back, etc.

replies(2): >>43164322 #>>43164660 #
1. jasonjmcghee No.43164660
Just want to say nice job and keep it up. Thrilled to start playing with 3.7.

In general, benchmarks seem to be very misleading in my experience, and I still prefer Sonnet 3.5 for _nearly_ every use case, except massive text tasks, for which I use Gemini 2.0 Pro with its 2M-token context window.

replies(2): >>43164703 #>>43165434 #
2. martinald No.43164703
I find the webdev arena tends to match my experience with models much more closely than other benchmarks: https://web.lmarena.ai/leaderboard. Excited to see how 3.7 performs!
3. jasonjmcghee No.43165434
An update: "code" is very good. Just did a ~4-hour task in about an hour. It cost $3, which is more than I usually spend in an hour, but very worth it.