
Devstral

(mistral.ai)
701 points by mfiguiere | 1 comment
jwr No.44059039
My experience with LLMs suggests that benchmark numbers are increasingly detached from reality, at least my reality.

I tested this model with several of my Clojure problems and it is significantly worse than qwen3:30b-a3b-q4_K_M.

I don't know what to make of this. I don't trust benchmarks much anymore.

replies(1): >>44059264 #
NitpickLawyer No.44059264
How did you test it? Note that this is not a regular coding model (i.e. "write a function that does x"). It's a model post-trained specifically on an agent scaffold (OpenHands, formerly OpenDevin). Their main focus was enabling "agentic" flows with tool use, where you give the model a broad task (say, a git ticket) and it starts with search_repo() or read_docs(), followed by read_file() in your repo, then edit_file(), then run_tests(), and so on. It's intended to solve those kinds of problems first; see the sketch of such a loop below. They suggest using it with OpenHands for best results.
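To make the flow concrete, here is a minimal sketch of that kind of agentic loop. The tool names (search_repo, read_file, edit_file, run_tests) just mirror the ones named above and are hypothetical stubs; a real harness like OpenHands defines its own tool schema, and the model call is stubbed so the sketch runs end to end:

    # Minimal sketch of the agentic loop described above. Tool names are
    # hypothetical stubs for illustration; a real scaffold (e.g. OpenHands)
    # defines its own tools and actually calls the LLM.
    import json
    from typing import Callable

    TOOLS: dict[str, Callable[[str], str]] = {
        "search_repo": lambda q: f"files matching '{q}': src/core.clj",
        "read_file":   lambda path: f"(contents of {path})",
        "edit_file":   lambda patch: "patch applied",
        "run_tests":   lambda _: "2 passed, 0 failed",
    }

    def call_model(history: list[dict]) -> dict:
        """Placeholder for the LLM call; a real agent sends `history` to the
        model and parses a tool call out of its response."""
        # Stubbed response so the sketch is runnable.
        return {"tool": "run_tests", "arg": "", "done": True}

    def run_agent(task: str, max_steps: int = 10) -> list[dict]:
        history = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            action = call_model(history)            # model picks the next tool
            result = TOOLS[action["tool"]](action["arg"])
            history.append({"role": "tool", "content": result})
            if action.get("done"):                  # model says the task is solved
                break
        return history

    if __name__ == "__main__":
        print(json.dumps(run_agent("fix failing test in ticket #123"), indent=2))

The point is that the model is graded on choosing the right sequence of tool calls across many turns, not on emitting one good function from one prompt.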

Early reports from Reddit say it also works in Cline, while other, stronger coding models had issues there (they were fine-tuned more toward a step-by-step chat with the user). I think this distinction is important to keep in mind when testing.

replies(3): >>44060980 #>>44064209 #>>44070237 #
tasuki No.44064209
> "write a function that does x"

Which model is optimized for that? This is what I want out of LLMs! And also discussing high-level architecture (without any code) and library discovery, but I guess the general chat models are good for that...
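For that plain single-turn use, any chat-tuned coding model can be driven with an ordinary completions call. A minimal sketch, assuming the model is served locally through an OpenAI-compatible endpoint (Ollama exposes one at /v1); the model tag is the one jwr mentions upthread:

    # Single-turn "write a function that does x" usage, assuming a local
    # OpenAI-compatible server (e.g. Ollama at localhost:11434).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    resp = client.chat.completions.create(
        model="qwen3:30b-a3b-q4_K_M",  # tag from the comment above
        messages=[
            {"role": "user",
             "content": "Write a Clojure function that groups a seq of maps by :id."},
        ],
    )
    print(resp.choices[0].message.content)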