
Devstral

(mistral.ai)
701 points | mfiguiere | 1 comment
jwr ◴[] No.44059039[source]
My experience with LLMs suggests that benchmark numbers are increasingly detached from reality, or at least from my reality.

I tested this model with several of my Clojure problems and it is significantly worse than qwen3:30b-a3b-q4_K_M.

I don't know what to make of this. I don't trust benchmarks much anymore.

replies(1): >>44059264 #
NitpickLawyer ◴[] No.44059264[source]
How did you test this? Note that this is not a regular coding model (i.e. "write a function that does x"). It is fine-tuned specifically on an agent scaffold (OpenHands, formerly OpenDevin), so the main focus is enabling "agentic" flows with tool use: you give the model a broad task (say, a Git ticket) and it starts with search_repo() or read_docs(), followed by read_file() in your repo, then edit_file(), then run_tests(), and so on. It is intended to solve those kinds of problems first. They suggest using it with OpenHands for best results.
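That flow can be sketched as a simple harness loop: the model proposes tool calls and the harness executes them, feeding results back until the task is done. The tool implementations and the scripted "model" below are stubs for illustration only; a real harness like OpenHands would call an actual LLM here.

```python
# Minimal sketch of an agentic coding loop. The model (stubbed below)
# emits tool calls (search_repo, read_file, edit_file, run_tests);
# the harness dispatches them and appends the results to the history.

def search_repo(query):
    return ["src/app.py"]                       # stub: one file matched

def read_file(path):
    return "def add(a, b): return a - b"        # stub: buggy contents

def edit_file(path, new_text):
    return f"wrote {len(new_text)} bytes to {path}"

def run_tests():
    return "1 passed"                           # stub: tests now pass

TOOLS = {"search_repo": search_repo, "read_file": read_file,
         "edit_file": edit_file, "run_tests": run_tests}

def scripted_model(history):
    """Stand-in for the LLM: returns the next tool call, or None when done."""
    steps = [("search_repo", {"query": "add"}),
             ("read_file", {"path": "src/app.py"}),
             ("edit_file", {"path": "src/app.py",
                            "new_text": "def add(a, b): return a + b"}),
             ("run_tests", {})]
    turn = sum(1 for role, _ in history if role == "tool")
    return steps[turn] if turn < len(steps) else None

def agent_loop(task):
    history = [("user", task)]
    while (call := scripted_model(history)) is not None:
        name, args = call
        result = TOOLS[name](**args)
        history.append(("tool", f"{name} -> {result}"))
    return history

if __name__ == "__main__":
    for role, msg in agent_loop("fix the bug in add()"):
        print(role, "|", msg)
```

The point of the fine-tuning is that the model learns to drive loops like this reliably, rather than to produce a single best-effort answer in one chat turn.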

Early reports from Reddit say that it also works in Cline, while other, stronger coding models had issues there (they were fine-tuned more toward a step-by-step chat with a user). I think this distinction is important to keep in mind when testing.

replies(3): >>44060980 #>>44064209 #>>44070237 #
1. desdenova ◴[] No.44060980[source]
I ran a very simple tool-calling test and it was simply unable to call the tool and use the result.

Maybe it's specialized for just a few very specific tools? Is there any documentation on how to actually set it up without requiring some weird external platform?
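For a self-contained test without an external platform, one common approach is to serve the model behind an OpenAI-compatible endpoint (e.g. via vLLM or Ollama) and send a `tools` array in the standard function-calling schema. The sketch below only builds and inspects such a request body; the tool name and endpoint are assumptions for illustration, not anything documented for Devstral specifically.

```python
import json

# Sketch of a minimal tool-calling request in the OpenAI-compatible
# function-calling schema. The example tool (list_files) is hypothetical;
# the payload shape is what a server like vLLM or Ollama expects at
# /v1/chat/completions.
payload = {
    "model": "devstral",
    "messages": [
        {"role": "user", "content": "List the files in the repo root."}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "list_files",
                "description": "List files in a directory of the repo.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string",
                                 "description": "Directory to list."}
                    },
                    "required": ["path"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

body = json.dumps(payload)
print(body[:60], "...")
```

If tool use works, the response's first choice should contain a `tool_calls` entry naming `list_files` with JSON arguments; the caller then runs the tool and sends the result back as a `role: "tool"` message. If the model instead answers in plain prose, that matches the failure described above.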