(hermes4.nousresearch.com)

202 points sibellavia | 1 comments | 27 Aug 25 08:58 UTC

Technical report: https://arxiv.org/pdf/2508.18255

lern_too_spel ◴[29 Aug 25 21:11 UTC] No.45069425[source]▶

The charts are utter nonsense. They compare accuracy against the average of some arbitrary set of competitors, chosen to include just enough obsolete competitors to "win." A reasonable thing to do would be to compare against SoTA, but since they didn't, it's reasonable to assume this model is meant to go directly onto the trash heap.
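To make the objection concrete, here is a minimal sketch of how the choice of baseline changes the verdict. Every name and score below is hypothetical and made up for illustration; none of it comes from the Hermes 4 report or its charts.

```python
from statistics import mean

# Hypothetical accuracy scores in percent. These are invented for
# illustration and are NOT taken from the Hermes 4 report.
baseline_scores = {
    "old_model_a": 40.0,   # obsolete competitor
    "old_model_b": 45.0,   # obsolete competitor
    "current_sota": 80.0,  # strongest competitor
}
new_model_score = 60.0

# Baseline 1: average over the whole competitor set.
avg_baseline = mean(baseline_scores.values())

# Baseline 2: the single best competitor (SoTA).
best_baseline = max(baseline_scores.values())

beats_average = new_model_score > avg_baseline
beats_sota = new_model_score > best_baseline

print(f"average baseline: {avg_baseline:.1f}, beats it: {beats_average}")
print(f"SoTA baseline:    {best_baseline:.1f}, beats it: {beats_sota}")
```

With these invented numbers the same model "wins" against the average and loses against the best competitor, which is why the composition of the comparison set matters.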

1. whymauri ◴[29 Aug 25 21:45 UTC] No.45069769[source]▶

>>45069425 #

The most direct, non-marketing, non-aesthetic summary is that this model trades off a few points on 'fundamental benchmarks' (GPQA, MATH/AIME, MMLU) in exchange for being a 'more steerable' (fewer refusals) scaffold for downstream tuning.

Within that framing, I think it's easier to see where and how the model fits into the larger ecosystem. But, of course, the best benchmark will always be just using the model.

Hermes 4