Hermes 4

(hermes4.nousresearch.com)

202 points sibellavia | 1 comments | 27 Aug 25 08:58 UTC

Technical report: https://arxiv.org/pdf/2508.18255

lern_too_spel ◴[29 Aug 25 21:11 UTC] No.45069425[source]▶

The charts are utter nonsense. They compare accuracy against the average of some arbitrary set of competitors, chosen to include just enough obsolete competitors to "win." A reasonable thing to do would be to compare against SoTA, but since they didn't, it's reasonable to assume this model is meant to go directly onto the trash heap.

replies(3): >>45069769 #>>45069848 #>>45069996 #

jug ◴[29 Aug 25 22:10 UTC] No.45069996[source]▶

>>45069425 #

The tech report compares against DeepSeek R1 671B, DeepSeek V3 671B, and Qwen3 235B, which have been regarded as SOTA-class among "open" models.

I think this one holds its own surprisingly well in benchmarks given that it uses the by now rather, let’s say, battle-tested Llama 3.1 base, which is a testament to its quality. (Llama 3.2 & 3.3 didn’t employ new bases IIRC, only new fine-tunes, which I think explains why Hermes 4 is still based on 3.1… and of course Llama 4 never happened, right guys?)

However, for real use I wouldn’t bother with the 405B model. I think the age of the base is kind of showing, especially in long contexts. It’s like throwing a load of compute at something that is kinda aged to begin with. You’d probably be better off with DeepSeek V3.1 or (my new favorite) GLM 4.5. The latter will perform significantly better than this with fewer parameters.

The 70B one seems more sensible to me, if you want (yet another) decent unaligned model to have fun with for whatever reason.

replies(1): >>45071103 #

1. BoorishBears ◴[30 Aug 25 01:18 UTC] No.45071103[source]▶

>>45069996 #

You seem to be missing that the release announcement does have a very ridiculous graph, which their comment is right to call out:

- For refusals they broke out each model's percentage.

- For "% of Questions Correct by Category" they literally grouped an unnamed set of models, averaged out their scores, and combined them as "Other"...

That's hilariously sketchy.
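
To make the averaging problem concrete, here is a rough sketch with entirely made-up numbers (none of these scores or model names come from the report or the announcement). It shows how a model can beat the average of a grouped "Other" bucket while still losing to the best model hidden inside it:

```python
# Hypothetical percent-correct scores, invented purely for illustration.
scores = {
    "model_under_test": 70.0,
    "competitor_a": 82.0,
    "competitor_b": 74.0,
    "competitor_c": 45.0,  # an older, weaker model drags the average down
    "competitor_d": 40.0,
}


def compare(scores, target):
    """Compare `target` against the average and the best of all other models."""
    others = {name: s for name, s in scores.items() if name != target}
    average_other = sum(others.values()) / len(others)
    best_name, best_score = max(others.items(), key=lambda item: item[1])
    return {
        "average_other": average_other,
        "best_other": best_name,
        "beats_average": scores[target] > average_other,
        "beats_best": scores[target] > best_score,
    }


result = compare(scores, "model_under_test")
print(result)
```

With these invented numbers the model clears the "Other" average but not the strongest competitor, which is why a per-model breakdown (as was done for refusals) is the more informative chart.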

It's also strange that the graph for "Questions Correct" includes creativity and writing. Those don't have correct answers, only win rates, and wouldn't really fit into the same graph.