Within that framing, I think it's easier to see where and how the model fits into the larger ecosystem. But, of course, the best benchmark will always be just using the model.
I think this one holds its own surprisingly well in benchmarks given that it uses the by-now rather, let's say, battle-tested Llama 3.1 base, which is a testament to its quality (Llama 3.2 & 3.3 didn't introduce new bases IIRC, only new fine-tunes, which I think explains why Hermes 4 is still based on 3.1… and of course Llama 4 never happened, right guys?).
However, for real use I wouldn't bother with the 405B model. The age of the base really shows, especially in long contexts; it's like throwing a load of compute at something that's dated to begin with. You'd probably be better off with DeepSeek V3.1 or (my new favorite) GLM 4.5. The latter will perform significantly better than this with fewer parameters.
The 70B one seems more sensible to me, if you want (yet another) decent unaligned model to have fun with for whatever reason.
- For refusals, they broke out each model's percentage.
- For "% of Questions Correct by Category" they literally grouped an unnamed set of models, averaged their scores, and lumped them together as "Other"...
That's hilariously sketchy.
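To make the problem concrete, here's a toy sketch with purely made-up numbers (not taken from their chart, models are hypothetical): averaging a group into a single "Other" bar can hide the fact that one model in that group individually wins the category.

```python
# Purely hypothetical per-category accuracies, NOT from the actual chart.
others = {"Model A": 0.81, "Model B": 0.55, "Model C": 0.52}
hermes = 0.70

# Collapsing the comparison models into one averaged "Other" bar...
other_avg = sum(others.values()) / len(others)

print(f"'Other' (averaged): {other_avg:.2f}")  # 0.63 -- looks worse than Hermes
print(f"Hermes:             {hermes:.2f}")     # 0.70
# ...even though Model A (0.81) individually beats it.
```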
It's also strange that the graph for "Questions Correct" includes creativity and writing. Those don't have correct answers, only win rates, and wouldn't really fit into the same graph.