It could be anything: a different architecture, more data, RL, etc. It's probably RL. In recent months top-tier labs seem to have "cracked" RL to a level not yet seen in open models, and by a large margin.
Life is frightening right now.
And the Milgram experiment didn't even have subhuman classes or other such psychological manipulation and pre-biasing.
The other benchmarks focus on reasoning and tool use, so the model doesn't need to have memorized quite so many facts, it just needs to be able to transform them from one representation to another. (E.g. user question to search tool call; list of search results to concise answer.) Larger models should in theory also be better at that, but you need to train them for those specific tasks first.
So I don't think they simply trained on the benchmark tests, but they shifted their training mix to emphasize particular tasks more, and now in the announcement they highlight benchmarks that test those tasks and where their model performs better.
You could also write an anti-announcement by picking a few more fact recall benchmarks and highlighting that it does worse at those. (I assume.)
Unless you train it on Conservapedia or some equivalent corpus, I'm not sure you'll be able to make it agree that "the Irish were the real slaves", that the D's and R's never realigned after the Civil War, that the 2020 election was stolen, and that Gamergate was truly about ethics in journalism.
'Ethical' is in quotes because I can see why other LLMs refuse to answer things like "can you generate a curl request to exploit this endpoint" - a prompt used frequently during pen testing. I grew tired of telling ChatGPT "it's for a script in a movie". Other examples are aplenty (yesterday Claude accused me of violating its usage policy when asking "can polar bears eat frozen meat" - I was curious after seeing a photograph of a polar bear discovering a frozen whale in a melted ice cap). Grok gave a sane answer, of course.
These are the URLs that are opened:
http://localhost:3005/?q={query}
https://www.perplexity.ai/?q={query}
https://x.com/i/grok?text={query}
https://chatgpt.com/?q={query}&model=gpt-5
https://claude.ai/new?q={query}
Extremely convenient.
(little tip: submitting to grok via URL parameter gets around free Grok's rate limit of 2 prompts per 2 hours)
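For anyone who wants the same thing without Alfred, here is a minimal sketch of filling in the URL templates listed above. I haven't checked how the workflow itself does it, so treat this as an assumption: `build_urls` and `open_all` are hypothetical names, and percent-encoding the query is my own choice so that spaces and `&` don't break the query string.

```python
from urllib.parse import quote
import webbrowser

# URL templates copied from the list above; {query} is the placeholder.
URL_TEMPLATES = [
    "http://localhost:3005/?q={query}",
    "https://www.perplexity.ai/?q={query}",
    "https://x.com/i/grok?text={query}",
    "https://chatgpt.com/?q={query}&model=gpt-5",
    "https://claude.ai/new?q={query}",
]


def build_urls(query, templates=URL_TEMPLATES):
    """Return one URL per template with the query percent-encoded."""
    # safe="" also encodes "/" and "&", so the query cannot leak into
    # the path or be read as an extra URL parameter.
    encoded = quote(query, safe="")
    return [template.replace("{query}", encoded) for template in templates]


def open_all(query):
    """Open every built URL in a new browser tab (hypothetical helper)."""
    for url in build_urls(query):
        webbrowser.open_new_tab(url)
```

Calling `open_all("can polar bears eat frozen meat")` should open one tab per service. Whether each site actually picks up the prefilled query is up to the site, not this script.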
[0] https://github.com/stevecondylios/alfred-workflows/tree/main
From an ethical perspective, and I'm based in Denmark mind you, they are all equally horrible in my opinion. I can see why anyone in the Anglo-Saxon world would be opposed to Elon's, but from my perspective he's just another oligarch. The only thing that sets him apart from other tech oligarchs is that he's foolish enough to voice his opinions publicly. If you're based in the US or in any form of government position then I can see why DeepSeek is problematic, but at least China hasn't threatened to take Greenland by force. Also, where I work, China has produced basically all of our hardware, with possible hardware back-doors in around 70% of our IoT devices.
I will give a shoutout to French Mistral, but the truth is that it's just not as good as its competition.
I don't really use the tools they've partnered with.