For example, it somehow conflates Llama 4 Maverick's custom Arena chatbot version with Behemoth, falsely claiming that the former is what's holding up the latter's release. It also claims 40B of internet text data is 10B tokens, which seems a little odd. Llama 405B was also trained on more than 15 trillion tokens[1], yet the post claims only 3.67 trillion for some reason. And it doesn't mention Mistral Large at all, even though it was the first good European 100B+ dense model.
>The MoE arch. enabled larger models to be trained and used by more people - people without access to thousands of interconnected GPUs
You still need thousands of GPUs to train an MoE model of any actual use. The claim is closer to true for inference, in the sense that an MoE is faster per token, but even that has caveats: MoE models are weaker than dense models of the same total parameter count, though the trade-off has apparently been worth it in many cases. You also didn't need thousands of GPUs to run inference before, even for the largest models.
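To make the inference point concrete, here's a rough sketch of the active-vs-total parameter math. The numbers are my own ballpark figures for a Mixtral-8x7B-style setup (8 experts, 2 active per token), not anything from the article:

    # Rough sketch: why MoE decoding is cheaper per token than a dense model
    # of the same total size. Parameter counts are approximate guesses.
    def moe_param_counts(num_experts=8, active_experts=2,
                         expert_params_b=5.6, shared_params_b=2.0):
        """Return (total, active) parameter counts in billions."""
        total = shared_params_b + num_experts * expert_params_b
        active = shared_params_b + active_experts * expert_params_b
        return total, active

    total_b, active_b = moe_param_counts()
    print(f"~{total_b:.0f}B total params, ~{active_b:.0f}B active per token")
    # -> ~47B total, ~13B active: only ~13B worth of weights are touched per
    #    token, so decoding is faster than a dense ~47B model, but all ~47B
    #    still have to sit in memory, so the hardware bill doesn't vanish.

That's the whole trade-off in one place: less compute per token, same memory footprint.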
The conclusion is all over the place and full of weird, incorrect implications. The title is about how big LLMs are, so why is there such a focus on training-token counts? There's also no mention of quantized sizes. This is a bad AI slop article (whoops, turns out the author says it wasn't AI generated, so it's a bad human slop article).
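On the quantized-size point, a quick back-of-the-envelope for weights-only footprint (ignoring KV cache and runtime overhead), since that's what actually decides whether a model fits on your hardware:

    # Back-of-the-envelope weights-only footprint; ignores KV cache,
    # activations, and per-layer overhead.
    def weights_size_gb(params_b, bits_per_weight):
        # billions of params * (bits / 8) bytes per param = GB (the 1e9s cancel)
        return params_b * bits_per_weight / 8

    for bits in (16, 8, 4):
        print(f"405B @ {bits}-bit: ~{weights_size_gb(405, bits):.0f} GB")
    # fp16 ~810 GB, int8 ~405 GB, 4-bit ~200 GB

Those numbers matter a lot more to "how big are LLMs" than the training-token counts the post fixates on.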