It’s ironic: for years, the open-source community tried to match GPT-3 (175B dense) with 30B–70B models plus RLHF and synthetic data, and the performance gap persisted.
Turns out, size really did matter, at least at the base-model level. Only with the release of truly massive dense models (405B) or large MoE models (DeepSeek V3, DBRX, etc.) did we start seeing GPT-4-level reasoning emerge outside closed labs.