It's easy to be snarky about ill-informed and hyperbolic takes, but it's also pretty clear that large multi-modal models trained on the data we already have are eventually going to give us AGI.
IMO this will require not just much more expansive multi-modal training but also novel architectures, specifically recurrent approaches, plus a well-known set of capabilities most current systems lack: e.g. the integration of short-term memory (the context window, if you like) into long-term memory, episodic or otherwise.
But these are, as we say, mere matters of engineering.
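To make the short-term-into-long-term idea concrete, here's a toy sketch (not any real system's design; the class, the consolidation rule, and the word-overlap retrieval are all illustrative assumptions): items evicted from a fixed-size "context window" get consolidated into an episodic store that can later be queried.

```python
from collections import deque

class EpisodicMemory:
    """Toy sketch: items evicted from a short-term context window are
    consolidated into a long-term episodic store, retrieved by naive
    word overlap. Purely illustrative, not a real architecture."""

    def __init__(self, context_size=3):
        self.context = deque(maxlen=context_size)  # short-term "context window"
        self.episodes = []                         # long-term episodic store

    def observe(self, text):
        if len(self.context) == self.context.maxlen:
            # consolidation: the oldest item is about to leave
            # short-term memory, so write it to the episodic store
            self.episodes.append(self.context[0])
        self.context.append(text)  # deque with maxlen auto-evicts the oldest

    def recall(self, query, k=1):
        # retrieve past episodes by word overlap with the query
        q = set(query.lower().split())
        scored = sorted(self.episodes,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return scored[:k]

m = EpisodicMemory(context_size=3)
for line in ["alice met bob in paris",
             "the weather was rainy",
             "bob likes coffee",
             "carol joined later",
             "they discussed plans"]:
    m.observe(line)
print(m.recall("where did alice meet bob"))  # → ['alice met bob in paris']
```

A real version would replace the overlap heuristic with learned retrieval and the eviction rule with some trained consolidation policy, but the control flow is the point: nothing leaves the window without a chance to persist.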