You cannot have AGI without a physical manifestation that can generate its own training data from the outside world (e.g. through sensors) and constantly refine its model.
Pure language or pure image models cover just one aspect of intelligence: very refined pattern recognition.
You will also probably need some aspect of self-awareness in order for the system to set auxiliary goals and directives related to self-maintenance.
But you don't need AGI to have something useful, which I think a lot of readers are confused about. No one is arguing that you need AGI to bring tons of value.