It might be that our current text tokenization is inefficient compared to how well the image pipeline compresses its input. Language already does a lot of compression, but there might be an even better way to represent it in latent space.
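For a rough sense of how much compression a tokenizer already does, here is a minimal sketch (assuming Python with the tiktoken library and its "cl100k_base" encoding, neither of which is mentioned above; the sample sentence is arbitrary) that compares raw UTF-8 bytes against BPE token counts:

    # Minimal sketch: estimate bytes-per-token as a crude measure of how
    # much a BPE tokenizer compresses ordinary text before the model sees it.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    text = "Language models see text as a sequence of subword tokens."
    raw_bytes = len(text.encode("utf-8"))
    tokens = enc.encode(text)

    # Each token becomes one integer the model consumes, so a higher
    # bytes-per-token ratio means more compression at the tokenizer stage.
    print(f"bytes: {raw_bytes}, tokens: {len(tokens)}")
    print(f"bytes per token: {raw_bytes / len(tokens):.2f}")

Typical English prose comes out to a few bytes per token, which is the compression the comment refers to; whether a learned latent representation could do meaningfully better is the open question.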
replies(3):
If we had a million times the compute? We might have brute-forced our way to AGI by now.