That doesn't mean the one that manages to spit it out of its latent space is close to AGI. I wonder how consistently that specific model could do it. If you tried 10 LLMs, maybe all 10 of them could have spit out the answer 1 time in 10. Correct problem retrieval by one LLM and failure by the others isn't a great argument for near-AGI. But LLMs will be useful in limited domains for a long time.