With GPT-5 I sometimes see it spot a question that needs clarifying in its thinking trace, pick the most likely interpretation, and then open its final answer with "assuming you meant X ..." - I've even had it split the answer into two sections, one for each branch of a clear ambiguity.
So there are improvements from version to version - both from increases in raw model capability and from better training methods.
Put another way, if you don't care about details that change the answer, it directly implies you don't actually care about the answer.
Related silliness is how people force LLMs to give one word answers to underspecified comparisons. Something along the lines of "@Grok is China or US better, one word answer only."
At that point, just flip a coin. You obviously can't conclude anything useful from the response.
Interacting with a base model versus an instruction-tuned model will quickly show you the difference between a model's innate language faculties and its post-trained behavior.
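If you want to see that for yourself, here's a rough sketch using the Hugging Face transformers library - the checkpoint names are just placeholders, any base/instruct pair works the same way. Same prompt to both; the instruct model gets wrapped in the chat template it was post-trained with.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(model_name: str, prompt: str, chat: bool = False) -> str:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    if chat:
        # Instruction-tuned checkpoints expect the chat template they were post-trained with.
        prompt = tok.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False, add_generation_prompt=True,
        )
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=60, do_sample=False)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

q = "What is the capital of France?"
print(generate("meta-llama/Llama-3.1-8B", q))                      # base: tends to keep writing text around the question
print(generate("meta-llama/Llama-3.1-8B-Instruct", q, chat=True))  # instruct: answers it directly
```

The base model just continues the document it thinks it's in; the instruct model treats the text as a request. That gap is entirely post-training.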
The "naive" vision implementation for LLMs is: break the input image down into N tokens and cram those tokens into the context window. The "break the input image down" part is completely unaware of the LLM's context, and doesn't know what data would be useful to the LLM at all. Often, the vision frontend just tries to convey the general "vibes" of the image to the LLM backend, and hopes that the LLM can pick out something useful from that.
Which is "good enough" for a lot of tasks, but by no means all of them.
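For concreteness, here's a minimal PyTorch sketch of that naive pipeline - the shapes, sizes, and module names are illustrative assumptions, not any particular model's architecture. Patch the image, project every patch into the LLM's embedding space, and cram the result in front of the text tokens.

```python
import torch
import torch.nn as nn

class NaiveVisionFrontend(nn.Module):
    """Slices an image into fixed patches and projects each patch into the
    LLM's embedding space, with no knowledge of the prompt or the task."""
    def __init__(self, patch: int = 16, d_model: int = 4096):
        super().__init__()
        self.patch = patch
        # One flattened RGB patch in, one "image token" out.
        self.proj = nn.Linear(patch * patch * 3, d_model)

    def forward(self, image: torch.Tensor) -> torch.Tensor:  # image: (3, H, W)
        p = self.patch
        c, h, w = image.shape
        patches = (image
                   .unfold(1, p, p).unfold(2, p, p)   # (3, H/p, W/p, p, p)
                   .permute(1, 2, 0, 3, 4)
                   .reshape(-1, c * p * p))            # (N patches, 3*p*p)
        return self.proj(patches)                      # (N, d_model) image "tokens"

# The N image tokens are simply prepended to the text embeddings; the LLM is
# left to figure out whether anything in them is relevant to the prompt.
frontend = NaiveVisionFrontend()
image_tokens = frontend(torch.rand(3, 224, 224))       # (196, 4096)
text_tokens = torch.rand(1, 32, 4096)                  # embedded prompt (made up)
llm_input = torch.cat([image_tokens.unsqueeze(0), text_tokens], dim=1)
```

Note that the frontend never sees the prompt: whether you're asking about a serial number in the corner or the overall scene, it emits the same 196 tokens either way.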