At the risk of being annoying: answers that feel like high-quality human decision making are extremely pleasing and desirable. In the same way, image generators aren't generating six-fingered hands because they think that's more pleasing; they're doing it because they're trying to please and aren't good enough yet.
I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember a run with early GPT-4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models (GPT-4, GPT-4o, GPT-4.5, o3-mini, o3-mini-high) were terrible and hallucinated aggressively, with the exception of o1. DeepSeek R1 only mildly hallucinated but still gave bad answers. Gemini 2.5 was the only flagship model that didn't hallucinate and gave some decent answers.
I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about would be in their training data.
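For what it's worth, by "grounding" I just mean pasting the source material into the prompt so the model answers from it instead of recalling (or inventing) training data. A minimal sketch of that, assuming the official OpenAI Python SDK; the file name, question, and model choice are all placeholders:

```python
# Grounding sketch: feed reference text into the prompt so the model
# answers from the provided material rather than from memory.
# Assumes the official OpenAI Python SDK (openai >= 1.0);
# "characters.txt" and the model name are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Reference material gathered from wikis, episode guides, etc.
with open("characters.txt") as f:
    source_text = f.read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Answer only from the provided source text. "
                       "If the answer isn't in it, say so.",
        },
        {
            "role": "user",
            "content": f"Source text:\n{source_text}\n\n"
                       "Which characters in the game resemble "
                       "characters from the show, and why?",
        },
    ],
)
print(response.choices[0].message.content)
```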