> One of the consequences of this is that we should always consider asking the LLM the same question more than once, perhaps with some variation in the wording. Then we can compare answers, indeed perhaps ask the LLM to compare answers for us. The difference in the answers can be as useful as the answers themselves.
There was once a coding agent that reached SOTA on SWE-bench Verified by "just" running the agent 5 times on each instance, scoring each attempt, and picking the highest-scoring one: https://aide.dev/blog/sota-bitter-lesson
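
For anyone curious, the general pattern (often called best-of-N sampling) fits in a few lines. This is only a rough sketch, not the code from that blog post: `generate` and `score` are hypothetical callables standing in for "run the agent once" and "rate an attempt", and how the linked agent actually scored its attempts is not shown here.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[], T],
    score: Callable[[T], float],
    n: int = 5,
) -> Tuple[T, List[Tuple[T, float]]]:
    """Run `generate` n times, score every attempt, return the best one.

    `generate` and `score` are placeholders (assumptions): in practice they
    would wrap an LLM or agent call and some evaluator, such as a test
    suite or a second model acting as a judge.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # Collect every attempt together with its score, in generation order.
    scored = [(attempt, score(attempt)) for attempt in (generate() for _ in range(n))]

    # max() keeps the first attempt when several share the top score.
    best_attempt, _ = max(scored, key=lambda pair: pair[1])

    # Return all scored attempts too, since the spread between them
    # can be as informative as the winner.
    return best_attempt, scored
```

Calling `best_of_n(run_agent, rate_patch, n=5)` with your own two functions would mirror the "run 5 times, pick the top score" setup. Returning the full list as well makes it easy to compare the answers against each other, which is the point of the quoted comment.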