
340 points | agomez314 | 1 comment
swyx | No.35248981
> This is a brittle method. If a test problem were present in the training set with names and numbers changed, it wouldn’t be detected. Less flaky methods are readily available, such as embedding distances.
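A minimal sketch of the embedding-distance check the quoted comment alludes to, using toy vectors (in practice the embeddings would come from an actual embedding model, and a small distance between a test item and a training item would flag possible contamination even after names and numbers are changed):

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity; values near 0 mean the two
    # embeddings are nearly identical, which is suspicious
    # for a supposedly unseen test item
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

# toy embeddings standing in for model output
train_item = [0.9, 0.1, 0.2]
test_item = [0.88, 0.12, 0.19]   # nearly identical -> likely contaminated
unrelated = [0.0, 1.0, 0.0]      # far away -> probably genuinely unseen

contaminated = cosine_distance(train_item, test_item) < 0.1
```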

well honestly I think this is a temporary problem for GPT-4. what you do is fuzz your benchmarks by rephrasing them with GPT itself, the same way the image-AI people make their models robust to perturbations. you can generate 100 variations for every 1 "real" test, then train to pass those. you've just unlocked GPT-5.
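As a rough illustration of the fuzzing idea, here is a sketch that varies the surface details (names and numbers) of one "real" benchmark item. The deterministic template swap is a hypothetical stand-in for the GPT-based rephrasing the comment proposes:

```python
import random

def fuzz_problem(template, n_variants=100, seed=0):
    # hypothetical stand-in for LLM rephrasing: generate
    # surface-level variations of a single benchmark item
    rng = random.Random(seed)
    names = ["Alice", "Bob", "Carol", "Dave"]
    variants = []
    for _ in range(n_variants):
        name = rng.choice(names)
        a, b = rng.randint(2, 9), rng.randint(2, 9)
        variants.append({
            "question": template.format(name=name, a=a, b=b),
            "answer": a * b,
        })
    return variants

# 100 variations for every 1 "real" test, as the comment suggests
variants = fuzz_problem(
    "{name} buys {a} boxes of {b} pencils. How many pencils is that?"
)
```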

YeGoblynQueenne | No.35253248
You've just unlocked overfitting.