Clearly contaminated benchmarks are not very useful, but I do not understand the assertion that we should care more about "Qualitative studies of professionals using AI" than about "Comparison on real world tasks". I've looked through these benchmarks in detail, and I've come to the conclusion that real-world performance is all that matters. Everything else is either incredibly subjective or designed to beat a particular prior model.