
340 points | agomez314 | 3 comments
1. wdefoor ◴[] No.35246425[source]
OpenAI didn’t conduct the bar exam study, Casetext and Stanford did (gotta read those footnotes). The questions were from after the knowledge cutoff and passed the contamination check.
replies(2): >>35247112 #>>35250825 #
2. programmarchy ◴[] No.35247112[source]
Sourcing and contamination are covered in the appendices of the OpenAI paper, which this article quotes and uses to critique OpenAI's method for detecting contamination.

> Because of OpenAI’s lack of transparency, we can’t answer the contamination question with certainty. But what’s certain is that OpenAI’s method to detect contamination is superficial and sloppy:

> > “We measure cross-contamination between our evaluation dataset and the pre-training data using substring match. Both evaluation and training data are processed by removing all spaces and symbols, keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (or use the entire example if it’s less than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples. We discard these and rerun to get uncontaminated scores.”

> This is a brittle method. If a test problem were present in the training set with names and numbers changed, it wouldn’t be detected. Less flaky methods are readily available, such as embedding distances.

> If OpenAI were to use a distance-based method, how similar is too similar? There is no objective answer to this question. So even something as seemingly straightforward as performance on a multiple-choice standardized test is fraught with subjective decisions.
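For concreteness, here's a minimal sketch of the substring-match check as the quoted passage describes it; the function names and defaults are mine, not OpenAI's actual code:

```python
import random
import re

def normalize(text: str) -> str:
    """Strip spaces and symbols, keeping only letters and digits, per the quoted description."""
    return re.sub(r"[^A-Za-z0-9]", "", text)

def is_contaminated(eval_example: str, training_examples: list[str],
                    n_samples: int = 3, length: int = 50) -> bool:
    """Flag an eval example if any sampled 50-char substring appears verbatim in training data."""
    processed_eval = normalize(eval_example)
    processed_train = [normalize(t) for t in training_examples]

    if len(processed_eval) <= length:
        samples = [processed_eval]  # use the whole example when it's shorter than the window
    else:
        starts = [random.randrange(len(processed_eval) - length + 1) for _ in range(n_samples)]
        samples = [processed_eval[s:s + length] for s in starts]

    return any(sample in train for train in processed_train for sample in samples)
```

You can see the brittleness directly: a single changed name or number inside each sampled window breaks the exact match, which is the failure mode the article describes.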
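And a rough sketch of the embedding-distance alternative the article gestures at, using sentence-transformers; the model choice and the 0.9 threshold are arbitrary placeholders, and picking that threshold is exactly the subjective decision the article points out:

```python
from sentence_transformers import SentenceTransformer, util

def is_contaminated_by_embedding(eval_example: str,
                                 training_examples: list[str],
                                 threshold: float = 0.9) -> bool:
    """Flag an eval example whose embedding is closer than `threshold` to any training example."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model would do
    eval_vec = model.encode(eval_example, convert_to_tensor=True)
    train_vecs = model.encode(training_examples, convert_to_tensor=True)
    similarities = util.cos_sim(eval_vec, train_vecs)  # shape: (1, len(training_examples))
    return bool((similarities > threshold).any())
```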

3. calf ◴[] No.35250825[source]
The main issue is the inapplicability of a test designed for humans, because an LLM's cognition is very different. The contamination question presumes that this style of test is applicable in the first place.