OpenAI didn’t conduct the bar exam study; Casetext and Stanford did (gotta read those footnotes). The questions were from after the knowledge cutoff and passed the contamination check.
The main issue is the inapplicability of a test designed for humans, because an LLM's cognition is very different from a human's. Worrying about contamination presumes that the style of test is applicable in the first place.