However the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se, but using a judge of the same architecture as the thing being judged maximizes the probability of fundamental failure of the benchmark to be valid due to the judge having the exact same blind spots as the thing under test. As we, at the moment, lack a diversity of AI architectures that can play on the same level as LLMs, it is simply necessary for the only other known intelligence architecture, human brains, to be in the loop for now, however many other difficulties that may introduce into the testing procedures.
Tests that a "do nothing" AI can pass aren't intrinsically invalid but they should certainly be only a very small number of the tests. I'd go with low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased in the direction of "doing something", which is a valid failure state.