Of course, for such tasks we could benchmark them (a minimal check harness is sketched after the list):
* arithmetic (why would we use an LLM for that?)
* correct JSON syntax, correct command lines, etc.
* looking for specific information in a text
* looking for missing information in a text
* language logic (if/then/else cases where we know the answer in advance)
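
As a rough illustration, here is a minimal sketch of what such checks could look like in Python. This is not a real benchmark suite: `ask_llm` is a hypothetical placeholder for whatever client call you actually use, and the scoring is deliberately naive.

```python
# Minimal sketch of mechanical checks like those listed above.
# `ask_llm` is a hypothetical stand-in for your actual model call;
# everything else uses only the standard library.
import json
import re


def ask_llm(prompt: str) -> str:
    # Placeholder: replace with a real call to your model of choice.
    raise NotImplementedError


def check_arithmetic(a: int, b: int) -> bool:
    """Ask for a sum and compare against the exact answer."""
    reply = ask_llm(f"What is {a} + {b}? Answer with the number only.")
    match = re.search(r"-?\d+", reply)
    return match is not None and int(match.group()) == a + b


def check_json_syntax(prompt: str) -> bool:
    """The output must parse as JSON, nothing more."""
    reply = ask_llm(prompt + "\nRespond with valid JSON only.")
    try:
        json.loads(reply)
        return True
    except json.JSONDecodeError:
        return False


def check_needle(text: str, question: str, expected: str) -> bool:
    """Looking for a specific piece of information planted in a text."""
    reply = ask_llm(f"{text}\n\nQuestion: {question}")
    return expected.lower() in reply.lower()
```

Each of these checks has an unambiguous pass/fail criterion, which is exactly what makes such tasks easy to benchmark in the first place.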
But by Goodhart's Law, LLMs trained to succeed on those benchmarks might lose capability on the tasks where we really need them (fuzzy inputs, fuzzy outputs).