Why would this be? I'm probably missing something.
Don't these LLMs fundamentally work by outputting a vector of scores (logits) over all possible tokens, which a sampler turns into a probability distribution (typically via some softmax variant) and then draws a random token from? That token becomes the newest input token, and the process repeats until some limit is hit or an end-of-output token is selected.
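Roughly this loop, as a minimal sketch (the `model` callable returning logits over the vocabulary is hypothetical, just standing in for a real forward pass):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng()):
    """Turn raw logits into a probability distribution and draw one token id."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # softmax with max-subtraction for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(model, prompt_ids, eos_id, max_tokens=256):
    """Autoregressive loop: sample, append, repeat until EOS or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        logits = model(ids)  # hypothetical: returns one logit per vocabulary token
        tok = sample_next_token(logits)
        ids.append(tok)
        if tok == eos_id:
            break
    return ids
```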
I don't see why limiting that sampling to the set of tokens valid under a grammar should be harmful versus repeatedly generating until you get something that fits the grammar (assuming identical input to both processes). That's especially true if you maintain the relative probabilities of the grammar-valid tokens in the restricted sampling; if the relative probabilities are allowed to change substantially, then I could see that giving worse results.
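By "maintain the relative probabilities" I mean something like the sketch below: set the logits of forbidden tokens to -inf before the softmax, so the valid tokens keep exactly the same odds relative to each other and only the forbidden mass gets redistributed. (The `valid_token_ids` set is hypothetical here, standing in for whatever a grammar engine would allow at this step.)

```python
import numpy as np

def sample_constrained(logits, valid_token_ids, rng=np.random.default_rng()):
    """Sample a token, but only from the ids the grammar currently allows."""
    valid = np.asarray(sorted(valid_token_ids))
    masked = np.full_like(logits, -np.inf)
    masked[valid] = logits[valid]
    # exp(-inf) == 0, so forbidden tokens get zero probability; among the valid
    # tokens the relative odds are unchanged from the unconstrained distribution.
    probs = np.exp(masked - masked[valid].max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))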
Now, I could certainly imagine that blindsiding the LLM with output restrictions, when it is expecting to give a freeform response, might give worse results than prompting it to produce output in that format without restricting it. (Simply because forcing an output shape that is not natural for it and not a good fit for its training can mean the LLM struggles to create good output.) I'd imagine the best results come from both: textually prompting it to give output in your desired format, plus constraining the output to keep it from accidentally going off the rails.