Why would you control the inference at the token level? Wouldn’t the more obvious (and technically superior) place to steer the model’s repeated exploration of the search space be the inference engine itself?
Doing it by saying “Wait” feels like fixing dad’s laptop over a phone call. You’ll get there, but driving over and getting hands-on is the more effective solution. Realistically, I know that getting “hands-on” with the underlying inference architecture is way beyond my own technical ability. Maybe it’s not even feasible, like trying to fix a cold with brain surgery?
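For what it’s worth, the “phone call” version of this trick is simple enough to sketch. The idea is: when the model tries to close its reasoning early, suppress the end-of-thinking marker and inject “Wait” so it keeps going until a minimum token budget is spent. Everything below is hypothetical scaffolding, not a real API: `fake_model` is a stand-in for an actual LLM sampler, and the names `END`, `generate_with_budget`, and `min_tokens` are my own.

```python
# Token-level "Wait" injection, sketched with a stub model.
# A real implementation would replace fake_model with calls into an
# actual LLM's token sampler; everything here is illustrative.

END = "</think>"  # hypothetical end-of-thinking marker

def fake_model(tokens):
    """Stand-in model: emits one reasoning step, then tries to stop."""
    return tokens + ["step", END]

def generate_with_budget(prompt, min_tokens):
    """Keep forcing more reasoning until at least min_tokens are emitted.

    The budget counts everything after the prompt, including the
    injected "Wait" tokens themselves.
    """
    tokens = list(prompt)
    while True:
        out = fake_model(tokens)
        new = out[len(tokens):]
        if END in new:
            # Model wants to stop: keep its tokens up to the marker.
            tokens += new[:new.index(END)]
            if len(tokens) - len(prompt) >= min_tokens:
                tokens.append(END)   # budget spent: let it finish
                return tokens
            tokens.append("Wait")    # budget remains: force more thought
        else:
            tokens += new
```

Crude as it is, this is exactly the “over the phone” flavor of control: no access to the engine’s internals, just a loop around the token stream that second-guesses the model’s decision to stop.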