One small point: token selection at each step is fine (and in fact required if you want the losses to decompose additively and independently across steps). The real problem here is the inaccuracy of each token, or, rather, of each token's predicted distribution. If you spend more time and space generating each token, those errors go down. And if more time and space cannot suffice, then, by construction, energy minimization models and any other solution you can think of also can't reduce the errors far enough.
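The "additive/independent losses" point can be made concrete: under the autoregressive factorization p(x) = ∏_t p(x_t | x_<t), the sequence negative log-likelihood is exactly the sum of per-step cross-entropy terms, so each step's loss can be handled on its own. A minimal sketch, using a made-up two-token conditional table (the `cond` values are purely illustrative, not from any real model):

```python
import math

# Hypothetical conditional distributions p(next | prev) over tokens {0, 1}.
# None stands in for the empty prefix (distribution of the first token).
cond = {
    None: {0: 0.6, 1: 0.4},
    0:    {0: 0.7, 1: 0.3},
    1:    {0: 0.2, 1: 0.8},
}

def sequence_nll(seq):
    """Joint NLL: -log of the product of the conditionals."""
    p = 1.0
    prev = None
    for tok in seq:
        p *= cond[prev][tok]
        prev = tok
    return -math.log(p)

def per_token_nll_sum(seq):
    """Sum of independent per-step cross-entropy terms."""
    total = 0.0
    prev = None
    for tok in seq:
        total += -math.log(cond[prev][tok])
        prev = tok
    return total

seq = [0, 1, 1]
# The two quantities agree (up to floating point), which is why
# per-step token losses can be treated additively and independently.
assert abs(sequence_nll(seq) - per_token_nll_sum(seq)) < 1e-9
```

The same identity is what lets training backpropagate through each token position separately; nothing about it depends on the toy table above.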