The question is whether this approach still works when scaled to thousands or even millions of qubits. The team is optimistic that it does, but we will see.
In some quantum error correcting codes, there is a large set of operators with a useful property: if no error has occurred, measuring them does not change the state (assuming the measurement itself is made without error), but if an error has occurred, the measurement outcomes reveal some information about the kind of error, and that information can be used to choose the operations that correct it.
For a number of such schemes, there is a choice of strategy: on what schedule to perform which of these measurements, and how to correct the errors based on the outcomes. The sketch below shows the simplest version of the idea.
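To make that concrete, here is a minimal toy sketch (my own illustration, not anything from the paper): the 3-qubit repetition code, whose two check operators Z0*Z1 and Z1*Z2 play the role of those measurements. The matrix H and the variable names are mine.

```python
import numpy as np

# The 3-qubit repetition code's two check operators, Z0*Z1 and Z1*Z2,
# summarized as a parity-check matrix H. On an error-free codeword they
# report the all-zeros syndrome and leave the state untouched; a
# bit-flip error e instead reports the syndrome H @ e (mod 2).
H = np.array([[1, 1, 0],   # Z0*Z1: parity of qubits 0 and 1
              [0, 1, 1]])  # Z1*Z2: parity of qubits 1 and 2

for qubit in range(3):
    e = np.zeros(3, dtype=int)
    e[qubit] = 1              # a bit flip on one data qubit
    s = (H @ e) % 2           # the syndrome the checks would report
    print(f"flip on qubit {qubit} -> syndrome {s}")

# Every single-qubit flip yields a distinct nonzero syndrome, so it can
# be identified and corrected; no flip yields [0 0] and nothing changes.
```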
Disclaimer: I am one of the authors, but not a main contributor. I wrote the simulator they used and made some useful suggestions on how to use it to extract the information they wanted for training the models more efficiently, but I know nothing about transformers.
In a quantum computer, your logical quantum state is encoded across many physical qubits (called data qubits) in some special way. The errors that occur on these qubits are indeed arbitrary, and with enough physical qubits the system indeed cannot be practically simulated classically.
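As a deliberately tiny illustration of "encoded in some special way" (my toy example again, not a real code from the paper): the 3-qubit repetition code spreads one logical qubit across three data qubits.

```python
import numpy as np

# One logical qubit alpha|0_L> + beta|1_L>, encoded across three
# physical data qubits as alpha|000> + beta|111>.
alpha, beta = 0.6, 0.8                 # any amplitudes with |a|^2 + |b|^2 = 1
encoded = np.zeros(8, dtype=complex)   # statevector over 3 qubits
encoded[0b000] = alpha                 # |000> carries the logical |0>
encoded[0b111] = beta                  # |111> carries the logical |1>
```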
To tackle these errors, we do "syndrome measurement": we interact the data qubits with another set of physical qubits (called syndrome qubits) in a special way, and then measure the syndrome qubits. The quantum magic that happens is that the arbitrary errors get projected down to a finite, discrete set of classical errors on the data and syndrome qubits!!! Without this magic result we would have no hope for quantum computers.
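Here is a small self-contained demonstration of that projection, again on the toy 3-qubit repetition code (my example, not the paper's setup): a continuous over-rotation on one data qubit, once the parity checks are measured, collapses to either "no error" or "a definite bit flip".

```python
import numpy as np

theta = 0.4  # an arbitrary small over-rotation: a continuous, analog error

# Statevector of |000>, with qubit 0 as the most significant bit.
state = np.zeros(8, dtype=complex)
state[0] = 1.0

# Apply RX(theta) to data qubit 0 only.
rx = np.array([[np.cos(theta / 2), -1j * np.sin(theta / 2)],
               [-1j * np.sin(theta / 2), np.cos(theta / 2)]])
state = np.kron(rx, np.eye(4)) @ state

def syndrome(i):
    """Outcomes of the parity checks Z0*Z1 and Z1*Z2 for basis state i."""
    q0, q1, q2 = (i >> 2) & 1, (i >> 1) & 1, i & 1
    return (q0 ^ q1, q1 ^ q2)

# The probability of each syndrome outcome is the total squared
# amplitude in the matching eigenspace of the check operators.
probs = {}
for i, amp in enumerate(state):
    s = syndrome(i)
    probs[s] = probs.get(s, 0.0) + abs(amp) ** 2

for s, p in sorted(probs.items()):
    print(f"syndrome {s}: probability {p:.4f}")

# Only two outcomes appear:
#   (0, 0) with prob cos^2(theta/2) -> the state collapses back to |000>
#   (1, 0) with prob sin^2(theta/2) -> the state collapses to |100>
# The continuous rotation has been projected onto "no error" or "a
# definite bit flip on qubit 0", which a correction can then undo.
```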
Anyway, this is where a decoder - a classical algorithm running on a classical computer - comes in. OP is a decoder. It takes the syndrome qubit measurements and tries to figure out what classical errors occurred and what sort of correction, if any, is needed on the data qubits.
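In its simplest imaginable form, a decoder for the toy repetition code above is just a lookup table from syndrome to correction. Real decoders, whether matching algorithms or learned models like the OP's, do the same job for far larger codes and for noisy, repeated rounds of syndrome measurements; the table and names below are mine, purely for illustration.

```python
# Lookup-table decoder for the 3-qubit repetition code: the syndrome
# (Z0*Z1, Z1*Z2) uniquely identifies which data qubit, if any, needs
# an X correction.
LOOKUP = {
    (0, 0): None,  # no error detected
    (1, 0): 0,     # bit flip on data qubit 0
    (1, 1): 1,     # bit flip on data qubit 1
    (0, 1): 2,     # bit flip on data qubit 2
}

def decode(syndrome):
    """Return the data qubit to apply an X correction to, or None."""
    return LOOKUP[tuple(syndrome)]

print(decode((1, 0)))  # -> 0: flip data qubit 0 back
```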