I work at Cartesia, which operates a TTS API similar to Play [1]. I’d be willing to venture a guess and say that our TTS model, Sonic, is probably SoTA for on-device, but don't quote me on that claim. It's the same model that powers our API.
Sonic can be run on a MacBook Pro. Our API sounds better, of course, since that's running the model on GPUs without any special tricks like quantization. But subjectively the on-device version is good quality and real-time, and it possesses all the capabilities of the larger model, such as voice cloning.
Our co-founders did a demo of the on-device capabilities on the No Priors podcast [2], if you're interested in checking it out for yourself. (I will caveat that this sounds quite a bit worse than if you heard it in person today, since this was an early alpha + it's a recording of the output from a MacBook Pro speaker.)
[1] https://cartesia.ai/sonic
[2] https://youtu.be/neQbqOhp8w0?si=2n1i432r5fDG2tPO&t=1886