And on back order everywhere. I just spent the last two weeks getting an ESP32-S3-BOX set up to do this, but its lack of audio out really irks me.
 replies(3): 
The audio out is terrible, so I wrote a shim server that captures the request to the TTS server for heywillow and sends it to a speaker I built myself, running MPD on a Pi with a nice DAC, so the responses play there instead of on the box-3's tiny speaker.
I don't expect the audio-out on this to be much better with its tiny speaker, but at least it has a 3.5mm jack.
I'm going to look into what that Grove port can do too and perhaps build a new speaker "module" that the Voice PE can sit on top of to make it a proper music device.
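For anyone curious what the shim-server idea might look like, here is a rough sketch, not my actual code. It assumes the box fetches speech with a plain HTTP GET and tolerates an empty 204 reply, and that MPD can stream the TTS URL directly. The hostnames, the port 8080, and the names `TTS_UPSTREAM`, `mpd_commands`, `send_to_mpd`, `Shim` and `run` are all made up for illustration. The only real protocol detail used is MPD's plain-text command interface (`clear`, `add`, `play`) on its default port 6600.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical addresses: replace with your own TTS server and MPD speaker.
TTS_UPSTREAM = "http://tts.example.local:19000"
MPD_HOST, MPD_PORT = "speaker.example.local", 6600  # 6600 is MPD's default port


def mpd_commands(stream_url):
    """Build the MPD protocol lines that replace the queue with one URL and play it."""
    # MPD arguments are double-quoted; backslashes and quotes must be escaped.
    escaped = stream_url.replace("\\", "\\\\").replace('"', '\\"')
    return ["clear", f'add "{escaped}"', "play"]


def send_to_mpd(stream_url, host=MPD_HOST, port=MPD_PORT):
    """Send the commands to MPD over its text protocol, checking each reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        stream = sock.makefile("rwb")
        greeting = stream.readline()  # MPD greets with "OK MPD <version>"
        if not greeting.startswith(b"OK MPD"):
            raise RuntimeError(f"unexpected MPD greeting: {greeting!r}")
        for command in mpd_commands(stream_url):
            stream.write(command.encode("utf-8") + b"\n")
            stream.flush()
            reply = stream.readline()  # "OK" on success, "ACK ..." on error
            if not reply.startswith(b"OK"):
                raise RuntimeError(f"MPD rejected {command!r}: {reply!r}")


class Shim(BaseHTTPRequestHandler):
    """Stands in for the TTS server: redirects the audio to MPD instead."""

    def do_GET(self):
        # Assumption: the TTS request is a GET whose path and query string
        # can be replayed unchanged against the real TTS server.
        send_to_mpd(TTS_UPSTREAM + self.path)
        # Assumption: the box accepts an empty reply and simply plays nothing.
        self.send_response(204)
        self.end_headers()


def run(port=8080):
    """Start the shim; point the box's TTS URL at this host and port."""
    HTTPServer(("", port), Shim).serve_forever()
```

Call `run()` on any machine the box can reach. Having MPD fetch the audio itself keeps the shim tiny; if your TTS server is slow or needs headers MPD can't send, you would have to download the audio in the shim and serve it to MPD some other way.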