OK, this is nit-picking, but it's very obvious that the voice samples these were trained on were recorded in different acoustic environments. There's noticeable reverb on the male voice that isn't there on the other one.
So that's a useful next step: for multi-voice TTS models, make them sound like they're in the same room.