Same here. My two biggest hurdles are:
1. Like you mentioned, the second I start talking about something, I totally forget where I'm going and have to pause; it's like my thoughts aren't coming to me. Probably some sort of mental feedback loop plus, like you mentioned, a different method of thinking.
2. In the back of my mind, I'm always self-conscious that someone is listening, so there's a privacy / being judged / being overheard feeling, which adds another layer of mental feedback.
There also aren't great audio cues for handling on-the-fly editing. I've tried saying "parentheses word parentheses" and it just gets written out. I've tried saying "strike that" and it gets written out too. These interfaces are very 'happy path' and don't do a lot of processing (on iOS, I can say "period" and get a '.', and likewise for '?' and '!', but that's about the extent of it).
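To make it concrete, here's a rough sketch of the kind of post-processing I wish these tools did on the raw transcript. It's purely hypothetical: the function name `apply_spoken_edits` and both rules are my own invention, not how iOS dictation or any transcription app actually works. It assumes "strike that" means "drop everything I said since the last sentence ended" and that "parentheses X parentheses" should wrap X in brackets.

```python
import re

# Hypothetical spoken-command patterns (assumptions, not any real app's grammar).
PAREN = re.compile(r"\bparentheses\s+(.+?)\s+parentheses\b", re.IGNORECASE)
STRIKE = re.compile(r"\s*\bstrike that\b[.,]?\s*", re.IGNORECASE)


def apply_spoken_edits(transcript: str) -> str:
    """Apply two spoken editing commands to a raw transcript string."""
    # Split the transcript at every "strike that".
    parts = STRIKE.split(transcript)
    kept = parts[0]
    for following in parts[1:]:
        # Discard the unfinished fragment after the last sentence-ending mark.
        cut = max(kept.rfind("."), kept.rfind("?"), kept.rfind("!"))
        kept = kept[: cut + 1]
        kept = f"{kept} {following}".strip() if kept else following
    # Turn "parentheses word parentheses" into "(word)".
    return PAREN.sub(r"(\1)", kept).strip()


raw = ("Meet at noon. Bring the old draft strike that "
       "bring the parentheses final parentheses draft.")
print(apply_spoken_edits(raw))
# -> Meet at noon. bring the (final) draft.
```

Even this toy version shows why it's hard: it can't tell a command from someone literally dictating the words "strike that", and it doesn't fix capitalization after the cut.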
I have had some success with long-form recording sessions which are transcribed afterwards. After getting over the short initial hump, I can brain-dump to the recording, and then trust an app like Voice Notes or Superwhisper to transcribe, and then clean up after.
The main issue I run into there, though, is that I either forget to record something (e.g. a conversation that I want to review later) or there is too much friction: I don't record often enough to launch the app quickly, or even to remember that the workflow exists.
I get the same feeling with smart home stuff - it was awesome for a while to turn lights on and off with voice, but lately there's the added overhead of "did it hear me? do I need to repeat myself? What's the least amount of words I can say? Why can't I just think something into existence instead? Or have a perfect contextual interface on a physical device?"