The era of open voice assistants

1. fons ◴[20 Dec 24 09:46 UTC] No.42469574[source]▶

I wonder how this compares to the ReSpeaker Mic Array v2.0: https://wiki.seeedstudio.com/ReSpeaker_Mic_Array_v2.0/

The ReSpeaker has 4 mics and can easily cancel out the noise introduced by a custom external speaker.

2. stavros ◴[20 Dec 24 09:56 UTC] No.42469642[source]▶

I don't just want the hardware, I want the software too. I want something that will do STT on my speech, send the text to an API endpoint I control, and be able to either speak the text I give it, or live stream an audio response to the speakers.

That's the part I can't do on my own, and then I'll take care of the LLMs myself.
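
A minimal sketch of the "API endpoint I control" half of that, assuming the voice device POSTs JSON shaped like {"text": "..."} and expects {"speak": "..."} back. The payload shape, the port, and the build_reply helper are all hypothetical here, not something any particular firmware is known to send:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_reply(payload: dict) -> dict:
    """Turn a transcript payload into a reply payload (hypothetical schema)."""
    text = str(payload.get("text", "")).strip()
    if not text:
        return {"speak": "Sorry, I didn't catch that."}
    # Placeholder: a call to your own LLM would go here.
    return {"speak": f"You said: {text}"}


class TranscriptHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client announced.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "expected a JSON object")
            return
        body = json.dumps(build_reply(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "0.0.0.0", port: int = 8080) -> None:
    """Block forever, answering transcript POSTs."""
    HTTPServer((host, port), TranscriptHandler).serve_forever()
```

Calling serve() starts the listener; the STT and the audio playback on the device side are the parts this sketch does not cover.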

replies(1): >>42472447 #

3. robotfelix ◴[20 Dec 24 10:44 UTC] No.42469951[source]▶

>>42469574 (TP) #

It's worth noting that this product is listed in the "Discontinued Products" section of the linked wiki.

Both of the ReSpeaker products in the non-discontinued section (ReSpeaker Lite, ReSpeaker 2-Mics Pi HAT) have only 2 mics, so it appears that things are converging in that direction.

replies(1): >>42472400 #

4. alias_neo ◴[20 Dec 24 16:21 UTC] No.42472400[source]▶

>>42469951 #

The S3-Box-3 also only has two mics, and I found I can talk to that from another room of the house and it detects what I said perfectly fine.

5. alias_neo ◴[20 Dec 24 16:27 UTC] No.42472447[source]▶

>>42469642 #

All of these components are available separately or as add-ons for Home Assistant.

I currently do STT with heywillow[0] and an S3-Box-3, which uses an LLM running on a server I have to do incredibly fast, incredibly accurate STT. It uses Coqui XTTS for TTS, with a very high quality LLM-based voice; you can also clone a voice by supplying it with a few seconds of audio (I tested cloning my own, with frightening results).

Playback to a decent speaker can be done in a bunch of ways. I wrote a shim that captures the TTS request to Coqui and forwards it to a Pi-based speaker I built running MPD, which then requests the audio from the TTS server (Coqui) and plays it back on my higher-quality speaker instead of the crappy ones built into the voice-input devices.
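
A rough sketch of the MPD side of such a shim, assuming the TTS server exposes the rendered audio at a URL that MPD can fetch. The host name speaker.local and the example URL are hypothetical; the commands themselves (clear, add, play inside a command list) are from the standard MPD text protocol on port 6600:

```python
import socket


def mpd_quote(arg: str) -> str:
    """Quote an argument for the MPD text protocol."""
    return '"' + arg.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_play_commands(audio_url: str) -> bytes:
    """Build a command list that replaces the queue with one URL and plays it."""
    lines = [
        "command_list_begin",
        "clear",
        f"add {mpd_quote(audio_url)}",
        "play",
        "command_list_end",
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


def play_on_mpd(audio_url: str, host: str = "speaker.local", port: int = 6600) -> str:
    """Send the command list to MPD and return its one-line status reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        reader = sock.makefile("r", encoding="utf-8", newline="\n")
        greeting = reader.readline()  # MPD greets with "OK MPD <version>"
        if not greeting.startswith("OK MPD"):
            raise RuntimeError(f"unexpected greeting: {greeting!r}")
        sock.sendall(build_play_commands(audio_url))
        reply = reader.readline().strip()  # "OK" on success, "ACK ..." on error
        if reply != "OK":
            raise RuntimeError(f"MPD error: {reply}")
        return reply
```

Something like play_on_mpd("http://tts.local/last.wav") would then make the Pi pull the audio itself, which keeps the shim from having to proxy the audio bytes.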

If you just want to use what's available in HA, there's all of the Wyoming stuff: openWakeWord (not necessary if you're using this new Voice PE, because it does on-device wake word), Piper or MaryTTS (or others) for TTS, and Whisper (faster-whisper) for STT, or hook in something else you want to use. You can additionally use the Ollama integration to hook it into an Ollama model running on higher-end hardware for proper LLM-based reasoning.
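
For anyone who wants to poke at the Ollama side directly rather than through the HA integration (which is set up in the HA UI, not in code), here is a small sketch against Ollama's HTTP API. It assumes Ollama is listening on its default port 11434; the model name "llama3" and the helper names are placeholders:

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "llama3",  # placeholder; use whatever model you have pulled
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt: str, **kwargs) -> str:
    """Send the prompt and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        # With "stream": false the reply is a single JSON object.
        return json.loads(resp.read())["response"]
```

Pointing base_url at the box with the GPU is all it takes to move the reasoning step off the machine running HA.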

[0]heywillow.io

replies(1): >>42472565 #

6. stavros ◴[20 Dec 24 16:41 UTC] No.42472565{3}[source]▶

>>42472447 #

I do the same, but Willow has been unmaintained for close to a year, and calling it "incredibly fast" and "incredibly accurate" tells me that we have had very different experiences.

replies(1): >>42472649 #

7. alias_neo ◴[20 Dec 24 16:52 UTC] No.42472649{4}[source]▶

>>42472565 #

It's a shame it's been getting no updates; I noticed that too. But their secret sauce is all open stuff anyway, so you can just replace it with the upstream components; the box-3 firmware and the application server are really the bits they built (as well as the "correction" service).

If it wasn't fast or accurate for you, what were you running it on? I'm using the large model on a Tesla GPU in a Ryzen 9 server, using the XTTS-2 (Coqui) branch.

The thing about ML-based STT/TTS and the reasoning/processing is that you get better performance the more hardware you throw at it. I'm using nearly £4k worth of hardware to do it. Is it worth it? No. Is it reasonable? Also no, but I already had the hardware and it's doing other things.

I'll switch over to Assist and run Ollama instead, now that there's some better hardware with on-device wake word from Nabu.