The era of open voice assistants

1. frognumber ◴[20 Dec 24 03:56 UTC] No.42468148[source]▶

I don't fully understand the cloud upsell. I have a beefy GPU. I would like to run the "more advanced" models locally.

By "I don't fully understand," I mean just that. There's a lot of marketing copy, but there's a lot I'd like to understand better before plopping down $$$ for a unit. The answers might be reasonable.

Ideally, I'd be able to experiment with a headset first, and if it works well, upgrade to the $59 unit.

I'd love to just have a README, with a getting started tutorial, play, and then upgrade if it does what I want.

Again: None of this is a complaint. I assume much of this is coming once we're past the preview edition, or is perhaps there and my search skills are failing me.

replies(5): >>42468158 #>>42468230 #>>42468247 #>>42468341 #>>42469791 #

2. trb ◴[20 Dec 24 03:59 UTC] No.42468158[source]▶

>>42468148 (TP) #

Finding microphones that look nice, can pick up voice at high enough quality to extract commands and that cover an entire room is surprisingly hard.

If this device delivers on audio quality it's totally worth it at $59.

replies(3): >>42469180 #>>42472386 #>>42476174 #

3. nickthegreek ◴[20 Dec 24 04:14 UTC] No.42468230[source]▶

>>42468148 (TP) #

The cloud sale is easy if you are an HA user already. If you don’t use Home Assistant right now, you probably aren’t the target audience. I purchase the yearly cloud service as it’s an easy way to support HA development. It also gives you remote access to your system without having to do any setup. It provides an HTTPS connection, which allows you to program ESP32 devices through Chrome. And now they have added the ability to do TTS and STT on someone else’s hardware. HA even allows you to set up a local LLM for house control commands but route other queries directly to the cloud.

replies(1): >>42473692 #

4. Jarwain ◴[20 Dec 24 04:20 UTC] No.42468247[source]▶

>>42468148 (TP) #

I can't speak to Home Assistant specifically, but the last time I looked at voice models, supporting multiple languages and doing it Really Well just happened to require a model with a massive amount of RAM, especially to run at anything resembling real-time.

It'd be awesome if they open sourced that model though, or published what models they're using. But I think it's unlikely to happen because Home Assistant is a sorta funnel to Nabu Casa.

That said, from what I can find, it sounds like Assist can be run without the hardware, either with or without the cloud upgrade. So you could definitely use your own hardware, headset, speakers, etc. to play with Assist.

replies(1): >>42473697 #

5. antonyt ◴[20 Dec 24 04:40 UTC] No.42468341[source]▶

>>42468148 (TP) #

You can do exactly that - set up an Assist pipeline that glues together services running wherever you want, including a GPU node for faster-whisper. The HA interface even has a screen where you can test your pipeline with your computer’s microphone.

It’s not exactly batteries-included, and doesn’t exercise the on-device wake word detection that satellite hardware would provide, but it’s doable.

But I don’t know that the unit will be an “upgrade” over most headsets. These devices are designed to be cheap, low-power, and have to function in tougher scenarios than speaking directly into a boom mic.
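For readers who want a mental model of what "gluing together services" means here, a minimal Python sketch follows. This is not Home Assistant's code or API: `AssistPipeline`, the stage type names, and the stub functions are all hypothetical. In a real setup each stage would call out to whatever service you run (for example, a faster-whisper server on a GPU box for the speech-to-text stage).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; none of these names come from Home Assistant.
SpeechToText = Callable[[bytes], str]   # audio in, transcript out
IntentHandler = Callable[[str], str]    # transcript in, reply text out
TextToSpeech = Callable[[str], bytes]   # reply text in, audio out


@dataclass
class AssistPipeline:
    """Chains three independently hosted stages into one voice round trip."""
    stt: SpeechToText
    handle: IntentHandler
    tts: TextToSpeech

    def run(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)
        reply = self.handle(transcript)
        return self.tts(reply)


# Stub stages standing in for real services (assumptions for illustration).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio "is" its transcript


def fake_handler(text: str) -> str:
    return f"OK: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = AssistPipeline(stt=fake_stt, handle=fake_handler, tts=fake_tts)
result = pipeline.run(b"turn off the office lights")
```

Because each stage is just a callable, swapping a local service for a remote one only changes the function passed in, not the pipeline itself.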

replies(2): >>42468621 #>>42473656 #

6. ilaksh ◴[20 Dec 24 05:48 UTC] No.42468621[source]▶

>>42468341 #

Does it use Node-RED for the pipeline?

replies(1): >>42468766 #

7. haddonist ◴[20 Dec 24 06:25 UTC] No.42468766{3}[source]▶

>>42468621 #

No, all of the voice parts are either inbuilt or direct addons.

8. bdavbdav ◴[20 Dec 24 08:08 UTC] No.42469180[source]▶

>>42468158 #

100%. For a lot of users that have WAF and time available to contend with, this is a steal.

Bear in mind that a $50 Google Home or Alexa mini(?) is always going to be whatever Google deem it to be. This is an open device which can be whatever you want it to be. That’s a lot of value in my eyes.

9. choffee ◴[20 Dec 24 10:21 UTC] No.42469791[source]▶

>>42468148 (TP) #

This device is just the mic/speaker/wakeword part. It connects to Home Assistant to do the decoding and automation. You can test it right now by downloading Home Assistant and running it on a Pi or a VM. You can run all the voice assist stuff locally if you want. There are services for voice to text, text to voice, and what they call intents, which are simple things like "turn off the lights in the office". The cloud offering from Nabu Casa not only funds the development of Home Assistant but also gives you remote access if you want it. As part of that you can choose to offload some of the voice/text services to their cloud, so that if you are just running it on a Pi it will still be fast.
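To make the "intents" idea concrete, here is a rough Python sketch of matching a sentence against templates with slots. It is a simplified regex stand-in, not how Home Assistant actually does it; the `TEMPLATES` table, the intent names, and both functions are made up for illustration.

```python
import re
from typing import Optional

# Hypothetical sentence templates; {name} marks a slot to capture.
TEMPLATES = {
    "turn_off": "turn off the {device} in the {area}",
    "turn_on": "turn on the {device} in the {area}",
}


def compile_template(template: str) -> re.Pattern:
    """Turn 'turn off the {device}' into a regex with named groups."""
    # re.split with a capturing group puts slot names at odd indices.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)      # literal text
        else:
            pattern += f"(?P<{part}>.+?)"   # slot
    return re.compile(pattern, re.IGNORECASE)


COMPILED = {name: compile_template(t) for name, t in TEMPLATES.items()}


def match_intent(sentence: str) -> Optional[tuple[str, dict[str, str]]]:
    """Return (intent name, slots) for the first matching template, else None."""
    text = sentence.strip().rstrip(".!?")
    for name, pattern in COMPILED.items():
        m = pattern.fullmatch(text)
        if m:
            return name, m.groupdict()
    return None


print(match_intent("Turn off the lights in the office."))
```

Anything that returns `None` here is the kind of query that would need a more capable handler than fixed templates.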

10. alias_neo ◴[20 Dec 24 16:19 UTC] No.42472386[source]▶

>>42468158 #

I've found it quite hard to find decent hardware with both the input capability needed for wakeword and audio capture at a distance, whilst also having decent speaker quality for music playback.

I started using the Box-3 with heywillow, which did amazing input and processing using ML on my GPU, but the speaker is awful. I built a speaker of my own using a Raspberry Pi Z2W, a DAC and some speakers in a 3D-printed enclosure I designed, and added a shim to the server so that responses came from my speaker rather than the cheap/tiny speaker in the Box-3. I'll likely do the same now with the Voice PE, but I'm hoping that the Grove connector can be used to plonk it on top of a higher quality speaker unit and make it into a proper music player too.

As soon as I have it in my hands, I intend to get straight to work looking at a way to modify my speaker design to become an addon "module" for the PE.

11. frognumber ◴[20 Dec 24 18:45 UTC] No.42473656[source]▶

>>42468341 #

It's an upgrade mostly because putting on a headset to talk to an assistant means it's not worth using the assistant.

12. frognumber ◴[20 Dec 24 18:48 UTC] No.42473692[source]▶

>>42468230 #

I don't mind paying for hardware. I do mind my privacy, and don't want that kind of information in the cloud, or even traces from encryption I haven't audited myself.

13. frognumber ◴[20 Dec 24 18:49 UTC] No.42473697[source]▶

>>42468247 #

*shrug* Whisper seems to do well on my GPU, and faster than realtime.
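"Faster than realtime" can be put as a number: the real-time factor, i.e. processing time divided by audio duration, where anything below 1.0 keeps up with live speech. A small Python sketch of measuring it follows; `real_time_factor` is a hypothetical helper, and the `time.sleep` call is only a stand-in for an actual transcription call.

```python
import time
from typing import Callable


def real_time_factor(transcribe: Callable[[], object], audio_seconds: float) -> float:
    """Return processing_time / audio_duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


# Stand-in workload: pretend a 10-second clip took about 0.05 s to transcribe.
rtf = real_time_factor(lambda: time.sleep(0.05), audio_seconds=10.0)
print(f"real-time factor: {rtf:.3f}")
```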

replies(2): >>42474063 #>>42476130 #

14. Jarwain ◴[20 Dec 24 19:24 UTC] No.42474063{3}[source]▶

>>42473697 #

Found what I was thinking of [1]

Part of my misremembering is that I was thinking of a smaller/IoT use case which, alongside the 10 GB VRAM requirement for the large multilingual model, felt infeasible -shrug-

[1] https://git.acelerex.com/automation/opcua.ts/-/project_membe...

15. paradox460 ◴[20 Dec 24 23:22 UTC] No.42476130{3}[source]▶

>>42473697 #

I've been using it to generate subtitles for home movies, for an aging family member who is losing their hearing, and it's phenomenal
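For anyone wanting to try the same, the subtitle half of the job is mostly formatting. Below is a rough Python sketch that turns timed segments into SubRip (.srt) text. It assumes the transcriber hands back segments as `(start, end, text)` tuples in seconds; that tuple shape and both function names are assumptions for illustration, so adapt them to whatever your tool actually returns.

```python
from typing import Iterable, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SubRip."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Build numbered SRT cues, separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


# Hypothetical segments, standing in for real transcription output.
print(to_srt([(0.0, 2.5, " Hello "), (2.5, 4.0, "World")]))
```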

16. squarefoot ◴[20 Dec 24 23:27 UTC] No.42476174[source]▶

>>42468158 #

In many cases the issue isn't the microphone but the horrid amount of reflections that the sound goes through before reaching it. A quite good microphone can be built using cheap, yet very clean, capsules like the AOM-5024L-HD-F-R (80 dB S/N), which is ~$3 at Mouser. Room acoustics are a lot more important, though, and a real pain in the ass to fix, when not also a bank account drain if done professionally. Usually carpets, wood furniture, curtains to cover glass and sound panels on concrete walls can be more than enough.