The era of open voice assistants

(www.home-assistant.io)

931 points _Microft | 1 comments | 20 Dec 24 00:29 UTC

lxe ◴[20 Dec 24 04:42 UTC] No.42468351[source]▶

Here's what I'm looking for in a voice assistant:

- Full privacy: nothing goes to the "cloud"

- Non-shitty microphones and processing: I want to be heard without having to yell, repeat, or correct myself

- No wake words: it should listen to everything, process it, and understand when it's being addressed. Since everything is private and local, this is now doable

- Conversational: it should understand when I've finished talking and be able to be interrupted, all with low latency

- Non-stupid: it's 2024, and Alexa and Siri and Google are somehow absolutely abysmal at doing even the basics

- Complete: I don't want to use an app to get stuff configured. I want everything to be controlled via voice

replies(5): >>42468394 #>>42468471 #>>42468967 #>>42470013 #>>42471806 #

danparsonson ◴[20 Dec 24 04:55 UTC] No.42468394[source]▶

>>42468351 #

> No wake words: it should listen to everything, process it, and understand when it's being addressed

Even humans struggle with this one - that's what names are for!

replies(2): >>42468438 #>>42481564 #

lxe ◴[21 Dec 24 19:15 UTC] No.42481564[source]▶

>>42468394 #

Wake words are different from "listen to everything until name is called". A wake word is needed for both privacy and technical reasons -- you can't just have Alexa beaming everything it hears to Amazon. So instead it uses a local lightweight "dumb" system to listen to specific words only.

That's exactly why there's massive latency between command recognition, processing, and execution.

Imagine if it had sub-ms response to "assistant, add uuh eggs and milk to the shopping list... actually no just eggs sorry"

replies(1): >>42485648 #

danparsonson ◴[22 Dec 24 11:10 UTC] No.42485648[source]▶

>>42481564 #

Sure OK, maybe it's a beneficial side effect then. However you look at it, trying to get the computer to decide when you are addressing it, without using a name of some sort, could be a very challenging problem to solve, one that even humans struggle with. Surely you've been in a situation where you say something to a room and multiple people think you're talking to them? To borrow an example from elsewhere in the thread, if you say "turn on the lights", are you talking to the computer controlling the room lights, or the human standing next to the Christmas tree?

> Imagine if it had sub-ms response to "assistant, add uuh eggs and milk to the shopping list... actually no just eggs sorry"

Could you elaborate on that? What if that were true?
