←back to thread

The era of open voice assistants

(www.home-assistant.io)
931 points _Microft | 1 comments | | HN request time: 0.254s | source
Show context
lxe ◴[] No.42468351[source]
Here's what I'm looking for in a voice assistant:

- Full privacy: nothing goes to the "cloud"

- Non-shitty microphones and processing: i want to be able to be heard without having to yell, repeat, or correct

- No wake words: it should listen to everything, process it, and understand when it's being addressed. Since everything is private and local, this is now doable

- Conversational: it should understand when I finished talking, have ability to be interrupted, all with low latency

- Non-stupid: it's 2024, and alexa and siri and google are somehow absolutely abysmal at doing even the basics

- Complete: i don't want to use an app to get stuff configured. I want everything to be controlled via voice

replies(5): >>42468394 #>>42468471 #>>42468967 #>>42470013 #>>42471806 #
danparsonson ◴[] No.42468394[source]
> No wake words: it should listen to everything, process it, and understand when it's being addressed

Even humans struggle with this one - that's what names are for!

replies(2): >>42468438 #>>42481564 #
lxe ◴[] No.42481564[source]
Wake words are different from "listen to everyhing until name is called". A wake work is needed for both privacy and technical reasons -- you can't just have alexa beaming everything it hears to amazon. So instead it uses a local lightweight "dumb" system to listen to specific words only.

That's exactly why there's massive latencies between command recognition, processing, and execution.

Imagine if it had sub-ms response to "assistant, add uuh eggs and milk to the shopping list... actually no just eggs sorry"

replies(1): >>42485648 #
1. danparsonson ◴[] No.42485648[source]
Sure OK, maybe it's a beneficial side effect then. However you look at it, trying to get the computer to decide when you are addressing it, without using a name of some sort, could be a very challenging problem to solve, one that even humans struggle with. Surely you've been in a situation where you say something to a room and multiple people think you're talking to them? To borrow an example from elsewhere in the thread, if you say "turn on the lights", are you talking to the computer controlling the room lights, or the human standing next to the Christmas tree?

> Imagine if it had sub-ms response to "assistant, add uuh eggs and milk to the shopping list... actually no just eggs sorry"

Could you elaborate on that? What if that were true?