The era of open voice assistants

1. lxe ◴[20 Dec 24 04:42 UTC] No.42468351[source]▶

Here's what I'm looking for in a voice assistant:

- Full privacy: nothing goes to the "cloud"

- Non-shitty microphones and processing: i want to be able to be heard without having to yell, repeat, or correct

- No wake words: it should listen to everything, process it, and understand when it's being addressed. Since everything is private and local, this is now doable

- Conversational: it should understand when I've finished talking and be able to be interrupted, all with low latency

- Non-stupid: it's 2024, and alexa and siri and google are somehow absolutely abysmal at doing even the basics

- Complete: i don't want to use an app to get stuff configured. I want everything to be controlled via voice

replies(5): >>42468394 #>>42468471 #>>42468967 #>>42470013 #>>42471806 #

2. danparsonson ◴[20 Dec 24 04:55 UTC] No.42468394[source]▶

>>42468351 (TP) #

> No wake words: it should listen to everything, process it, and understand when it's being addressed

Even humans struggle with this one - that's what names are for!

replies(2): >>42468438 #>>42481564 #

3. antonyt ◴[20 Dec 24 05:07 UTC] No.42468438[source]▶

>>42468394 #

Yeah, I’m having a hard time imagining how no-wake-word could work in practice.

replies(3): >>42468837 #>>42470838 #>>42473855 #

4. wild_egg ◴[20 Dec 24 05:14 UTC] No.42468471[source]▶

>>42468351 (TP) #

How much are you willing to pay though? Full privacy means powerful enough hardware to do everything else on the list on-device and _quickly_. I don't know that most people have the budget for that

5. fragmede ◴[20 Dec 24 06:44 UTC] No.42468837{3}[source]▶

>>42468438 #

after setting up the system, if I say "turn the ceiling lights to 20%", who else would be changing the lights?

But also, post-fix wake word would also be natural if it was recording all the time. "turn on the lights, Google", for instance

replies(2): >>42472751 #>>42476125 #
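
The post-fix wake word idea above can be sketched roughly as follows, assuming the always-on recorder already hands over a text transcript of the last utterance. The function name `extract_postfix_command` and the default wake word are made up for illustration; nothing here reflects how any real assistant does it.

```python
import re
from typing import Optional


def extract_postfix_command(transcript: str, wake_word: str = "google") -> Optional[str]:
    """Return the text spoken before a trailing wake word, or None.

    Assumption: `transcript` is one already-transcribed utterance taken
    from a rolling buffer; speech-to-text itself is out of scope here.
    """
    # Lowercase and keep only word-like tokens, dropping punctuation.
    words = re.findall(r"[a-z0-9%']+", transcript.lower())
    # Only act when the utterance ENDS with the wake word and
    # there is something in front of it to treat as the command.
    if len(words) < 2 or words[-1] != wake_word.lower():
        return None
    return " ".join(words[:-1])


print(extract_postfix_command("turn on the lights, Google"))
print(extract_postfix_command("Google is a search engine"))
```

The first call yields the command text and the second yields `None`, because the wake word is not at the end of the utterance.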

6. nissarup ◴[20 Dec 24 07:17 UTC] No.42468967[source]▶

>>42468351 (TP) #

Looks like you are in the market for a butler.

Especially your last point will, IMO, not be possible for a long time.

7. Lanolderen ◴[20 Dec 24 10:52 UTC] No.42470013[source]▶

>>42468351 (TP) #

I'd imagine with 1-2 TVs constantly talking, general conversations and other random noises it'd get expensive quick. Definitely closer to a rack than a RaspPi or old laptop, hardware-wise. Also add to that more/better mics for coverage and the complexity of it guessing whether you're asking it or your SO to remind you to buy toothpaste... It can probably be done by tracking who's home, who's in the room with the speaker, who the speaker is, etc., but it all adds cost.

8. ethbr1 ◴[20 Dec 24 13:18 UTC] No.42470838{3}[source]▶

>>42468438 #

Like that really annoying friend who jumps in every other sentence with "Well actually..."

replies(1): >>42472167 #

9. micromacrofoot ◴[20 Dec 24 15:17 UTC] No.42471806[source]▶

>>42468351 (TP) #

without a wake word that's a lot of compute unless you live alone and don't watch tv or listen to music

they even used a wake word in star trek fwiw

10. marcosdumay ◴[20 Dec 24 15:56 UTC] No.42472167{4}[source]▶

>>42470838 #

I have a coworker who set up an Alexa a year or so ago. I don't know what the issue was, but it would jump into Teams meetings after every noise in his house.

11. TheCoelacanth ◴[20 Dec 24 17:04 UTC] No.42472751{4}[source]▶

>>42468837 #

Someone in a TV show that you're watching?

replies(1): >>42479383 #

12. lukifer ◴[20 Dec 24 19:06 UTC] No.42473855{3}[source]▶

>>42468438 #

This is one advantage of a system with a constrained set of commands/grammars, as opposed to the Alexa/Siri model of trying to process all arbitrary text while in active mode. It can simply ignore/discard any invocations which don't match those specific grammars (and no need to wait to confirm that the device is awake).

"Computer, turn lights to 50%" -> "turn lights to fifty percent" -> {action: "lights", value: 50}

"My new computer has a really beefy graphics card" -> "has a really beefy graphics card" -> {action: null}

replies(1): >>42475451 #
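
A minimal sketch of the constrained-grammar approach described above, in Python. It assumes speech has already been transcribed to text; the wake word constant, the single `lights` grammar, and the helper names (`words_to_int`, `parse`) are all hypothetical and only mirror the two examples in the comment.

```python
import re
from typing import Optional

WAKE_WORD = "computer"  # assumption: taken from the example utterances

_UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
_TENS = {w: (i + 2) * 10 for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}

# One (action, pattern) pair per supported command; anything else is discarded.
GRAMMARS = [
    ("lights", re.compile(
        r"^turn (?:the )?lights to (?P<value>[a-z0-9 ]+?) ?(?:percent|%)$")),
]


def words_to_int(text: str) -> Optional[int]:
    """Parse digits or simple spelled-out numbers such as 'fifty five'."""
    if text.isdigit():
        return int(text)
    if text in ("hundred", "a hundred", "one hundred"):
        return 100
    total = 0
    for word in text.split():
        if word in _TENS:
            total += _TENS[word]
        elif word in _UNITS:
            total += _UNITS[word]
        else:
            return None  # unknown word: not a number we understand
    return total


def parse(utterance: str) -> dict:
    """Map an utterance to an action dict, or {'action': None} if no grammar fits."""
    text = " ".join(re.sub(r"[^a-z0-9% ]+", " ", utterance.lower()).split())
    parts = re.split(rf"\b{WAKE_WORD}\b", text, maxsplit=1)
    if len(parts) < 2:
        return {"action": None}
    command = parts[1].strip()  # only the text after the wake word is considered
    for action, pattern in GRAMMARS:
        match = pattern.match(command)
        if match:
            value = words_to_int(match.group("value"))
            if value is not None and 0 <= value <= 100:
                return {"action": action, "value": value}
    return {"action": None}


print(parse("Computer, turn lights to 50%"))
print(parse("My new computer has a really beefy graphics card"))
```

The second utterance contains the wake word but matches no grammar, so it is dropped without the device ever needing a separate "awake" state.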

13. danparsonson ◴[20 Dec 24 23:21 UTC] No.42476125{4}[source]▶

>>42468837 #

Sure, if the system is set up to only respond to very specific commands that humans would not respond to, I guess that could work. I was thinking more about the other way around, where a person might speak to someone else in the room and be overheard and acted upon - "turn on the lights!" could be a command for the computer controlling the room, or the human standing next to the Christmas tree, for example.

14. joshstrange ◴[21 Dec 24 12:53 UTC] No.42479383{5}[source]▶

>>42472751 #

I’ve never had Alexa control a device via a TV show’s audio, but playing back a video of me testing my home automation (“Alexa, do X”) triggered my lights.

I’d love a no-wake-word world where something locally was always chewing on what you said but I’m not sure how well it would work in practice.

I think it would only take 1-2 instances of it hearing “Hey, who turned off the lights?” in a show and turning off my lights for real (and scaring the crap out of me) for me to give up on it. Doctor Who isn’t particularly scary, but if I was watching Silence in the Library and that line turned off my lights I’d be spooked, and it would take me a hot minute to realize what happened.

15. lxe ◴[21 Dec 24 19:15 UTC] No.42481564[source]▶

>>42468394 #

Wake words are different from "listen to everything until name is called". A wake word is needed for both privacy and technical reasons -- you can't just have alexa beaming everything it hears to amazon. So instead it uses a local lightweight "dumb" system to listen to specific words only.

That's exactly why there's massive latencies between command recognition, processing, and execution.

Imagine if it had sub-ms response to "assistant, add uuh eggs and milk to the shopping list... actually no just eggs sorry"

replies(1): >>42485648 #

16. danparsonson ◴[22 Dec 24 11:10 UTC] No.42485648{3}[source]▶

>>42481564 #

Sure OK, maybe it's a beneficial side effect then. However you look at it, trying to get the computer to decide when you are addressing it, without using a name of some sort, could be a very challenging problem to solve, one that even humans struggle with. Surely you've been in a situation where you say something to a room and multiple people think you're talking to them? To borrow an example from elsewhere in the thread, if you say "turn on the lights", are you talking to the computer controlling the room lights, or the human standing next to the Christmas tree?

> Imagine if it had sub-ms response to "assistant, add uuh eggs and milk to the shopping list... actually no just eggs sorry"

Could you elaborate on that? What if that were true?