    The era of open voice assistants

    (www.home-assistant.io)
    878 points _Microft | 15 comments
    1. lxe ◴[] No.42468351[source]
    Here's what I'm looking for in a voice assistant:

    - Full privacy: nothing goes to the "cloud"

    - Non-shitty microphones and processing: i want to be able to be heard without having to yell, repeat, or correct

    - No wake words: it should listen to everything, process it, and understand when it's being addressed. Since everything is private and local, this is now doable

    - Conversational: it should understand when I finished talking, have ability to be interrupted, all with low latency

    - Non-stupid: it's 2024, and alexa and siri and google are somehow absolutely abysmal at doing even the basics

    - Complete: i don't want to use an app to get stuff configured. I want everything to be controlled via voice

    replies(5): >>42468394 #>>42468471 #>>42468967 #>>42470013 #>>42471806 #
    2. danparsonson ◴[] No.42468394[source]
    > No wake words: it should listen to everything, process it, and understand when it's being addressed

    Even humans struggle with this one - that's what names are for!

    replies(2): >>42468438 #>>42481564 #
    3. antonyt ◴[] No.42468438[source]
    Yeah, I’m having a hard time imagining how no-wake-word could work in practice.
    replies(3): >>42468837 #>>42470838 #>>42473855 #
    4. wild_egg ◴[] No.42468471[source]
    How much are you willing to pay though? Full privacy means powerful enough hardware to do everything else on the list on-device and _quickly_. I don't know that most people have the budget for that
    5. fragmede ◴[] No.42468837{3}[source]
    after setting up the system, if I say "turn the ceiling lights to 20%", who else would be changing the lights?

    But also, post-fix wake word would also be natural if it was recording all the time. "turn on the lights, Google", for instance
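
    A minimal sketch of the post-fix idea, assuming the utterance has already been transcribed to text. The function name `extract_command` and the wake-word list are hypothetical, chosen only to match the example above; no real assistant API is implied.

```python
import re

# Hypothetical wake-word list, taken from the example in the comment above.
WAKE_WORDS = ("google",)


def extract_command(utterance, wake_words=WAKE_WORDS):
    """Return the command text if the utterance starts or ends with a wake word.

    Returns None when no wake word is found in either position, so
    ordinary conversation is ignored.
    """
    tokens = re.findall(r"[a-z0-9%']+", utterance.lower())
    if len(tokens) < 2:
        return None
    if tokens[-1] in wake_words:   # post-fix: "turn on the lights, Google"
        return " ".join(tokens[:-1])
    if tokens[0] in wake_words:    # prefix: "Google, turn on the lights"
        return " ".join(tokens[1:])
    return None


if __name__ == "__main__":
    print(extract_command("turn on the lights, Google"))
    print(extract_command("Google, turn on the lights"))
    print(extract_command("turn on the lights"))
```

    The post-fix form only works if the system keeps the whole utterance around until it ends, which is the "recording all the time" condition.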

    replies(2): >>42472751 #>>42476125 #
    6. nissarup ◴[] No.42468967[source]
    Looks like you are in the market for a butler.

    Especially your last point will, IMO, not be possible for a long time.

    7. Lanolderen ◴[] No.42470013[source]
    I'd imagine with 1-2 TVs constantly talking, general conversations and other random noises it'd get expensive quick. Hardware-wise, definitely closer to a rack than a Raspberry Pi or an old laptop. Also add to that more/better mics for coverage and the complexity of it guessing whether you're asking it or your SO to remind you to buy toothpaste... It can probably be done by tracking who's home, who's in the room with the speaker, who the speaker is, etc., but it all costs.
    8. ethbr1 ◴[] No.42470838{3}[source]
    Like that really annoying friend who jumps in every other sentence with "Well actually..."
    replies(1): >>42472167 #
    9. micromacrofoot ◴[] No.42471806[source]
    without a wake word that's a lot of compute unless you live alone and don't watch tv or listen to music

    they even used a wake word in star trek fwiw

    10. marcosdumay ◴[] No.42472167{4}[source]
    I have a coworker who set up an Alexa a year or so ago. I don't know what the issue was, but it would jump into Teams meetings after every noise in his house.
    11. TheCoelacanth ◴[] No.42472751{4}[source]
    Someone in a TV show that you're watching?
    replies(1): >>42479383 #
    12. lukifer ◴[] No.42473855{3}[source]
    This is one advantage of a system with a constrained set of commands/grammars, as opposed to the Alexa/Siri model of trying to process all arbitrary text while in active mode. It can simply ignore/discard any invocations which don't match those specific grammars (and no need to wait to confirm that the device is awake).

    "Computer, turn lights to 50%" -> "turn lights to fifty percent" -> {action: "lights", value: 50}

    "My new computer has a really beefy graphics card" -> "has a really beefy graphics card" -> {action: null}
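
    A rough sketch of that pipeline, assuming speech has already been turned into text. The grammar, the `parse_command` name, and the number-word table are all made up for illustration; a real system would have many more grammars.

```python
import re

# Hypothetical wake word and grammar, mirroring the two examples above.
WAKE_WORD = "computer"
LIGHTS = re.compile(r"^turn (?:the )?lights to (\w+) percent$")
NUMBER_WORDS = {
    "zero": 0, "ten": 10, "twenty": 20, "thirty": 30, "forty": 40,
    "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
NO_MATCH = {"action": None}


def normalize(transcript):
    """Lowercase, spell out '%', and strip punctuation."""
    text = transcript.lower()
    text = re.sub(r"(\d+)\s*%", r"\1 percent", text)
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return text.split()


def parse_command(transcript):
    """Return an action dict, or {"action": None} if no grammar matches."""
    words = normalize(transcript)
    if WAKE_WORD not in words:
        return dict(NO_MATCH)
    # Keep only the text after the first occurrence of the wake word.
    rest = " ".join(words[words.index(WAKE_WORD) + 1:])
    match = LIGHTS.match(rest)
    if not match:
        return dict(NO_MATCH)  # heard the word, but not a known command
    token = match.group(1)
    value = int(token) if token.isdigit() else NUMBER_WORDS.get(token)
    if value is None or not 0 <= value <= 100:
        return dict(NO_MATCH)
    return {"action": "lights", "value": value}


if __name__ == "__main__":
    print(parse_command("Computer, turn lights to 50%"))
    print(parse_command("My new computer has a really beefy graphics card"))
```

    Anything that doesn't fit a grammar is dropped silently, so a false trigger on the word "computer" costs nothing.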

    replies(1): >>42475451 #
    13. danparsonson ◴[] No.42476125{4}[source]
    Sure, if the system is set up to only respond to very specific commands that humans would not respond to, I guess that could work. I was thinking more about the other way around, where a person might speak to someone else in the room and be overheard and acted upon - "turn on the lights!" could be a command for the computer controlling the room, or the human standing next to the Christmas tree, for example.
    14. joshstrange ◴[] No.42479383{5}[source]
    I’ve never had Alexa control a device via a TV show’s audio, but playing back a video of me testing my home automation (“Alexa, do X”) triggered my lights.

    I’d love a no-wake-word world where something locally was always chewing on what you said but I’m not sure how well it would work in practice.

    I think it would only take 1-2 instances of it hearing “Hey, who turned off the lights?” in a show and turning off my lights for real (and scaring the crap out of me). Doctor Who isn’t particularly scary, but if I was watching Silence in the Library and that line turned off my lights I’d be spooked, and it would take me a hot minute to realize what happened.

    15. lxe ◴[] No.42481564[source]
    Wake words are different from "listen to everything until name is called". A wake word is needed for both privacy and technical reasons -- you can't just have alexa beaming everything it hears to amazon. So instead it uses a local lightweight "dumb" system to listen to specific words only.

    That's exactly why there's massive latencies between command recognition, processing, and execution.

    Imagine if it had sub-ms response to "assistant, add uuh eggs and milk to the shopping list... actually no just eggs sorry"