Even simple skills can't understand what I'm asking 60% of the time. For maybe the first two years after launch everything seemed to work pretty well, but since then it's been a frustrating decline.
Currently mine are relegated to timers and music, and they can't even manage those half the time anymore.
I believe it boils down to two main issues:
- The narrow AI systems used for intent inference have not scaled with the product's feature set.
- Amazon is stuck: it can't significantly improve the product with general-purpose AI because of the cost.
The first point is that the speech-to-intent algorithms currently in production are quite basic, likely based on the state of the art from 2013. Initially, there were few features available, so the device was fairly effective at inferring what you wanted from a limited set of possibilities. Over time, Amazon introduced more and more features to choose from, but the devices didn't get any smarter. As a result, mismatches between actual intent and inferred intent became more common, giving the impression that the device is getting dumber. In truth, it’s probably getting somewhat smarter, but not enough to compensate for the increasing complexity over time.
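Here's a toy illustration of that failure mode, assuming a crude bag-of-words matcher (real systems are fancier, but the geometry is the same): with three intents the right one wins by a wide margin; pile on overlapping intents and the margin between the top two candidates collapses, which is exactly when misfires start.

```python
# Toy illustration (not Amazon's actual pipeline): a bag-of-words
# intent matcher. With a handful of intents, the best match wins by
# a wide margin; as overlapping intents pile up, the margin between
# the top two candidates shrinks and misfires get more likely.
from collections import Counter

def similarity(a, b):
    """Crude token-overlap score between an utterance and an intent phrase."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ta & tb).values())  # shared tokens, counted with multiplicity
    return overlap / max(len(a.split()), len(b.split()))

def top_two_margin(utterance, intents):
    """Gap between the best and second-best intent score."""
    scores = sorted((similarity(utterance, i) for i in intents), reverse=True)
    return scores[0] - scores[1]

small = ["set a timer", "play some music", "what is the weather"]
large = small + [
    "set an alarm", "set a reminder", "cancel the timer",
    "play the news", "play a podcast", "pause the music",
    "what is the time", "what is on my calendar",
]

utterance = "set a timer for ten minutes"
print(top_two_margin(utterance, small))  # 0.5: comfortable margin
print(top_two_margin(utterance, large))  # ~0.17: margin collapses with overlap
```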
The second point is that, clearly, it would be relatively straightforward to create a much smarter Alexa: simply delegate the intent detection to an LLM. However, Amazon can’t do that. By 2019, there were already over 100 million Alexa devices in circulation, and it’s reasonable to assume that number has at least doubled by now. These devices are likely sold at a low margin, and the service is free. If you start requiring GPUs to process millions of daily requests, you would need an enormous, costly infrastructure, which is probably impossible to justify financially—and perhaps even infeasible given the sheer scale of the product.
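To put rough numbers on that, here's a back-of-envelope sketch. Every figure below is an assumption chosen for illustration (the 200M device count just doubles the 2019 number above); real traffic, token counts, and inference prices could differ by an order of magnitude in either direction.

```python
# Back-of-envelope cost estimate; all inputs are assumptions.
devices = 200_000_000          # assumed installed base (2x the 2019 figure)
requests_per_day = 5           # assumed average requests per device
tokens_per_request = 500       # assumed prompt + response tokens
cost_per_million_tokens = 1.0  # assumed blended $/1M tokens for a small LLM

daily_tokens = devices * requests_per_day * tokens_per_request
daily_cost = daily_tokens / 1_000_000 * cost_per_million_tokens
print(f"{daily_cost:,.0f} USD/day, {daily_cost * 365 / 1e9:.1f}B USD/year")
# ~500,000 USD/day, ~0.2B USD/year under these assumptions; against
# a free service running on low-margin hardware, that's a hard sell.
```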
My prediction is that Amazon cannot save the product, and it will die a slow death. It will probably keep working for years but will be treated by most users as a "dumb" device capable of little more than setting alarms and timers and reporting the weather.
If you want Jarvis-like intelligence to control your home automation system, the vision of a local assistant running local AI on an efficient GPU, as presented by HA, has the best chance of succeeding. Beyond the privacy benefits of processing everything locally, the primary reason this approach may become common is that the infrastructure scales linearly with the install base: every unit sold brings its own compute.
If you had a cloud-based solution using Echo-like devices, the problem is that you’d need to scale your cloud infrastructure as you sell more devices. If the service is good, this could become a major challenge. In contrast, if you sell an expensive box with an integrated GPU that does everything locally, you deploy the infrastructure as you sell the product. This eliminates scaling issues and the risks of growing too fast.
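A toy sketch of the difference, with invented numbers (the throughput figure and request rate below are illustrative, not anyone's real capacity plan):

```python
# Invented numbers: how much inference capacity the *vendor* operates
# under each model as the fleet grows.
def cloud_gpus_needed(devices):
    """Cloud model: vendor-side capacity grows linearly with devices sold."""
    requests_per_day = devices * 5          # assumed 5 requests/device/day
    per_gpu_per_day = 500_000               # assumed per-GPU throughput
    return -(-requests_per_day // per_gpu_per_day)  # ceiling division

def local_gpus_needed(devices):
    """Local model: each box ships its own accelerator; the vendor runs none."""
    return 0

for fleet in (1_000_000, 10_000_000, 100_000_000):
    print(f"{fleet:>11,} devices: cloud {cloud_gpus_needed(fleet):>5}, "
          f"local {local_gpus_needed(fleet)}")
# The cloud vendor's capacity, cost, and ops risk all track fleet size;
# the local vendor ships the capex out the door with every sale.
```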
Local GPU doesn't make sense for some of the same reasons you list. First, hardware requirements are changing rapidly. Why would I spend, say, $500 on a local GPU setup when in two years the LLMs people want to run on it will slow to a crawl on its limited resources? It would probably make more sense to rent a GPU in the cloud and upgrade as new generations come out.
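To make that concrete with invented figures (the hourly rate, lifespan, and usage below are assumptions, not real prices):

```python
# Rough amortization from the consumer's perspective; all inputs assumed.
local_setup = 500.0        # assumed up-front cost of a local GPU box, USD
useful_life_years = 2      # the obsolescence horizon assumed above
cloud_rate = 0.50          # assumed on-demand GPU rate, USD/hour
minutes_per_day = 10       # assumed actual voice-assistant compute time

local_per_year = local_setup / useful_life_years
cloud_per_year = cloud_rate * (minutes_per_day / 60) * 365
print(f"local ~${local_per_year:.0f}/yr, cloud ~${cloud_per_year:.0f}/yr")
# ~$250/yr vs ~$30/yr under these assumptions, and the rented GPU
# gets upgraded for free as new generations ship.
```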
Amazon has the opposite situation: their hardware and infrastructure are upgraded en masse, so the economics are different. Also, while your GPU idles at 20-30W when you aren't home, they can run their resources at 100% utilization because their GPUs aren't limited to one customer at a time (rough math on that below). Plus they can always offload the processing by contracting OpenAI or similar, and Google is in an even better position to do this. Running a local LLM today doesn't make a lot of sense, but it probably will at some point, maybe in 10 years. I base this on the fact that the requirements for a device like a voice assistant are bounded, so at some point the hardware and software will catch up. We saw this with smartphones: you can now go 5 years without upgrading and things still work fine, which wasn't the case 10 years ago.
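Here's the utilization gap as a quick back-of-envelope (the active-minutes and cloud-utilization figures are assumptions):

```python
# Utilization math behind the "idle at 20-30W" point; numbers assumed.
active_minutes_per_day = 10  # assumed time a home assistant actually computes
local_utilization = active_minutes_per_day / (24 * 60)
print(f"local GPU utilization ~{local_utilization:.1%}")  # ~0.7%

# A shared cloud GPU serving many households back to back can sit near
# full utilization, so its capital cost is split across ~100x more work.
cloud_utilization = 0.70  # assumed sustained multi-tenant utilization
print(f"capex leverage ~{cloud_utilization / local_utilization:.0f}x")
```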
Second, Amazon definitely goofed. They thought people would use the Echos for shopping. They didn't. Literally the only uses for them are alarms and timers, controlling lights and other smart home devices, and answering trivia questions. That's it. What other requirements do you have that don't fall in this category? And the Echos do this stuff incredibly well. They can handle complex variations too, like turning off the lights after a timer goes off, scheduling lights, etc. Amazon is basically giving these devices away, but the way to pivot is to release a line of smart devices that connect to the Echos: smart bulbs and switches, smart locks, and so on. They do have TVs, which you can control with an Echo fairly well (and the integration is getting better). An ecosystem of smart devices that seamlessly interoperate would dwarf what HA has to offer (and I say this as someone who is firmly on HA's side). And this is Amazon's core competency: consumer devices and sales.
If your requirement is that you want Jarvis, it's not the voice-device part that you want. You want what it's connected to: a self-driving car you can summon, DoorDash you can order from by saying "I want a pizza", a phone line so it can call your insurance company and dispute a claim on your behalf.
Now the last piece here is privacy, and it's a doozy. The only way to solve this for Amazon is to figure out some form of encrypted computation (homomorphic encryption, say) that lets your voice prompts be processed without them ever hearing the clear audio. Mathematically possible, practically not so much. But clearly consumers don't give a fuck whatsoever about it. They trust Amazon. That's why there are hundreds of millions of these devices. So in effect, while people on HN think they are the target market for these devices, they are clearly the opposite. We aren't the thought leaders, we are the Luddites. And again, I say this as someone who wishes there were a way to avoid the privacy issue and to have more control over my own tech. I run an extensive HA setup but use Echos for the voice control because, at least for now, they are the best value. I am excited about TFA because it means there might be a better choice soon. But even here, a $59 device is going to have a hard time competing with one that routinely goes on sale for $19.