
177 points akadeb | 2 comments

Hi HN! Last year the project I launched here got a lot of good feedback on creating speech-to-speech AI on the ESP32. Recently I revamped the whole stack, iterated on that feedback and made our project fully open-source: all of the client, hardware, and firmware code.

This GitHub repo turns an ESP32-S3 into a realtime AI speech companion using the OpenAI Realtime API, Arduino WebSockets, Deno Edge Functions, and a full-stack web interface. You can talk to your own custom AI character, and it responds instantly.

I couldn't find a resource that helped set up a reliable, secure websocket (WSS) AI speech-to-speech service. While there are several useful Text-To-Speech (TTS) and Speech-To-Text (STT) repos out there, I believe none gets Speech-To-Speech right. OpenAI launched an embedded-repo late last year which sets up WebRTC with ESP-IDF. However, it's not beginner-friendly and doesn't have a server-side component for business logic.

This repo is an attempt at solving the above pains and creating a great speech-to-speech experience on Arduino with Secure Websockets using Edge Servers (with Deno/Supabase Edge Functions) for fast global connectivity and low latency.

drakenot ◴[] No.43763477[source]
Something that really kills the 'effect' of most of the Voice > AI demos that I see is the cold start / latency.

The OpenAI "Voice Mode" is closer, but when we can have near instantaneous and natural back and forth voice mode, that will be a big deal in terms of it feeling magical. Today, it is: say something, awkwardly wait N seconds, then listen to the reply and sometimes awkwardly interrupt it.

Even if the models were no smarter than they are today, if we could crack that "conversational" piece and performance piece, it would be a big difference in my opinion.

replies(3): >>43763547 #>>43764293 #>>43770877 #
Sean-Der ◴[] No.43763547[source]
I think it will always feel unnatural as long as 'AI Speech' is turn based. Right now developers use Voice Activity Detection to detect when the user has stopped talking.
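
For anyone unfamiliar with the idea, here is a rough sketch of how VAD-based turn detection can work. This is not what any particular project or API does; it is a simple energy-threshold approach, and the names (`rms`, `TurnDetector`) and the default threshold and frame count are all assumptions picked for illustration.

```typescript
// Hypothetical energy-based end-of-turn detector.
// Thresholds and names are illustrative assumptions, not taken from any real project.

// Root-mean-square energy of a 16-bit PCM frame, normalized to the range 0..1.
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const normalized = frame[i] / 32768;
    sumSquares += normalized * normalized;
  }
  return Math.sqrt(sumSquares / frame.length);
}

class TurnDetector {
  private heardSpeech = false;
  private silentFrames = 0;
  private readonly energyThreshold: number;
  private readonly silenceFramesToEndTurn: number;

  constructor(energyThreshold: number = 0.02, silenceFramesToEndTurn: number = 25) {
    this.energyThreshold = energyThreshold;
    this.silenceFramesToEndTurn = silenceFramesToEndTurn;
  }

  // Feed one audio frame. Returns true once, on the frame where the
  // user's turn is judged to have ended (speech followed by enough silence).
  push(frame: Int16Array): boolean {
    if (rms(frame) >= this.energyThreshold) {
      // Speech-like energy: the turn is ongoing, so reset the silence counter.
      this.heardSpeech = true;
      this.silentFrames = 0;
      return false;
    }
    // Silence before any speech is just waiting, not the end of a turn.
    if (!this.heardSpeech) return false;
    this.silentFrames += 1;
    if (this.silentFrames >= this.silenceFramesToEndTurn) {
      // Enough trailing silence: end the turn and reset for the next one.
      this.heardSpeech = false;
      this.silentFrames = 0;
      return true;
    }
    return false;
  }
}
```

With 20 ms frames, a value of 25 for `silenceFramesToEndTurn` means waiting about half a second of silence before replying, which is exactly the kind of fixed pause that makes the exchange feel turn based.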

What would be REALLY cool is if we had something that would interrupt you during conversation like talking with a real human.

replies(1): >>43763880 #
1. conductr ◴[] No.43763880[source]
I can see how interruptions would prove even more unnatural and annoying pretty quickly. There's a lot of nuance in knowing how to interrupt properly, and often people who interrupt only do so briefly, then yield, allow the other person to finish, then resume. It's very situational, with tons of nuance. Otherwise, with the current level of sophistication, you'd just have the AI talking over you the entire time, not allowing you to complete your thoughts/questions/commands/etc., and people would quickly get more frustrated and just turn it off.
replies(1): >>43771697 #
2. mst ◴[] No.43771697[source]
I absolutely agree with your analysis wrt current tech - however, I suspect the person you're replying to is talking about "what would be really cool" in terms of it happening in a future where the relevant underpinnings had advanced to the point where it could actually manage the situational/nuance stuff properly.

I almost certainly wouldn't want to use something that tried to implement it now but it's a lovely dream and the state of the art keeps advancing at quite the speed (i.e. faster than I would have predicted, even when I do my best to take into account that it keeps advancing faster than I would have predicted ;).