
177 points by akadeb | 1 comment

Hi HN! Last year the project I launched here got a lot of good feedback on building speech-to-speech AI on the ESP32. Recently I revamped the whole stack, iterated on that feedback, and made the project fully open source: all of the client, hardware, and firmware code.

This GitHub repo turns an ESP32-S3 into a realtime AI speech companion using the OpenAI Realtime API, Arduino WebSockets, Deno Edge Functions, and a full-stack web interface. You can talk to your own custom AI character, and it responds instantly.

I couldn't find a resource that helped set up a reliable, secure WebSocket (WSS) AI speech-to-speech service. While there are several useful Text-To-Speech (TTS) and Speech-To-Text (STT) repos out there, I believe none gets speech-to-speech right. OpenAI launched an embedded repo late last year that sets up WebRTC with ESP-IDF. However, it's not beginner-friendly and doesn't have a server-side component for business logic.

This repo is an attempt at solving the above pains and creating a great speech-to-speech experience on Arduino, using secure WebSockets and edge servers (Deno/Supabase Edge Functions) for fast global connectivity and low latency.
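To make the relay idea concrete, here is a rough sketch of the one piece of server-side glue that most setups like this need: taking a binary audio frame that arrives from the device over the WebSocket and wrapping it as a JSON event for the Realtime API. This is not taken from the repo. The helper names `toBase64` and `toAppendEvent` are made up for illustration, and the assumption is that the device sends raw PCM16 bytes and that the upstream accepts `input_audio_buffer.append` events carrying base64 audio, which is how the Realtime API client events are documented as far as I know.

```typescript
// Hypothetical helpers for illustration; not the repo's actual code.
// Assumption: the device sends raw PCM16 audio as binary WebSocket frames.

const B64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Plain base64 encoder so the sketch runs unchanged in Deno, Node, or a browser.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += B64_ALPHABET.charAt(b0 >> 2);
    out += B64_ALPHABET.charAt(((b0 & 0x03) << 4) | (b1 >> 4));
    // Pad with "=" when the final group has fewer than three bytes.
    out += has1 ? B64_ALPHABET.charAt(((b1 & 0x0f) << 2) | (b2 >> 6)) : "=";
    out += has2 ? B64_ALPHABET.charAt(b2 & 0x3f) : "=";
  }
  return out;
}

// Wrap one audio frame as a JSON text message for the upstream socket.
// The event type is assumed from the Realtime API's client event docs.
function toAppendEvent(pcm16Frame: Uint8Array): string {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: toBase64(pcm16Frame),
  });
}
```

In an edge function, the device socket's message handler would call `toAppendEvent` on each binary frame and send the result on the upstream socket. Keeping this on the server means the API key never has to live on the ESP32.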

andruby No.43769886
Really nice! Thank you for including a YouTube video. It's a little unfortunate that you cut the time between your "prompt" and the response. I'm curious whether you were waiting 0.5s or 10s for the response. I think the usability/fun of this stands or falls with that latency.

Maybe it could be combined with fastvoiceagent.cerebrium.ai (discussed 10 months ago: https://news.ycombinator.com/item?id=40805010) for lower latency.

replies(1): >>43772400
1. akadeb No.43772400
Thanks for the feedback. I've attached the raw, unedited video here: https://drive.google.com/file/d/1kEmbVInvUrYFwjddyGL8Rz03c0N... (sorry, the video is a bit long, ~5 min, with some intro about my company :-)