177 points akadeb | 3 comments

Hi HN! Last year the project I launched here got a lot of good feedback on creating speech-to-speech AI on the ESP32. Recently I revamped the whole stack, iterated on that feedback, and made the project fully open source: all of the client, hardware, and firmware code.

This Github repo turns an ESP32-S3 into a realtime AI speech companion using the OpenAI Realtime API, Arduino WebSockets, Deno Edge Functions, and a full-stack web interface. You can talk to your own custom AI character, and it responds instantly.

I couldn't find a resource that helped set up a reliable, secure websocket (WSS) AI speech to speech service. While there are several useful Text-To-Speech (TTS) and Speech-To-Text (STT) repos out there, I believe none gets Speech-To-Speech right. OpenAI launched an embedded-repo late last year which sets up WebRTC with ESP-IDF. However, it's not beginner friendly and doesn't have a server side component for business logic.

This repo is an attempt at solving the above pains and creating a great speech to speech experience on Arduino with Secure Websockets using Edge Servers (with Deno/Supabase Edge Functions) for fast global connectivity and low latency.
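As a rough illustration of the relay step described above (a sketch, not the repo's actual code): an edge function that receives raw PCM audio from the device over a secure WebSocket could wrap each chunk in an `input_audio_buffer.append` event, which the OpenAI Realtime API accepts with base64-encoded audio. The helper names `toBase64` and `wrapAudioFrame` are hypothetical, and the base64 encoder is written by hand only to keep the example self-contained.

```typescript
// Sketch only: helper names are hypothetical. The event shape
// ({ type: "input_audio_buffer.append", audio: <base64> }) follows the
// OpenAI Realtime API; everything else is an assumption for illustration.

const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode raw bytes as standard base64 with "=" padding.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += B64[b0 >> 2];
    out += B64[((b0 & 3) << 4) | (b1 >> 4)];
    out += has1 ? B64[((b1 & 15) << 2) | (b2 >> 6)] : "=";
    out += has2 ? B64[b2 & 63] : "=";
  }
  return out;
}

// Wrap one chunk of PCM audio from the device in the JSON event that the
// relay would forward upstream over its own WebSocket connection.
function wrapAudioFrame(pcm: Uint8Array): string {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: toBase64(pcm),
  });
}
```

In a relay like this, the device only ever sends binary frames to the edge function, and the API key stays on the server side.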

supermatt ◴[] No.43763765[source]
This looks like so much fun! I have recently gotten into working with electronics, so it seems like a nice little project to undertake.

I noticed that it depends on OpenAI's Realtime API, which got me wondering what open alternatives there are, as I would love a more real-time, Alexa-like device in my home that doesn't contact the cloud. I have only played with software, but the existing solutions have never felt real-time to me.

The only one I could find that seems to truly work in real time is <https://github.com/fixie-ai/ultravox>. It appears to be a model that wires Llama and Whisper together somehow, rather than treating them as separate steps, as is common in other projects.

What other options are available for this kind of real-time behaviour?

replies(3): >>43763841 #>>43763845 #>>43764225 #
1. Sean-Der ◴[] No.43764225[source]
My plan is that Espressif’s WebRTC code[0] will hook up to Pipecat[1], which gets you the freedom to do whatever you want.

The design of OpenAI + WebRTC was to lean on WebRTC as much as possible to make it easier for users.

[0] https://github.com/espressif/esp-webrtc-solution

[1] https://github.com/pipecat-ai/pipecat

replies(2): >>43764405 #>>43773343 #
2. supermatt ◴[] No.43764405[source]
Fantastic! This will save a ton of work
3. akadeb ◴[] No.43773343[source]
Pipecat is awesome! Is it similar to what LiveKit provides?

I think Realtime API adoption would be higher if it were offered on Arduino rather than ESP-IDF, as the latter is not very beginner-friendly. That was one of the main reasons I built this repo using edge functions instead of a direct WebRTC connection.