
177 points akadeb | 2 comments

Hi HN! Last year the project I launched here got a lot of good feedback on creating speech-to-speech AI on the ESP32. Recently I revamped the whole stack, iterated on that feedback, and made the project fully open source: all of the client, hardware, and firmware code.

This GitHub repo turns an ESP32-S3 into a realtime AI speech companion using the OpenAI Realtime API, Arduino WebSockets, Deno Edge Functions, and a full-stack web interface. You can talk to your own custom AI character, and it responds instantly.

I couldn't find a resource that helped set up a reliable, secure WebSocket (WSS) speech-to-speech AI service. While there are several useful Text-To-Speech (TTS) and Speech-To-Text (STT) repos out there, I believe none gets Speech-To-Speech right. OpenAI launched an embedded repo late last year which sets up WebRTC with ESP-IDF. However, it's not beginner-friendly and doesn't have a server-side component for business logic.

This repo is an attempt at solving the above pains and creating a great speech-to-speech experience on Arduino, with secure WebSockets and edge servers (Deno/Supabase Edge Functions) for fast global connectivity and low latency.

ianbicking No.43764043
What's been your experience with the Realtime API? I've been doing LLM with voice, but haven't really given it a try – the price is so high, and it feels like it's much harder to control. Specifically that you just get one system prompt and then the model takes over entirely. (Though looking at the API, I see you can inject text and do some other things to play around with the session.)
replies(1): >>43772559 #
1. akadeb No.43772559
I agree, it's still pricy. The cost works out better with `gpt-4o-mini-realtime-preview-2024-12-17`.

Yep, it's constrained to the system prompt, but I pass in conversation history with each new session to keep it relevant. It also supports tool calling, which is clutch.
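To make that concrete, here's a rough sketch of how history and tools could be folded into a Realtime `session.update` event. This is not the repo's actual code: `Turn`, `buildInstructions`, `buildSessionUpdate`, and the `maxTurns` cutoff are names I made up for illustration, and the event fields follow my reading of the Realtime API docs, so double-check them against the current reference.

```typescript
// Hypothetical shape for one stored conversation turn (an assumption, not from the repo).
interface Turn {
  role: "user" | "assistant";
  text: string;
}

// Function tool definition in the flat shape used for Realtime session tools
// (assumed from the API docs; verify before relying on it).
interface FunctionTool {
  type: "function";
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema describing the arguments
}

interface SessionUpdateEvent {
  type: "session.update";
  session: {
    instructions: string;
    tools: FunctionTool[];
    tool_choice: "auto";
  };
}

// Append the most recent turns to the base system prompt as a plain transcript.
function buildInstructions(
  basePrompt: string,
  history: Turn[],
  maxTurns: number = 10,
): string {
  // slice(-0) would return the whole array, so guard the zero case explicitly.
  const recent = maxTurns > 0 ? history.slice(-maxTurns) : [];
  if (recent.length === 0) return basePrompt;
  const transcript = recent.map((t) => `${t.role}: ${t.text}`).join("\n");
  return `${basePrompt}\n\nEarlier conversation:\n${transcript}`;
}

// Build the event to send once the WebSocket to the Realtime API is open.
function buildSessionUpdate(
  basePrompt: string,
  history: Turn[],
  tools: FunctionTool[],
  maxTurns: number = 10,
): SessionUpdateEvent {
  return {
    type: "session.update",
    session: {
      instructions: buildInstructions(basePrompt, history, maxTurns),
      tools,
      tool_choice: "auto",
    },
  };
}
```

On the server you'd send `JSON.stringify(buildSessionUpdate(...))` over the socket right after it opens. Capping the history at a few turns keeps the prompt (and the bill) from growing with every session.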

Have you tried Hume AI? They've got a neat suite of APIs that give you more control over each session.

replies(1): >>43776898 #
2. ianbicking No.43776898
Hume has been on my radar for a long time, but I've never actually used their products. They keep coming out with new lines and yet I never see anyone talk about them... I'm not sure why? Though it's so hard to figure out their offerings, and some seem to actually be wrappers around other LLMs...

Do you know what Hume's latency is like? The completely vertically integrated Realtime API is pretty compelling because of that latency, but it's not as clear to me how they would make that all work with their hybrid system.