
177 points akadeb | 6 comments

Hi HN! Last year the project I launched here got a lot of good feedback on creating speech-to-speech AI on the ESP32. Recently I revamped the whole stack, iterated on that feedback, and made our project fully open source: all of the client, hardware, and firmware code.

This GitHub repo turns an ESP32-S3 into a realtime AI speech companion using the OpenAI Realtime API, Arduino WebSockets, Deno Edge Functions, and a full-stack web interface. You can talk to your own custom AI character, and it responds instantly.

I couldn't find a resource that helped set up a reliable, secure websocket (WSS) AI speech-to-speech service. While there are several useful Text-To-Speech (TTS) and Speech-To-Text (STT) repos out there, I believe none gets Speech-To-Speech right. OpenAI launched an embedded repo late last year which sets up WebRTC with ESP-IDF. However, it's not beginner-friendly and doesn't have a server-side component for business logic.

This repo is an attempt at addressing those pain points and creating a great speech-to-speech experience on Arduino with Secure Websockets, using Edge Servers (with Deno/Supabase Edge Functions) for fast global connectivity and low latency.
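
As a rough illustration of the relay step (a sketch, not code from the repo): an edge function sitting between the device and the OpenAI Realtime API has to wrap each audio chunk arriving over the websocket in a JSON client event. The Realtime API's `input_audio_buffer.append` event carries base64-encoded audio. The helper names `toBase64` and `toAppendEvent`, and the assumption that the device sends raw PCM bytes, are hypothetical.

```typescript
// Hypothetical helpers for an edge relay; names and framing are assumptions,
// not taken from the repo.

const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard base64 (RFC 4648) encoding of a byte array, with "=" padding.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i];
    const b1 = i + 1 < bytes.length ? bytes[i + 1] : 0;
    const b2 = i + 2 < bytes.length ? bytes[i + 2] : 0;
    out += B64[b0 >> 2];
    out += B64[((b0 & 0x03) << 4) | (b1 >> 4)];
    out += i + 1 < bytes.length ? B64[((b1 & 0x0f) << 2) | (b2 >> 6)] : "=";
    out += i + 2 < bytes.length ? B64[b2 & 0x3f] : "=";
  }
  return out;
}

// Wrap one raw audio chunk from the device (assumed to be PCM bytes) in the
// Realtime API client event that appends audio to the input buffer.
function toAppendEvent(chunk: Uint8Array): string {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: toBase64(chunk),
  });
}
```

In a relay like this, the edge function would call `toAppendEvent` on every binary frame from the ESP32 and forward the resulting string upstream, which keeps the API key on the server rather than on the device.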

1. behnamoh ◴[] No.43763552[source]
am I the only one who finds the unnecessarily positive vibes of OpenAI realtime voices unrealistic, too much, and borderline creepy?
replies(4): >>43763674 #>>43764336 #>>43767803 #>>43771777 #
2. mickael-kerjean ◴[] No.43763674[source]
Yep, and having it in a child's toy is way beyond the border of creepy
replies(2): >>43765442 #>>43772466 #
3. 3np ◴[] No.43765442[source]
More so from the consent and privacy angle.
4. scyzoryk_xyz ◴[] No.43767803[source]
You’re not the only one, same here.

I believe there will be interest in extracting insights from speech-related fields, performing arts etc. Kind of like how there was this transfer of design principles in the '90s and '00s from traditional typographers, letterform revivals, print techniques.

It’ll be interesting to see an evolution of expectations and culture emerge around AI voices depending on role. Maybe we’ll see these positive voice vibes as silly and naive the same way we see MySpace aesthetics today?

5. mst ◴[] No.43771777[source]
OpenAI stuff in general seems (to me, at least) to be overly positive and confident in terms of how it replies.

While I make no foolish claims that it's perfect, I've found Claude feels much less arrogant. I was genuinely appreciative when one of its replies started with an analysis of the first half of my question (accurate; of course I checked primary sources to verify that), and then for the more obscure second half said "I'm not sure if I can answer that without hallucinating, but here's some stuff you could try researching."

Certainly Claude's tone and "attitude" (FSVO) work much better for me than those of any other LLM I've tried, though mileage will, of course, vary.

(I have zero connection to the company and am still on a free account, I'm just quietly impressed relative to the competition)

6. akadeb ◴[] No.43772466[source]
Currently our device is a toy accessory, and for children we are strictly focusing on `Story mode`, where adventure stories and fairy tales feel more engaging. I think there's value in getting the AI to create epic stories consistently.