A few relevant links:
- Blog for details: https://homebrew.ltd/blog/llama-learns-to-talk
- Code: https://github.com/homebrewltd/ichigo
- Run locally: https://github.com/homebrewltd/ichigo-demo/tree/docker
- Demo on a single 3090: https://ichigo.homebrew.ltd/
To clarify: you can enable transcription to see what Ichigo says, but Ichigo's design goes directly from audio to speech representations without ever transcribing the user's input to text. This keeps interactions fast, though it does mean the user's spoken input is never available as a transcript.
The flow we use is Speech → Encoder → Speech Representations → LLM → Text → TTS. By skipping the transcription step on the input side, we're able to speed things up and focus on the verbal experience.
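For the curious, here's a minimal sketch of that flow in Python. Every function in it is a hypothetical stub for illustration, not the actual ichigo API:

```python
# Hypothetical sketch of the pipeline above; these are illustrative
# stubs, not the real ichigo code or API.
from typing import List

def encode_speech(audio: bytes) -> List[int]:
    # Stand-in for the speech encoder: maps raw audio to discrete
    # speech-representation tokens. Note: no ASR transcription of the
    # user's input happens at this step.
    return [101, 57, 842]  # placeholder token IDs

def llm_generate(speech_tokens: List[int]) -> str:
    # Stand-in for the LLM: consumes speech tokens directly and
    # produces a text reply.
    return "Hello! How can I help?"

def synthesize(text: str) -> bytes:
    # Stand-in for TTS: turns the text reply into audio.
    return text.encode()  # placeholder "audio"

def respond(audio: bytes) -> bytes:
    # Speech -> Encoder -> Speech Representations -> LLM -> Text -> TTS
    return synthesize(llm_generate(encode_speech(audio)))
```

The point the sketch makes: text only appears on the output side (the LLM's reply, which feeds TTS), never as a transcript of the input audio.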
Hope this clears things up!