
216 points by egnehots | 1 comment
lostmsu ◴[] No.41864820[source]
Very cool, but a bit less practical than some alternatives because it does not seem to do request transcription.
replies(1): >>41865868 #
emreckartal ◴[] No.41865868[source]
Actually, it does. You can turn on the transcription feature from the bottom right corner and even type to Ichigo if you want. We didn’t show it in the launch video since we were focusing on the verbal interaction side of things.
replies(1): >>41866837 #
emreckartal ◴[] No.41866837[source]
Ah, I see now.

To clarify, while you can enable transcription to see what Ichigo says, Ichigo's design skips directly from audio to speech representations without creating a text transcription of the user’s input. This makes interactions faster but does mean that the user's spoken input isn't transcribed to text.

The flow we use is Speech → Encoder → Speech Representations → LLM → Text → TTS. By skipping the speech-to-text step on the input side, we're able to speed things up and focus on the verbal experience.
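
For anyone who wants to picture the difference, here's a rough sketch, not our actual code and with made-up component names, of why a user-side transcript never exists in the speech-native flow but falls out for free in a conventional ASR-first flow:

    # Hypothetical sketch contrasting the two flows; none of these names are real Ichigo APIs.
    from typing import Callable

    def speech_native_turn(
        user_audio: bytes,
        encode_speech: Callable[[bytes], list],  # Speech -> Encoder -> speech representations
        llm_generate: Callable[[list], str],     # LLM consumes the representations directly
        tts: Callable[[str], bytes],             # Text -> TTS
    ):
        """Ichigo-style flow: no ASR pass, so no transcript of the user's input."""
        speech_tokens = encode_speech(user_audio)
        reply_text = llm_generate(speech_tokens)
        return tts(reply_text), None             # None: a user transcript is never produced

    def asr_first_turn(
        user_audio: bytes,
        transcribe: Callable[[bytes], str],      # extra speech-to-text pass on the input
        llm_generate: Callable[[str], str],
        tts: Callable[[str], bytes],
    ):
        """Conventional flow: transcribe first, so a user transcript exists for the UI."""
        user_text = transcribe(user_audio)
        reply_text = llm_generate(user_text)
        return tts(reply_text), user_text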

Hope this clears things up!

replies(1): >>41874204 #
lostmsu ◴[] No.41874204[source]
I understand that. The problem is that in many scenarios users would want to see transcripts of what they said alongside the model output. For example, if I have a chat with a model about choosing a place to move to, I would probably also want to review it later. And when I review it, all I will see is: me: [audio recording] / AI: 200-300m. There's no easy way to tell at a glance what the AI's answer was about.
replies(1): >>41878333 #
readyplayeremma ◴[] No.41878333[source]
You can just run Whisper on the conversations as a background job, populating text versions of all the user inputs, so it doesn't interfere with the real-time latency.
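
Something along these lines, as a rough sketch (assumes the openai-whisper package; the conversation/turn structure here is made up, so adapt the field names to whatever Ichigo actually stores):

    # Backfill user-side transcripts after the fact, off the real-time path.
    import whisper

    model = whisper.load_model("base")  # load once, reuse in the background worker

    def backfill_transcripts(conversation):
        for turn in conversation:
            # Only user turns are missing text; the model's replies are already text.
            if turn.get("role") == "user" and not turn.get("text"):
                result = model.transcribe(turn["audio_path"])
                turn["text"] = result["text"]  # approximate transcript, for later review
        return conversation

Loading the model once and transcribing in a worker keeps it entirely off the interactive path.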
replies(1): >>41879825 #
lostmsu ◴[] No.41879825[source]
It's not going to match what the model hears.