A few relevant links:
- Blog for details: https://homebrew.ltd/blog/llama-learns-to-talk
- Code: https://github.com/homebrewltd/ichigo
- Run locally: https://github.com/homebrewltd/ichigo-demo/tree/docker
- Demo on a single 3090: https://ichigo.homebrew.ltd/
To clarify: you can enable transcription to see what Ichigo says, but Ichigo's design goes directly from audio to speech representations without ever transcribing the user's input to text. This keeps interactions fast, though it does mean the user's spoken input is never available as a transcript.
The flow we use is Speech → Encoder → Speech Representations → LLM → Text → TTS. By skipping the transcription step on the input side, we're able to speed things up and focus on the verbal experience.
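For the curious, here's a minimal sketch of that flow in Python. Every function in it is a hypothetical stub for illustration, not the actual ichigo API:

```python
# Hypothetical sketch of the pipeline above; these are illustrative
# stubs, not the real ichigo code or API.
from typing import List

def encode_speech(audio: bytes) -> List[int]:
    # Stand-in for the speech encoder: maps raw audio to discrete
    # speech-representation tokens. Note: no ASR transcription of the
    # user's input happens at this step.
    return [101, 57, 842]  # placeholder token IDs

def llm_generate(speech_tokens: List[int]) -> str:
    # Stand-in for the LLM: consumes speech tokens directly and
    # produces a text reply.
    return "Hello! How can I help?"

def synthesize(text: str) -> bytes:
    # Stand-in for TTS: turns the text reply into audio.
    return text.encode()  # placeholder "audio"

def respond(audio: bytes) -> bytes:
    # Speech -> Encoder -> Speech Representations -> LLM -> Text -> TTS
    return synthesize(llm_generate(encode_speech(audio)))
```

The point the sketch makes: text only appears on the output side (the LLM's reply, which feeds TTS), never as a transcript of the input audio.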
Hope this clears things up!