313 points mariano54 | 1 comment

Hey HN, we're Mariano and Anton from ISSEN (https://issen.com), a foreign language voice tutor app that adapts to your interests, goals, and needs.

Demo: https://www.loom.com/share/a78e713d46934857a2dc88aed1bb100d?...

We started this company after struggling to find great tools to practice speaking Japanese and French. Having a tutor can be awesome, but there are downsides: they can be expensive (since you pay by the hour), difficult to schedule, and have a high upfront cost (finding a tutor you like often forces you to cycle through a few that you don’t).

We wanted something that would talk with us, realistically and in full conversations, and actually help us improve. So we built it ourselves. The app relies on a custom voice AI pipeline combining STT (speech-to-text), TTS (text-to-speech), LLMs, long-term memory, interruptions, turn-taking, etc. Getting speech-to-text to work well for learners was one of the hardest parts, especially with accents, multilingual sentences, and noisy environments. We now combine Gemini Flash, Whisper, Scribe, and GPT-4o-transcribe to minimize errors and keep the conversation flowing.
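To illustrate what "combining" several transcription engines could look like, here is a minimal sketch. It is not ISSEN's actual pipeline, which the post does not describe. It assumes each engine has already returned a plain-text transcript, and it simply keeps the transcript that agrees most with the others. The engine names and the `pick_consensus` helper are hypothetical.

```python
from difflib import SequenceMatcher


def pick_consensus(transcripts: dict[str, str]) -> tuple[str, str]:
    """Return (engine, text) for the transcript that best agrees with the rest.

    `transcripts` maps a hypothetical engine name to its transcript.
    Agreement is the average word-level similarity to every other transcript.
    """
    if not transcripts:
        raise ValueError("need at least one transcript")

    def similarity(a: str, b: str) -> float:
        # Compare word sequences, ignoring case.
        return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

    best_engine, best_score = None, -1.0
    for engine, text in transcripts.items():
        others = [t for e, t in transcripts.items() if e != engine]
        score = sum(similarity(text, o) for o in others) / max(len(others), 1)
        if score > best_score:  # ties keep the first engine seen
            best_engine, best_score = engine, score
    return best_engine, transcripts[best_engine]


candidates = {
    "engine_a": "je voudrais un café s'il vous plaît",
    "engine_b": "je voudrais un café s'il vous plaît",
    "engine_c": "je voudrais encore fait si vous plaît",
}
engine, text = pick_consensus(candidates)
print(engine, "->", text)
```

A real system would likely weigh per-engine confidence scores and the learner's target language too. Note also that majority agreement can still "correct" a learner's mistake if every engine normalizes it the same way.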

We didn’t want to focus too much on gamification. In our experience, that leads to users performing well in the app, achieving long streaks and so on, without actually becoming fluent in the language they want to learn.

With ISSEN you instantly speak and immerse yourself in the language, which, while not easy, is a much more efficient way to learn.

We combine this with a word bank and SRS (spaced repetition) flashcards for new words learned in the AI voice chats, which allows very rapid improvement in both vocabulary and speaking skills. We also create custom curriculums for each student based on goals, interests, and preferences, and offer fully customizable settings like speed, turn-taking, formality, etc.
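For readers unfamiliar with SRS, here is a minimal sketch of the general idea: a word is shown again after an interval that grows each time it is remembered and resets when it is forgotten. This is a generic interval-doubling scheme, not ISSEN's scheduler, which the post does not describe. The `Card` and `review` names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Card:
    word: str
    due: date               # next day the card should be shown
    interval_days: int = 0  # current gap between reviews


def review(card: Card, remembered: bool, today: date) -> Card:
    """Update the card after a review (simplified doubling rule, an assumption)."""
    if remembered:
        # First success schedules 1 day out, then 2, 4, 8, ...
        card.interval_days = max(1, card.interval_days * 2)
    else:
        # A miss resets the card so it comes back the next day.
        card.interval_days = 1
    card.due = today + timedelta(days=card.interval_days)
    return card


card = Card(word="猫", due=date(2025, 1, 1))
review(card, remembered=True, today=date(2025, 1, 1))   # due 2025-01-02
review(card, remembered=True, today=date(2025, 1, 2))   # due 2025-01-04
review(card, remembered=False, today=date(2025, 1, 4))  # due 2025-01-05
print(card)
```

Production schedulers usually also track a per-card ease factor, which is omitted here for brevity.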

App: https://issen.com (works on web, iOS, Android)

Pricing: 20 min free trial, $20–29/month (depending on duration and specific geography)

We’d love your feedback on the tech, the UX, or what you’d want from a tool like this. Thanks!

1. ianbicking No.44392105
I've been thinking and playing slightly with this concept myself. A few thoughts:

1. Using a standard transcription service is pretty tricky because it's going to correct the user's speech. Or make it incorrect! Standard transcription is predicated on the speaker saying things correctly.

2. I've tried sending the audio directly to OpenAI to address this issue. I can't say if it works or not. It's very hard to test or understand a system without a transcript as a source of truth!

3. I'd like to learn a new language as a beginner, and all of these AI systems work poorly for this. It's great to immerse the learner in the language, but if you know NOTHING then it's not that helpful.

4. Language learning needs to be MUCH more multimodal than a standard chat. Especially as a beginner.

5. The AI should be generating translations and explanations alongside its responses. I'd like to be able to inspect everything the AI says (in the language I'm learning) to understand it.

6. Emoji would be another easy way to annotate the text.

7. I think giving the user/AI a subject to talk about would be helpful. Again, a subject that is not language-based would be great, like an image or something.

8. As a very new learner I would like an experience where I respond in my native language and then I'm told how to translate this to the language I'm learning. This should include a pronunciation guide. Then I should repeat the phrase I'm given.

9. I should still be able to ask questions in my native language and probably get a response in my native language. But with some prompting the AI should be able to distinguish these two cases.

10. For low latency it's nice if you produce the spoken text quickly, but you still have the opportunity to get the LLM to produce _more_ material immediately after. This is where things like translations can be produced.

11. You probably don't have timestamps on your TTS, but if you did and could highlight words as they were spoken that would be _great_. Probably worth choosing a TTS provider with that in mind.