313 points mariano54 | 10 comments

Hey HN, we're Mariano and Anton from ISSEN (https://issen.com), a foreign language voice tutor app that adapts to your interests, goals, and needs.

Demo: https://www.loom.com/share/a78e713d46934857a2dc88aed1bb100d?...

We started this company after struggling to find great tools to practice speaking Japanese and French. Having a tutor can be awesome, but there are downsides: they can be expensive (since you pay by the hour), difficult to schedule, and have a high upfront cost (finding a tutor you like often forces you to cycle through a few that you don’t).

We wanted something that would talk with us realistically, in full conversations, and actually help us improve. So we built it ourselves. The app relies on a custom voice AI pipeline combining STT (speech-to-text), TTS (text-to-speech), LLMs, long-term memory, interruptions, turn-taking, etc. Getting speech-to-text to work well for learners was one of the hardest parts, especially with accents, multilingual sentences, and noisy environments. We now combine Gemini Flash, Whisper, Scribe, and GPT-4o-transcribe to minimize errors and keep the conversation flowing.
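The post does not say how the outputs of the different STT engines are combined. As a rough illustration only, one simple way to merge several transcripts is a majority vote over normalized text. The engine names and the `pick_transcript` helper below are hypothetical, not ISSEN's actual pipeline.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different transcripts compare equal."""
    return " ".join(text.lower().split())


def pick_transcript(candidates: dict[str, str]) -> str:
    """Return the transcript most engines agree on.

    `candidates` maps a (hypothetical) engine name to its transcript.
    Ties go to the engine listed first in the dict.
    """
    if not candidates:
        raise ValueError("no candidate transcripts")
    counts = Counter(normalize(t) for t in candidates.values())
    best = max(counts.values())
    for text in candidates.values():
        if counts[normalize(text)] == best:
            return text
    raise AssertionError("unreachable: at least one candidate has the best count")


# Two engines agree (up to case and spacing), one disagrees.
transcripts = {
    "engine_a": "Je voudrais un café",
    "engine_b": "je voudrais un  café",
    "engine_c": "Je voudrais un cafard",
}
chosen = pick_transcript(transcripts)
```

A real system would likely vote at the word level and weigh engines by per-language accuracy; whole-utterance voting is just the smallest version of the idea.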

We didn’t want to focus too much on gamification. In our experience, that leads to users performing well in the app, achieving long streaks and so on, without actually becoming fluent in the language they want to learn.

With ISSEN you instantly speak and immerse yourself in the language, which, while not easy, is a much more efficient way to learn.

We combine this with a word bank and SRS (spaced repetition) flashcards for new words learned in the AI voice chats, which allows very rapid improvement in both vocabulary and speaking skills. We also create a custom curriculum for each student based on goals, interests, and preferences, and offer fully customizable settings such as speed, turn-taking, formality, etc.
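For readers unfamiliar with SRS: the post does not say which scheduling algorithm the flashcards use. The sketch below is a simplified SM-2-style scheduler, shown only to illustrate the idea that correct answers push a card further into the future and failures reset it. The `Card` fields and the `review` function are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Card:
    word: str
    interval_days: int = 1   # days until the next review
    ease: float = 2.5        # growth factor for the interval
    streak: int = 0          # consecutive successful reviews


def review(card: Card, quality: int) -> Card:
    """Return the card's next state after a review graded 0 (blackout) to 5 (perfect)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality < 3:
        # Lapse: start over tomorrow, keep the ease factor unchanged.
        return Card(card.word, 1, card.ease, 0)
    miss = 5 - quality
    ease = max(1.3, card.ease + 0.1 - miss * (0.08 + miss * 0.02))
    streak = card.streak + 1
    if streak == 1:
        interval = 1
    elif streak == 2:
        interval = 6
    else:
        interval = round(card.interval_days * ease)
    return Card(card.word, interval, ease, streak)


# Three perfect reviews, then a failed one.
c1 = review(Card("泳ぐ"), 5)
c2 = review(c1, 5)
c3 = review(c2, 5)
c4 = review(c3, 1)
```

Here the intervals grow from 1 to 6 to 17 days over the three perfect reviews, and the failed review drops the card back to a 1-day interval.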

App: https://issen.com (works on web, iOS, Android)
Pricing: 20 min free trial, $20–29/month (depending on duration and specific geography)

We’d love your feedback on the tech, the UX, or what you’d want from a tool like this. Thanks!

1. masspro ◴[] No.44392491[source]
I don't think I can trust TTS for language learning. I could be internalizing wrong pronunciation, and I wouldn't know. One time I tried Duolingo for Japanese already knowing a bit. To their credit I assumed it was recorded clips, until it read 'oyogu' as something like 'oyNHYAOgu', like it concatenated two syllable clips that don't go together. If I didn't already know, would I be trying to study and replicate that nonsense? So I don't know if I could trust TTS audio for language study regardless of what kind of tech it is. Sure mistakes can be unlearned over time spent immersing, but at much more effort than just not internalizing them in the first place.

Also Japanese specifically has this meme where it literally is a pitch-accent language but many people say it's not and teaching resources ignore it. E.g. 'ima' means either 'now' or 'living room' depending on whether syllable #2 is higher or lower. This clearly only applies to some languages, but it is another dimension where it is even harder for a learner to know there's a mistake. I have to imagine even other Latin languages probably have reading quirks where this could happen to me.

replies(4): >>44393121 #>>44395180 #>>44395290 #>>44395329 #
2. runarberg ◴[] No.44393121[source]
Also a Japanese learner here, albeit a beginner. As I understand it, pitch accent is about stress: languages can stress a syllable with length, volume, pitch, etc. Spanish uses vowel length, Icelandic uses volume, English uses a combination of length and volume, and Swedish (just like Japanese) uses pitch. Just like in English, if you put the wrong stress on a word, the result can range from sounding foreign to being incomprehensible. (Aside: I always remember trying to say the name of the band Duran Duran to an English speaker while putting the stress on the first syllable, as is normal in Icelandic. My listener had no idea what I was saying; it took probably 30 attempts before I was corrected with the right stress.)

I think Japanese is somewhat special, though, for its large number of homonyms (i.e. words that are spelled the same), so speaking with the correct pitch becomes somewhat more important.

replies(1): >>44393542 #
3. glandium ◴[] No.44393542[source]
Somewhat more important, but as someone with decent Japanese who knows about pitch accent but can barely hear the difference in real time, and never actively learned it except for the few well known examples like bridge/chopstick, I don't think it matters all that much. Yes, you'll sound foreign. But you'll be understood nevertheless, in the vast majority of cases.
replies(1): >>44393878 #
4. runarberg ◴[] No.44393878{3}[source]
Speaking of bridge/chopsticks, I created a video to try to spot the difference myself a couple of months ago:

https://imgur.com/KJXanqc

replies(1): >>44395711 #
5. barrell ◴[] No.44395180[source]
Yeah, Japanese TTS is a lot harder than it looks. I’m also building a language learning application, and constantly ran into incorrect readings. Eleven labs, eleven labs v3, OpenAI, play.ht, azure, google, Polly: I’ve tried them all. They are all really bad (more than 1/3 of the expressions had an error in them somewhere).

It _is_ fixable though. It took me about a week, but I have yet to find a mistaken reading now. This also seems to just be the case with Japanese - most tonal languages seem to have the correct tones (I’m not qualified to comment on how natural the tones sound, but I have yet to find a mismatch like in Japanese)

6. jamager ◴[] No.44395290[source]
Yes. AI transcription is great, AI translation is OK (depending on language pair), but TTS is still pretty awful for most languages.
7. mariano54 ◴[] No.44395329[source]
Minimax's new model is quite good. We use their voices for some of our Japanese tutors. The pitch accent is almost perfect.

There are incorrect readings or Chinese readings occasionally, but you can tell when that happens because the furigana are different.

replies(1): >>44395589 #
8. yorwba ◴[] No.44395589[source]
If you have the correct furigana, you could even detect when the TTS model picked the wrong reading and regenerate.

But how do you know the furigana are correct? Unless you start out with fully human-annotated text, you need some automated procedure to add furigana, which pushes the problem from "TTS AI picked the wrong reading" to "furigana AI picked the wrong reading."
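The detect-and-regenerate idea could be sketched roughly as follows. Everything here is an assumption for illustration: `synthesize` and `read_back` are hypothetical callables standing in for a TTS model and for some way of recovering the kana reading from the generated audio (for example, an STT pass), and the expected reading is assumed to come from trusted furigana.

```python
from typing import Callable, Optional


def to_hiragana(text: str) -> str:
    """Map katakana to hiragana so readings compare equal regardless of script."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def synthesize_checked(
    text: str,
    expected_reading: str,
    synthesize: Callable[[str], bytes],   # hypothetical TTS call
    read_back: Callable[[bytes], str],    # hypothetical audio -> kana reading
    max_attempts: int = 3,
) -> Optional[bytes]:
    """Regenerate audio until its reading matches the furigana, or give up with None."""
    target = to_hiragana(expected_reading)
    for _ in range(max_attempts):
        audio = synthesize(text)
        if to_hiragana(read_back(audio)) == target:
            return audio
    return None


# Fake components: the first attempt misreads the word, the second is correct.
attempts = iter([b"bad", b"good"])
fake_tts = lambda text: next(attempts)
fake_reader = lambda audio: "オヨグ" if audio == b"good" else "オヨニャグ"
result = synthesize_checked("泳ぐ", "およぐ", fake_tts, fake_reader)
```

This only moves the trust to `expected_reading` and `read_back`, which is exactly the concern raised above: the check is only as good as the furigana it compares against.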

replies(1): >>44395895 #
9. glandium ◴[] No.44395711{4}[source]
Here's the problem: pitch accent is easy to hear in isolation and/or in comparison. Under real life conditions, in the middle of a sentence, it's a completely different experience. But then you're saved by context. Because candy is most likely not falling from the sky. Homophones that are still ambiguous in context are possible, but a rare occurrence in my experience.
10. mariano54 ◴[] No.44395895{3}[source]
Yes, it pushes the problem, but it's a much easier problem, and models like Gemini flash 2.5 do very well.