
456 points ph4evers | 1 comments

I've been working on a little side project that combines Duolingo-like listening comprehension exercises with real content.

Every video is transcribed to get much better transcripts than the closed captions. I filter for high-quality transcripts, and afterwards an LLM selects only plausible segments for the exercises. This seems to work well for quality control and to be reliable enough for these short exercises.

Would love your thoughts!

dicytea ◴[] No.43544683[source]
I've checked out the Japanese one, but I'd say that it's definitely nowhere near "real-world content" IMO. Just the usual torturously slow-paced, artificially dumbed-down dialogue you'd expect out of classroom recordings.

Most of the videos also contain subtitles, which defeats the purpose of the exercises (you can disable the video manually though). Another issue is that some of the words are segmented very unnaturally (e.g. [み][ません]), so it's unclear how you're expected to fill them in.

In the end, if what you really want is "real-world content", then you just need to go out there and find it yourself - it's everywhere.

replies(2): >>43545640 #>>43549745 #
raincole ◴[] No.43545640[source]
> Another issue is that some of the words are segmented very unnaturally

I immediately noticed that too. Are the "gaps" generated by an LLM? I think the model might not understand Japanese very well.

replies(1): >>43546930 #
yorwba ◴[] No.43546930[source]
It's a bit like segmenting "don't see" into "don't" and "see." ません is the negative of the auxiliary ます just as "don't" is the negative of the auxiliary "do." If you have to split Japanese text into words and want to be principled about it, treating ません as a separate word is not a bad way to go about it.

But of course there are other ways, so a "fill in the blank" question with two gaps right next to each other is generally a bad idea.

replies(2): >>43548794 #>>43549888 #
raincole ◴[] No.43549888[source]
The point is not that you can't cut みません into み and ません. The point is that it should be one single gap in the first place.

It's like cutting gaps out of an English sentence like this: I'm [go][ing] to beat the shit out of that guy. Sure we know the logical way to break down 'going' is 'go' and '-ing', but it should be one single gap anyway.
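
To make the "one single gap" idea concrete, here is a minimal sketch in Python of merging neighbouring gaps after segmentation. It is not how the project actually builds its exercises (that isn't described in this thread); the `Token` type and the `merge_adjacent_gaps` and `render` helpers are hypothetical names made up for illustration, and the input is assumed to be a list of already-segmented pieces, each flagged as a gap or as visible text.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One segmented piece of a sentence (hypothetical structure)."""
    text: str
    is_gap: bool  # True if the learner has to fill this piece in


def merge_adjacent_gaps(tokens):
    """Collapse runs of consecutive gap tokens into a single gap."""
    merged = []
    for tok in tokens:
        if merged and tok.is_gap and merged[-1].is_gap:
            # The previous piece is also a gap: extend it instead of
            # opening a second blank right next to it.
            merged[-1] = Token(merged[-1].text + tok.text, True)
        else:
            merged.append(Token(tok.text, tok.is_gap))
    return merged


def render(tokens):
    """Show gaps in the bracket notation used in this thread."""
    return "".join(f"[{t.text}]" if t.is_gap else t.text for t in tokens)


english = [Token("I'm ", False), Token("go", True),
           Token("ing", True), Token(" home.", False)]
japanese = [Token("み", True), Token("ません", True)]

print(render(english))                        # I'm [go][ing] home.
print(render(merge_adjacent_gaps(english)))   # I'm [going] home.
print(render(merge_adjacent_gaps(japanese)))  # [みません]
```

The segmenter can stay as fine-grained as it likes; only the step that decides where the blanks go needs to treat touching gaps as one.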

replies(1): >>43552320 #
1. johnisgood ◴[] No.43552320[source]
Damn, where did that example come from? :P