(kyutai.org)

425 points karimf | 1 comment | 21 Oct 25 12:55 UTC

1. liqilin1567 [22 Oct 25 08:46 UTC] No.45666380

Out of curiosity, would it be possible to attach pitch, emotion, and tone information as text-based metadata to each word during ASR, so that the ASR output retains this metadata?

Neural audio codecs: how to get audio into LLMs