
425 points karimf | 1 comment
miki123211 ◴[] No.45656279[source]
> Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.

I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.

AFAIK, ChatGPT's voice mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't also sound Indian), and assuming ethnicity / biasing based on accents.

It doesn't seem that impossible to me that some of these behaviors have been aligned out of these models out of an abundance of caution.
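For what it's worth, the acoustic cue in question (pitch) is trivial to extract from raw audio, which supports the "aligned out, not incapable" reading. A minimal sketch below estimates fundamental frequency with plain autocorrelation on a synthesized tone; the function name and thresholds are my own, not anything from OpenAI's stack:

```python
import math

def estimate_pitch(samples, sample_rate, fmin=80.0, fmax=500.0):
    """Naive autocorrelation pitch estimate over a plausible voice range."""
    n = len(samples)
    lag_min = int(sample_rate / fmax)          # shortest period considered
    lag_max = min(int(sample_rate / fmin), n - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max):
        # Correlate the signal with a lagged copy of itself; the lag that
        # maximizes this is (approximately) the pitch period in samples.
        corr = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag if best_lag else 0.0

# Synthesize 100 ms of a 220 Hz tone (a "low voice") at 8 kHz and recover it.
sr = 8000
tone = [math.sin(2 * math.pi * 220 * t / sr) for t in range(sr // 10)]
print(estimate_pitch(tone, sr))  # close to 220 Hz
```

A model with raw-audio input has this information available almost for free, so declining to answer "am I speaking in a high voice?" is plausibly a policy choice rather than a perception gap.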

replies(7): >>45656408 #>>45656467 #>>45656667 #>>45657021 #>>45657291 #>>45658995 #>>45665432 #
tsol ◴[] No.45656667[source]
Did they respond differently depending on what race they thought you were? I'm surprised they would even do that honestly. I thought they were trained on text conversations which presumably wouldn't have any of that to learn from.
replies(4): >>45656799 #>>45656985 #>>45657478 #>>45664768 #
vessenes ◴[] No.45664768[source]
Pre-nerf, the 4o voice model had a wide range of expressivity, and it would match the affect (it still tries to do this) and idiolect of listeners if asked. Nowadays there's a list of accents that are considered "hate-ish" and a list that aren't.

I will elide the rant inside me that West Coast 20-somethings get to decide whether speaking in a certain accent is racist or "bad". But it's a heartfelt rant.