Neural audio codecs: how to get audio into LLMs

(kyutai.org)

425 points karimf | 2 comments | 21 Oct 25 12:55 UTC


miki123211 ◴[21 Oct 25 14:28 UTC] No.45656279[source]▶

> Try asking any of them “Am I speaking in a low voice or a high voice?” in a high-pitched voice, and they won’t be able to tell you.

I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.

AFAIK, ChatGPT Voice mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't also sound Indian), and assuming ethnicity / biasing based on accents.

It doesn't seem that impossible to me that some of these behaviors have been aligned out of these models out of an abundance of caution.

replies(7): >>45656408 #>>45656467 #>>45656667 #>>45657021 #>>45657291 #>>45658995 #>>45665432 #

userbinator ◴[22 Oct 25 06:14 UTC] No.45665432[source]▶

>>45656279 #

> accent matching (if you sound Indian, it shouldn't also sound Indian)

Why not? I've found that it helps greatly with mutual intelligibility when both sides are speaking a similar dialect, and the one who can do this switching, switches to that of the one who can't.

(I wish I could also use an Indian accent bidirectionally; would definitely come in handy for those aggravating times I've had to talk to an outsourced customer service department.)

replies(1): >>45666238 #

1. philipallstar ◴[22 Oct 25 08:24 UTC] No.45666238[source]▶

>>45665432 #

This isn't a logic thing, it's a politeness law invented by white liberal western women thing.

replies(1): >>45670543 #

2. xp84 ◴[22 Oct 25 15:23 UTC] No.45670543[source]▶

>>45666238 (TP) #

Exactly this. It fits that worldview to think a literal computer was “being racist” and mocking the user, even just by copying their speech patterns accurately.
