
425 points by karimf | 2 comments
crazygringo:
This is fascinating.

Obviously working directly with audio is vastly more complex than with text.

But it is very exciting to see how part of making LLMs work natively with speech is finding a codec that is maximally efficient at encoding speech.

I even have to wonder whether, at some point, we end up creating a popular voice codec usable with LLMs based not on the Fourier transform or similar, but rather on a set of physical parameters describing vocal cord shape, tongue position, throat/chest/mouth shape, etc.

I can imagine such a model being arrived at statistically (to determine the necessary number of parameters), and then almost becoming "hard-coded" as a standard, since human anatomy doesn't vary much beyond certain ranges.
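
To make the statistical part concrete, a toy version might look like the sketch below: run PCA over frame-level speech features and see how many components cover most of the variance. The MFCC features, the bundled librosa clip, and the 95% threshold are all my own illustrative assumptions, not anything from the article.

    # Rough sketch: estimate how many parameters a speech representation
    # "needs" by counting principal components that explain most of the
    # variance in frame-level features. All choices here are illustrative.
    import numpy as np
    import librosa
    from sklearn.decomposition import PCA

    y, sr = librosa.load(librosa.ex("trumpet"))             # bundled clip; substitute a real speech recording
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T   # (frames, 40) feature matrix

    pca = PCA().fit(feats)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    n_params = int(np.searchsorted(cumulative, 0.95)) + 1   # components needed for 95% of variance
    print(f"{n_params} components explain 95% of the variance")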

I think it's called formant speech encoding, and it would be interesting if LLMs wind up massively advancing that field, since historically it has had more to do with speech synthesis than with audio compression.
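
For what it's worth, the classic formant idea is easy to sketch: drive a periodic glottal source through a few second-order resonators and you get a vowel-ish sound from a handful of numbers. The formant values below (roughly an adult /a/) and all parameter choices are illustrative assumptions on my part, not something from the article.

    # Minimal source-filter / formant synthesis sketch. A periodic glottal
    # source is shaped by a cascade of two-pole resonators; the "codec
    # payload" is just f0 plus a few (frequency, bandwidth) pairs per frame.
    # All numbers here are illustrative assumptions (roughly an adult /a/).
    import numpy as np
    from scipy.signal import lfilter

    fs = 16000                                        # sample rate (Hz)
    f0 = 120                                          # glottal pitch (Hz)
    formants = [(730, 90), (1090, 110), (2440, 170)]  # (center Hz, bandwidth Hz)

    n = int(fs * 0.5)                                 # half a second of audio
    source = np.zeros(n)
    source[::fs // f0] = 1.0                          # crude impulse-train glottal source

    signal = source
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / fs)                  # pole radius from bandwidth
        theta = 2 * np.pi * freq / fs                 # pole angle from center frequency
        a = [1.0, -2 * r * np.cos(theta), r**2]       # two-pole resonator
        signal = lfilter([1.0 - r], a, signal)        # rough per-stage gain scaling

    signal /= np.max(np.abs(signal))                  # normalize; write out or play back to hear it

Articulatory models (tongue/jaw/lip parameters) go a step further than formants, but the appeal is the same: a few slowly varying, physically grounded parameters per frame instead of raw waveform samples.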

1. quinndupont:
There’s a long history of attempts at artificial speech that take this approach, recreating mouth parts and vibrating air. They are all pretty silly, like this work, which fails to understand how writing isn’t just a derivative of speech.
2. crazygringo:
> They are all pretty silly,

Huh? How?

> like this work which fails to understand how writing isn’t just a derivative of speech.

The whole point of the article is that writing isn't just a derivative of speech. It's in the introduction.