
425 points karimf | 1 comment | source
crazygringo ◴[] No.45656005[source]
This is fascinating.

Obviously working directly with audio is vastly more complex than with text.

But it is very exciting to see that part of making LLMs work natively with speech is finding a codec that is maximally efficient at encoding speech.

I even have to wonder whether, at some point, we ultimately create a popular voice codec usable with LLMs based not on the Fourier transform or similar, but rather on a set of physical parameters describing vocal cord shape, tongue position, throat/chest/mouth shape, etc.

I can imagine such a model being arrived at statistically (to determine the necessary number of parameters), and then almost becoming "hard-coded" as a standard, since human anatomy doesn't vary much there beyond certain ranges.

I think it's called formant speech encoding, and it would be interesting if LLMs wind up massively advancing that field, since historically it has been used more for speech synthesis than for audio compression.
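
To make that concrete, here's a minimal sketch of classic cascade formant synthesis in Python (numpy/scipy). This is my own illustration, not any particular codec: a glottal impulse train excites a cascade of two-pole resonators, with rough textbook formant values for the vowel /a/:

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000                    # sample rate (Hz)
    f0 = 120                      # pitch (Hz)
    n = int(fs * 0.5)             # half a second of samples

    # Glottal "source": impulse train at the pitch period
    source = np.zeros(n)
    source[::fs // f0] = 1.0

    # Vocal-tract "filter": cascade of two-pole resonators at rough
    # formant frequencies/bandwidths (Hz) for the vowel /a/
    formants = [(730, 90), (1090, 110), (2440, 170)]
    signal = source
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / fs)          # pole radius from bandwidth
        w = 2 * np.pi * freq / fs             # pole angle from frequency
        signal = lfilter([1 - r], [1, -2 * r * np.cos(w), r * r], signal)

    signal /= np.abs(signal).max()            # normalize

Varying a handful of slowly changing numbers (pitch plus a few frequency/bandwidth pairs) morphs the vowel, which is the whole appeal for compression.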

replies(3): >>45656334 #>>45657049 #>>45660804 #
1. duped ◴[] No.45657049[source]
In speech coding/synthesis this is called a "source-filter" model (decompose speech production into a sound generator in the vocal folds and a filter in the vocal tract, and parameterize each), and it's actually older than Cooley and Tukey's rediscovery of the FFT.
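
For anyone curious, a minimal numpy sketch of the linear-prediction (LPC) view of the source-filter model: fit an all-pole "vocal tract" filter to a speech frame, then inverse-filter to recover the "source" residual. This is a generic illustration, not any specific coder; `frame` is assumed to be a mono float array, and order 12 is a typical choice for ~8 kHz speech:

    import numpy as np

    def lpc(frame, order=12):
        """All-pole filter coefficients via the autocorrelation method."""
        x = frame * np.hamming(len(frame))
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
        R = np.array([[r[abs(i - j)] for j in range(order)]
                      for i in range(order)])
        a = np.linalg.solve(R, r[1:order + 1])  # Yule-Walker equations
        return np.concatenate(([1.0], -a))      # A(z) = 1 - sum_k a_k z^-k

    def excitation(frame, a):
        """Inverse-filter through A(z); what's left is the 'source'."""
        return np.convolve(frame, a)[:len(frame)]

Resynthesis is just the reverse: run an impulse train (voiced) or noise (unvoiced), or the residual itself, back through 1/A(z), e.g. scipy.signal.lfilter([1], a, excitation(frame, a)).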