
425 points | karimf | 1 comment | source
bob1029 ◴[] No.45656123[source]
Why not normal audio codecs? How are JPEG and MP3 (i.e., DCT/MDCT) not a reasonable way to go about tokenizing spatial and time domain signals for these kinds of models?

Each MP3 frame is self-contained and can reconstruct a few tens of milliseconds of the original audio without needing other frames; I think this is the most important element. At 128kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of audio. This is a reduction of 10-11x over the raw PCM waveform. MP3 is also designed to discard information that human listeners mostly can't perceive.

I don't know if it's possible to use ~400-byte tokens in a transformer model, but I would be very tempted to try.
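
A rough sketch of the frame math, plus one untested way such a frame could be turned into a token embedding; the projection and model width below are made up purely for illustration:

```python
# Back-of-the-envelope check of the numbers above, and a toy sketch of
# feeding one MP3 frame to a transformer as a single token. The projection
# and model width are hypothetical, not from any existing system.
import numpy as np

BITRATE_BPS   = 128_000   # 128 kbps CBR
SAMPLE_RATE   = 44_100    # Hz, CD audio
FRAME_SAMPLES = 1_152     # samples per MPEG-1 Layer III frame

frame_seconds = FRAME_SAMPLES / SAMPLE_RATE              # ~0.026 s
frame_bytes   = round(BITRATE_BPS * frame_seconds / 8)   # ~418 bytes
pcm_bytes     = FRAME_SAMPLES * 2 * 2                    # 16-bit stereo PCM for the same span
print(f"{frame_seconds*1000:.1f} ms, {frame_bytes} B, {pcm_bytes / frame_bytes:.1f}x smaller")

# Hypothetical: treat the frame's bytes as one input vector and learn a
# linear projection into the model's embedding space.
d_model = 512                                             # assumed transformer width
proj    = np.random.randn(frame_bytes, d_model) * 0.02    # stand-in for learned weights
frame   = np.random.randint(0, 256, frame_bytes)          # stand-in for real frame bytes
token   = (frame / 255.0) @ proj                          # one token embedding, shape (d_model,)
```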

replies(6): >>45656175 #>>45656782 #>>45656867 #>>45657008 #>>45657386 #>>45657808 #
PaulDavisThe1st ◴[] No.45656175[source]
The approach in TFA encodes into a 32-dimensional space. I suspect that is significantly more dimensions than any psychoacoustic compression algorithm uses. Also, throwing away information that our hearing systems can't process very well is not particularly useful if your goal is speech (or, more generally, audio) synthesis from scratch.
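
For concreteness, a toy picture of what a per-frame 32-dimensional latent looks like; the weights here are random stand-ins, not the actual encoder from TFA:

```python
# Toy illustration only: a neural codec's encoder maps each short audio
# frame to one continuous 32-dim latent vector. The linear map below is a
# random stand-in for a learned encoder.
import numpy as np

LATENT_DIM = 32
FRAME_LEN  = 1_152                                         # samples per frame, matching the MP3 figure above

weights = np.random.randn(FRAME_LEN, LATENT_DIM) * 0.01   # stand-in for learned encoder weights
frame   = np.random.randn(FRAME_LEN)                       # stand-in for real PCM samples
latent  = frame @ weights                                  # shape: (32,)
print(latent.shape)
```
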
replies(1): >>45656394 #
bob1029 ◴[] No.45656394[source]
> throwing away information that our hearing systems can't process very well is not particularly useful if your goal is speech (or more generally, audio) synthesis from scratch.

I'm not sure I follow. If there is a set of tokens that the average human cannot perceive, why wouldn't we want to eliminate them from the search space? Who is the target audience for this model?

replies(3): >>45656507 #>>45656712 #>>45656731 #
1. CaptainOfCoit ◴[] No.45656507[source]
Maybe because things outside our audible range can influence what we perceive inside of it?