
425 points | karimf | 1 comment
bob1029 | No.45656123
Why not normal audio codecs? How are JPEG and MP3 (i.e., DCT/MDCT) not a reasonable way to tokenize spatial- and time-domain signals for these kinds of models?

Each MP3 frame is entirely self-contained and can completely reconstruct a few tens of milliseconds of the original audio without reference to any other frame. I think this is the most important element. At 128 kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of audio, a 10-11x reduction over the raw PCM waveform. MP3 is also designed to eliminate the information that humans don't seem to care about.
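
A quick sanity check of those numbers (a rough sketch, assuming MPEG-1 Layer III at 44.1 kHz, 16-bit stereo, 128 kbps CBR; the padding bit is ignored):

    bitrate = 128_000          # bits per second (CBR)
    sample_rate = 44_100       # Hz
    samples_per_frame = 1152   # fixed for MPEG-1 Layer III

    frame_ms = 1000 * samples_per_frame / sample_rate      # ~26.1 ms of audio per frame
    frame_bytes = 144 * bitrate // sample_rate             # ~417 bytes (418 with the padding bit)
    pcm_bytes = samples_per_frame * 2 * 2                  # 16-bit stereo PCM for the same span
    print(frame_ms, frame_bytes, pcm_bytes / frame_bytes)  # ~26.1, 417, ~11x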

I don't know whether it's possible to use 400-byte tokens in a transformer model, but I would be very tempted to try.

1. cubefox | No.45657008
I believe language models usually use tokens drawn from a roughly 2-byte (16-bit) space, which corresponds to a vocabulary of 2^16 = 65,536 possible tokens. With 400 bytes per token the vocabulary would be 2^(400*8) = 2^3200, an astronomically large number. Way too large to be practical, I assume.
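
For scale, the arithmetic works out like this (a rough sketch, assuming each distinct byte pattern would need its own row in the embedding table):

    import math

    vocab_2_bytes = 2 ** 16               # 65,536 possible token IDs
    bits_per_frame = 400 * 8              # 3,200 bits in a 400-byte token
    decimal_digits = int(bits_per_frame * math.log10(2)) + 1
    print(vocab_2_bytes, decimal_digits)  # 65536, then ~964 decimal digits in 2**3200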