
425 points | karimf
bob1029 ◴[] No.45656123[source]
Why not normal audio codecs? How are JPEG and MP3 (i.e., DCT/MDCT) not a reasonable way to go about tokenizing spatial and time domain signals for these kinds of models?

Each MP3 frame is entirely self-contained and can completely reconstruct a few tens of milliseconds of original audio. It does not require other frames to do this. I think this is the most important element. At 128kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of time. This is a reduction of 10-11x over the raw PCM waveform. MP3 is also designed to eliminate the information that humans don't seem to care about.
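The frame-size arithmetic above checks out; here is a minimal sketch of the standard MPEG-1 Layer III frame-length formula (padding bits ignored) against raw 16-bit stereo PCM:

```python
# Rough MP3 frame-size arithmetic for 128 kbps CBR at 44.1 kHz.
SAMPLES_PER_FRAME = 1152     # MPEG-1 Layer III
SAMPLE_RATE = 44_100         # Hz
BITRATE = 128_000            # bits/s

frame_ms = 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE   # ~26.1 ms per frame
frame_bytes = 144 * BITRATE // SAMPLE_RATE          # ~417 bytes (standard formula, no padding)

# Raw 16-bit stereo PCM covering the same window:
pcm_bytes = SAMPLES_PER_FRAME * 2 * 2               # 4608 bytes
ratio = pcm_bytes / frame_bytes                     # ~11x reduction

print(frame_ms, frame_bytes, ratio)
```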

I don't know if it's possible to use 400 byte tokens in a transformer model, but I would be very compelled to try.

ACCount37 ◴[] No.45656782[source]
You can try to train an adapter from a raw 400-byte MP3 frame to an embedding for a given LLM (4096+ floating point numbers, exact precision varies).

But you'd need that information to be digestible for a neural network. Otherwise, you'll have a very hard time getting that adapter to work.

As a rule: neural networks love highly redundant data, and hate highly compressed data at their inputs. Tokenized text good, GZIP compressed bytestream bad. But who knows, really. It's a rule of thumb, not a mathematical law. So you could have some success getting that MP3-based adapter to work. I've seen weirder shit work.
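A hypothetical sketch of what such an adapter's input side could look like, using a single linear projection from a ~418-byte frame to a 4096-dim embedding (dimensions, scaling, and the use of NumPy in place of a real training framework are all assumptions; an actual adapter would be trained end-to-end and almost certainly nonlinear):

```python
import numpy as np

FRAME_BYTES = 418   # one 128 kbps CBR frame, per the parent comment
EMBED_DIM = 4096    # hypothetical LLM embedding width

rng = np.random.default_rng(0)
W = rng.standard_normal((FRAME_BYTES, EMBED_DIM)) * 0.02  # would be learned in practice
b = np.zeros(EMBED_DIM)

def adapt(frame: bytes) -> np.ndarray:
    # Scale raw bytes to [-1, 1] before projecting; a trained adapter
    # would likely stack several nonlinear layers here instead.
    x = np.frombuffer(frame, dtype=np.uint8).astype(np.float32)
    x = x / 127.5 - 1.0
    return x @ W + b

emb = adapt(bytes(range(256)) + bytes(162))  # dummy 418-byte frame
print(emb.shape)
```

The rule of thumb in the parent comment is exactly why this naive version would likely struggle: the Huffman-coded portion of the frame looks nearly random at the byte level, which is hard for a linear map to untangle.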

1. a-dub ◴[] No.45660360{3}[source]
if you were able to normalize and quantokenize the distinct dct values in a consistent way, it could be an interesting approach. so yeah, undo the bit packing but keep the front end signal processing and compressed dct representation and voilà! something quite weird that might actually work. :)
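A toy sketch of the "quantokenize" idea, under stated assumptions: pretend the entropy coding has been undone, take the per-granule MDCT coefficients, normalize by peak magnitude, and bucket each coefficient into a small discrete vocabulary so each bucket index can serve as a token. The bin count, normalization scheme, and the stand-in coefficients are all invented for illustration:

```python
import numpy as np

N_BINS = 256  # hypothetical per-coefficient vocabulary size

def quantokenize(coeffs: np.ndarray, n_bins: int = N_BINS) -> np.ndarray:
    # Normalize to [-1, 1] by max magnitude, then map linearly to bin indices.
    scale = max(float(np.max(np.abs(coeffs))), 1e-12)
    normed = coeffs / scale
    bins = ((normed + 1.0) / 2.0 * (n_bins - 1)).round()
    return np.clip(bins, 0, n_bins - 1).astype(int)

# Fake "MDCT" coefficients standing in for one decoded granule:
coeffs = np.array([0.0, 12.5, -12.5, 3.1, -7.0])
tokens = quantokenize(coeffs)
print(tokens)
```

In a real pipeline the scale factor would come from the codec's own scalefactor bands rather than a per-frame max, which is part of what makes the representation consistent across frames.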