←back to thread

425 points karimf | 1 comments | | HN request time: 0.198s | source
Show context
bob1029 ◴[] No.45656123[source]
Why not normal audio codecs? How are JPEG and MP3 (i.e., DCT/MDCT) not a reasonable way to go about tokenizing spatial and time domain signals for these kinds of models?

Each MP3 frame is entirely self-contained and can completely reconstruct a few tens of milliseconds of original audio. It does not require other frames to do this. I think this is the most important element. At 128kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of time. This is a reduction of 10-11x over the raw PCM waveform. MP3 is also designed to eliminate the information that humans don't seem to care about.

I don't know if it's possible to use 400 byte tokens in a transformer model, but I would be very compelled to try.

replies(6): >>45656175 #>>45656782 #>>45656867 #>>45657008 #>>45657386 #>>45657808 #
1. WithinReason ◴[] No.45656867[source]
JPEG is a good idea:

The simple, elegant approach of training convolutional neural networks (CNNs) directly from RGB pixels has enjoyed overwhelming empirical success. But can more performance be squeezed out of networks by using different input representations? In this paper we propose and explore a simple idea: train CNNs directly on the blockwise discrete cosine transform (DCT) coefficients computed and available in the middle of the JPEG codec. Intuitively, when processing JPEG images using CNNs, it seems unnecessary to decompress a blockwise frequency representation to an expanded pixel representation, shuffle it from CPU to GPU, and then process it with a CNN that will learn something similar to a transform back to frequency representation in its first layers. Why not skip both steps and feed the frequency domain into the network directly? In this paper we modify \libjpeg to produce DCT coefficients directly, modify a ResNet-50 network to accommodate the differently sized and strided input, and evaluate performance on ImageNet. We find networks that are both faster and more accurate, as well as networks with about the same accuracy but 1.77x faster than ResNet-50.

https://proceedings.neurips.cc/paper_files/paper/2018/file/7...

I suspect mp3 is also a good idea