
669 points georgemandis | 2 comments
1. conjecTech No.44384893
If you are hosting Whisper yourself, you can do something slightly more elegant with the same effect: downsample/pool the context 2:1 (or potentially more) a few layers into the encoder. That lets you do the equivalent of speeding up the audio without worrying about potential spectral losses. For Whisper large-v3, that gets you nearly double the throughput in exchange for a relative ~4% WER increase.
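A rough sketch of what pooling a few layers into the encoder could look like, in plain Python so the mechanics are visible. This is only an illustration, not the code behind the numbers above: `avg_pool_time`, `encode_with_pooling`, and `pool_after` are hypothetical names, the layers are stand-in callables rather than real Whisper blocks, and average pooling is an assumed choice of pooling. In a real PyTorch model the same step would probably be something like `torch.nn.functional.avg_pool1d` over the time axis of the hidden states between two encoder blocks.

```python
from typing import Callable, List

Frame = List[float]                          # hidden-state vector for one encoder position
Layer = Callable[[List[Frame]], List[Frame]]  # stand-in for one encoder block


def avg_pool_time(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Average each run of `factor` consecutive frames along the time axis.

    A trailing partial group is averaged over the frames that remain.
    """
    pooled: List[Frame] = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def encode_with_pooling(
    frames: List[Frame],
    layers: List[Layer],
    pool_after: int,      # hypothetical knob: layers run at full length before pooling
    factor: int = 2,
) -> List[Frame]:
    """Run the layers in order, shrinking the sequence once after `pool_after` layers."""
    for index, layer in enumerate(layers):
        if index == pool_after:
            # Every later layer now sees a sequence that is `factor` times shorter.
            frames = avg_pool_time(frames, factor)
        frames = layer(frames)
    return frames


if __name__ == "__main__":
    # Whisper's encoder works on 1500 positions per 30 s window.
    toy_frames = [[float(i), 1.0] for i in range(1500)]
    identity: Layer = lambda xs: xs          # placeholder for real attention/MLP blocks
    out = encode_with_pooling(toy_frames, [identity] * 4, pool_after=2)
    print(len(toy_frames), "->", len(out))
```

The speedup comes from everything downstream of the pooling point: the remaining encoder layers run on half as many positions, and the decoder's cross-attention has half as much context to attend over. Where exactly to pool is a tradeoff this sketch does not settle; pooling earlier saves more compute but presumably costs more accuracy.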
replies(1): >>44385135
2. nomercy400 No.44385135
Do you have more details or examples on how to downsample the context in the encoder? I treat the encoder as an opaque block, so I have no idea where to start.