(george.mand.is)

669 points georgemandis | 2 comments | 25 Jun 25 13:17 UTC | HN request time: 0.604s | source

1. conjecTech ◴[26 Jun 25 06:56 UTC] No.44384893[source]▶

If you are hosting whisper yourself, you can do something slightly more elegant, but with the same effect. You can downsample/pool the context 2:1 (or potentially more) a few layers into the encoder. That allows you to do the equivalent of speeding up audio without worry about potential spectral losses. For whisper large v3, that gets you nearly double throughput in exchange for a relative ~4% WER increase.

replies(1): >>44385135 #

2. nomercy400 ◴[26 Jun 25 07:42 UTC] No.44385135[source]▶

>>44384893 (TP) #

Do you have more details or examples on how to downsample the context in the encoder? I treat the encoder as an opaque block, so I have no idea where to start.

↑

OpenAI charges by the minute, so speed up your audio