OpenAI charges by the minute, so speed up your audio

(george.mand.is)

671 points georgemandis | 1 comments | 25 Jun 25 13:17 UTC


w-m ◴[25 Jun 25 15:21 UTC] No.44378345[source]▶

By transcribing a talk by Andrej, you picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. He's one of those people for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.

In the spirit of getting more out of an OpenAI minute, don't send it any silence.

E.g.

    ffmpeg -i video-audio.m4a \
      -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
                         stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
                         apad=pad_dur=0.02" \
      -c:a aac -b:a 128k output_minpause.m4a -y

will cut the talk down from 39m31s to 31m34s, by replacing any silence (with a -50dB threshold) longer than 20ms with a 20ms pause. And in keeping with the spirit of your post, I only measured that the input file got shorter; I didn't look at all at the quality of the transcription you get from feeding it the shorter version.
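
For a sense of scale, here is the arithmetic behind those two durations as a small shell sketch. The helper name `to_seconds` is made up for illustration, and it only restates the numbers quoted above; it says nothing about transcription quality or actual billing.

```shell
# Convert minutes and seconds to total seconds (hypothetical helper).
to_seconds() { echo $(( $1 * 60 + $2 )); }

before=$(to_seconds 39 31)   # original length: 39m31s
after=$(to_seconds 31 34)    # after silence removal: 31m34s
saved=$(( before - after ))

# Integer math only: compute per-mille, then print it as a percentage
# with one decimal place.
permille=$(( saved * 1000 / before ))
echo "saved ${saved}s of ${before}s (~$(( permille / 10 )).$(( permille % 10 ))%)"
```

This prints `saved 477s of 2371s (~20.1%)`, so roughly a fifth of the billed minutes would go away if the shortened file transcribes just as well.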

replies(12): >>44378492 #>>44378769 #>>44378939 #>>44378971 #>>44380884 #>>44380906 #>>44381352 #>>44382788 #>>44382864 #>>44384720 #>>44388923 #>>44388970 #

georgemandis ◴[25 Jun 25 15:33 UTC] No.44378492[source]▶

>>44378345 #

Oooh fun! I had a feeling there was more ffmpeg wizardry I could be leaning into here. I'll have to try this later—thanks for the idea!

replies(1): >>44378587 #

w-m ◴[25 Jun 25 15:43 UTC] No.44378587[source]▶

>>44378492 #

In the meantime I realized that the apad part is nonsensical: it pads only the end of the stream, not each silence-removed cut. I wanted to get angry at o3 for proposing this, but then I had a look at the silenceremove= documentation myself: https://ffmpeg.org/ffmpeg-filters.html#silenceremove

Good god. You couldn't make that any more convoluted and hard-to-grasp if you wanted to. You gotta love ffmpeg!

I now think this might be a good solution:

    ffmpeg -i video-audio.m4a \
           -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
           -c:a aac -b:a 128k output.m4a -y
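
To make that one-liner easier to read, here is the same filter string assembled piece by piece, with each option annotated according to my reading of the silenceremove docs (treat the comments as an interpretation, not gospel). It is a dry run: it only prints the command instead of invoking ffmpeg.

```shell
# Build the filter one option at a time.
filter="silenceremove"
filter="$filter=start_periods=1"        # trim silence at the very beginning, up to the first non-silence
filter="$filter:stop_periods=-1"        # negative value: keep trimming silence in the middle, not only at the end
filter="$filter:stop_duration=0.15"     # silence has to last 0.15s before audio stops being copied
filter="$filter:stop_threshold=-40dB"   # level below which audio counts as silence
filter="$filter:detection=rms"          # measure level by RMS rather than peak

# Dry run: show the command that would be executed.
echo "ffmpeg -i video-audio.m4a -af \"$filter\" -c:a aac -b:a 128k output.m4a -y"
```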

replies(3): >>44379478 #>>44380996 #>>44388001 #

squigz ◴[25 Jun 25 19:26 UTC] No.44380996[source]▶

>>44378587 #

Out of curiosity, how might you improve those docs? They seem fairly reasonable to me

replies(1): >>44381944 #

w-m ◴[25 Jun 25 21:21 UTC] No.44381944[source]▶

>>44380996 #

The documentation reads like it was written by a programmer who documented the different parameters to their implementation of a specific algorithm. Now when you as the user come along and want to use silenceremove, you'll have to carefully read through this, and build your own mental model of that algorithm, and then you'll be able to set these parameters accordingly. That takes a lot of time and energy, in this case multiple read-throughs and I'd say > 5 minutes.

Good documentation should do this work for you. It should explain reasonably atomic concepts that you can immediately adapt and compose. Where it already works is for the "detection" and "window" parameters, which are straightforward. But the actions of trimming in the start/middle/end, and how to configure how long the silence lasts before trimming, whether to ignore short bursts of noise, whether to skip every nth silence period, these are all ideas and concepts that get mushed together in 10 parameters which are called start/stop-duration/threshold/silence/mode/periods.

If you want to apply this filter, it takes a long time to build mental models for these 10 parameters. The docs do have some example calls, which is great, but they don't help if you need to adjust any of these; then you probably need to understand them all.

Some stuff I stumbled over when reading it:

"To remove silence from the middle of a file, specify a stop_periods that is negative. This value is then treated as a positive value [...]" - what? Why is this parameter so heavily overloaded?

"start_duration: Specify the amount of time that non-silence must be detected before it stops trimming audio" - parameter is named start_something, but it's about stopping? Why?

"start_periods: [...] Normally, [...] start_periods will be 1 [...]. Default value is 0."

"start_mode: Specify mode of detection of silence end at start": start_mode end at start?

It's very clunky. Every parameter has multiple modes of operation. Why is it start and stop for beginning and end, and why is "do stuff in the middle" part of the end? Why is there no global mode?

You could nitpick this stuff to death. In the end, naming things is famously one of the two hard problems in computer science (the others being cache invalidation and off-by-one errors). And writing good documentation is also very, very hard work. Just exposing the internals of the algorithm is often not great UX, because then every user has to learn how the thing works internally before they can start using it (hey, looking at you, git).

So while it's easy to point out where these docs fail, it would be a lot of work to rewrite this documentation from the top down, explaining the concepts first. Or even rewriting the interface to make this more approachable, and the parameters less overloaded. But since it's hard work, and not sexy to programmers, it won't get done, and many people will come after, having to spend time on reading and re-reading this current mess.
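
As a rough illustration of what a less overloaded interface could look like, here is a tiny shell wrapper that exposes only two concepts, "longest pause to keep" and "silence threshold", and maps them onto the existing options. The function name `shorten_pauses_filter` and its defaults are hypothetical, nothing like it ships with ffmpeg, and the mapping reflects my reading of the docs.

```shell
# Hypothetical wrapper, not part of ffmpeg: prints a silenceremove filter string.
#   $1  longest pause to keep, in seconds (default 0.15)
#   $2  silence threshold in dB           (default -40)
shorten_pauses_filter() {
  sp_max_pause="${1:-0.15}"
  sp_threshold_db="${2:--40}"
  printf 'silenceremove=start_periods=1:stop_periods=-1:stop_duration=%s:stop_threshold=%sdB:detection=rms\n' \
    "$sp_max_pause" "$sp_threshold_db"
}

# Usage (dry run): print the command rather than running it.
echo "ffmpeg -i video-audio.m4a -af \"$(shorten_pauses_filter 0.2 -45)\" -c:a aac -b:a 128k output.m4a -y"
```

The design choice is simply to name the knobs after what the user wants (shorter pauses) rather than after the algorithm's internal start/stop state.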

replies(3): >>44386272 #>>44386483 #>>44388611 #

1. ada1981 ◴[26 Jun 25 11:50 UTC] No.44386483[source]▶

>>44381944 #

Curious if this is helpful.

https://claude.ai/public/artifacts/96ea8227-48c3-484d-b30b-6...

I had Claude rewrite the documentation for silenceremove based on your feedback.
