OpenAI charges by the minute, so speed up your audio

(george.mand.is)

693 points georgemandis | 1 comments | 25 Jun 25 13:17 UTC

w-m ◴[25 Jun 25 15:21 UTC] No.44378345[source]▶

By transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. He's one of those speakers for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.

In the spirit of getting more out of an OpenAI minute, don't send it any silence.

E.g.

    ffmpeg -i video-audio.m4a \
      -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
                         stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
                         apad=pad_dur=0.02" \
      -c:a aac -b:a 128k output_minpause.m4a -y

will cut the talk down from 39m31s to 31m34s, by replacing any silence (with a -50dB threshold) longer than 20ms with a 20ms pause. And to keep with the spirit of your post, I measured only that the input file got shorter. I didn't check the quality of the transcription produced from the shorter version.
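To put the two durations above in proportion, here is a minimal sketch of the arithmetic. The helper name `to_seconds` is made up for illustration, and the calculation only covers audio length; it assumes nothing about how billing rounds minutes.

```python
import re

def to_seconds(stamp: str) -> int:
    """Parse a duration written like '39m31s' into whole seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized duration: {stamp!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds

# Durations quoted in the comment above.
before = to_seconds("39m31s")
after = to_seconds("31m34s")
saved = before - after

print(f"saved {saved // 60}m{saved % 60:02d}s ({saved / before:.1%} of the original)")
# prints: saved 7m57s (20.1% of the original)
```

So silence removal alone trims roughly a fifth of the audio here, before any speed-up is applied.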

replies(12): >>44378492 #>>44378769 #>>44378939 #>>44378971 #>>44380884 #>>44380906 #>>44381352 #>>44382788 #>>44382864 #>>44384720 #>>44388923 #>>44388970 #

jwrallie ◴[26 Jun 25 06:30 UTC] No.44384720[source]▶

>>44378345 #

From my own experience with whisper.cpp, normalizing the audio and removing silence not only shortens the processing time significantly, but also greatly improves the quality of the transcription, as silence can lead to hallucinations. You can do that graphically with Audacity too, if you do not want to deal with the command line. You also do not need any special hardware to run whisper.cpp: with the small model, practically any computer should be able to do it if you can wait a bit (less than the audio length).
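As a rough sketch of that preprocessing on the command line, the snippet below builds (but does not run) an ffmpeg invocation. It is an assumption, not the commenter's actual pipeline: it uses ffmpeg's `loudnorm` filter for normalization, reuses the `silenceremove` settings from the parent comment, and writes 16 kHz mono 16-bit WAV, the input format the whisper.cpp README has asked for. The function name `build_preprocess_cmd` and the file names are hypothetical.

```python
import shlex

def build_preprocess_cmd(src: str, dst: str,
                         threshold_db: int = -50,
                         min_pause: float = 0.02) -> list[str]:
    """Return an ffmpeg argument list that normalizes loudness, trims silence,
    and resamples to 16 kHz mono PCM. Filter choices are assumptions."""
    filters = ",".join([
        "loudnorm",  # EBU R128 loudness normalization with default targets
        f"silenceremove=start_periods=1:start_duration=0:start_threshold={threshold_db}dB:"
        f"stop_periods=-1:stop_duration={min_pause}:stop_threshold={threshold_db}dB",
    ])
    return ["ffmpeg", "-y", "-i", src, "-af", filters,
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst]

# Hypothetical file names; print the command so it can be inspected before running.
cmd = build_preprocess_cmd("meeting.m4a", "meeting_16k.wav")
print(shlex.join(cmd))
```

Normalizing before trimming means the -50dB threshold is applied to the leveled signal; whether that order suits a given recording is worth checking by ear.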

One half interesting / half depressing observation I made is that at my workplace any meeting recording I tried to transcribe in this way had its length reduced to almost 2/3 when cutting off the silence. Makes you think about the efficiency (or lack of it) of holding long(ish) meetings.

replies(3): >>44384975 #>>44385016 #>>44388493 #

d1sxeyes ◴[26 Jun 25 07:12 UTC] No.44384975[source]▶

>>44384720 #

1/3 of the meeting is silence? That’s a good thing. It’s allowing people time to think over what they’re hearing, there are pauses to allow people to contribute or participate. What do you think a better percentage of silent time would be?

replies(1): >>44386518 #

jwrallie ◴[26 Jun 25 11:54 UTC] No.44386518[source]▶

>>44384975 #

Good point. Somehow, if I think of a 30-minute meeting, 10 minutes of silence sounds great, but seeing a 1-hour block disappear from a 3-hour recording makes me want to use that “free” hour to do something else.

Well, I don’t think silence is the real problem with a 3-hour meeting!

replies(1): >>44386995 #

literalAardvark ◴[26 Jun 25 12:58 UTC] No.44386995[source]▶

>>44386518 #

If people could speak continuously for an entire meeting, then that meeting would be better off as an email. Meetings are for bouncing half-formed ideas around and coalescing them into something greater.

There MUST be time to think
