OpenAI charges by the minute, so speed up your audio

(george.mand.is)

669 points georgemandis | 2 comments | 25 Jun 25 13:17 UTC

w-m ◴[25 Jun 25 15:21 UTC] No.44378345[source]▶

By transcribing a talk by Andrej, you picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. One of the people you absolutely have to set your YouTube speed back down to 1x when listening to follow what's going on.

In the spirit of getting more out of an OpenAI minute, don't send it any silence.

E.g.

    ffmpeg -i video-audio.m4a \
      -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
                         stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
                         apad=pad_dur=0.02" \
      -c:a aac -b:a 128k output_minpause.m4a -y

will cut the talk down from 39m31s to 31m34s by replacing any silence (with a -50dB threshold) longer than 20ms with a 20ms pause. And to keep with the spirit of your post, I only measured that the input file got shorter; I didn't look at all at the quality of the transcription you get from feeding it the shorter version.
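
To put a number on that, here is a minimal Python sketch that turns the two durations quoted above into time saved and the fraction of billed time avoided. The helper names `to_seconds` and `savings` are made up for illustration, and it assumes billing scales linearly with audio length.

```python
def to_seconds(stamp: str) -> int:
    """Parse a duration written like '39m31s' into whole seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def savings(before: str, after: str) -> tuple[int, float]:
    """Return (seconds saved, fraction of the original duration saved)."""
    b, a = to_seconds(before), to_seconds(after)
    return b - a, (b - a) / b


# Durations taken from the comment above.
saved, fraction = savings("39m31s", "31m34s")
print(f"saved {saved // 60}m{saved % 60:02d}s ({fraction:.1%} of the original length)")
```

That works out to a bit under 8 minutes, roughly a fifth of the file, before any speed-up is applied.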

replies(12): >>44378492 #>>44378769 #>>44378939 #>>44378971 #>>44380884 #>>44380906 #>>44381352 #>>44382788 #>>44382864 #>>44384720 #>>44388923 #>>44388970 #

behnamoh ◴[25 Jun 25 16:13 UTC] No.44378939[source]▶

>>44378345 #

> His natural talking speed is already >=1.5x that of a normal human. One of the people you absolutely have to set your YouTube speed back down to 1x when listening to follow what's going on.

I wonder if there's a way to automatically detect how "fast" a person talks in an audio file. I know it's subjective, and different people talk at different paces within the same recording, but it'd be cool to know roughly when OP's trick fails (they mention 4x ruined the output; maybe for Karpathy that would happen at 2x).
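
One rough way to approach this, sketched in Python: if you already have a timestamped transcript of a short sample, words per minute over the spoken segments gives a crude speaking rate. Everything here is an assumption for illustration: the `Segment` shape, the function names, and especially `ceiling_wpm`, a placeholder for whatever rate the transcription model still handles, which you would have to find by experiment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str     # words spoken in this segment


def words_per_minute(segments: list[Segment]) -> float:
    """Speaking rate over spoken time only (gaps between segments are ignored)."""
    words = sum(len(s.text.split()) for s in segments)
    spoken = sum(s.end - s.start for s in segments)
    if spoken <= 0:
        raise ValueError("segments contain no spoken time")
    return words * 60.0 / spoken


def max_speedup(wpm: float, ceiling_wpm: float) -> float:
    """Largest speed factor that keeps the sped-up rate at or below ceiling_wpm."""
    return ceiling_wpm / wpm


# Toy data: 12 words over 4 seconds of speech.
sample = [
    Segment(0.0, 2.0, "one two three four five six"),
    Segment(3.0, 5.0, "seven eight nine ten eleven twelve"),
]
rate = words_per_minute(sample)
print(rate, max_speedup(rate, ceiling_wpm=450.0))  # 450.0 is an arbitrary placeholder
```

Word count is only a proxy (syllables per second would be closer to what the ear hears), but it is enough to compare a fast speaker against a slow one before picking a speed factor.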

replies(7): >>44379087 #>>44379461 #>>44379539 #>>44380162 #>>44380831 #>>44383231 #>>44387266 #

varispeed ◴[25 Jun 25 16:57 UTC] No.44379461[source]▶

>>44378939 #

It's a shame platforms don't generally support speeds greater than 2x. One of my "superpowers" (or curses) is that I cannot stand a normal speaking pace. When I watch lectures, I always go for maximum speed, and that is still too slow for me. I wish platforms offered 4x, done properly (with minimal artefacts).
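
For files you have locally (not a platform player), ffmpeg's `atempo` filter can change tempo without changing pitch, and chaining several instances is the usual way to go past 2x. Below is a small Python sketch that builds such a chain. The helper name `atempo_chain` and the choice to cap each step at 2.0 are assumptions for illustration, not something from the thread, and it says nothing about how good 4x actually sounds.

```python
def atempo_chain(speed: float, max_step: float = 2.0) -> str:
    """Build an ffmpeg -af string that reaches `speed` using steps of at most `max_step`."""
    if speed < 0.5:
        raise ValueError("this sketch only handles speed factors of 0.5 and above")
    steps = []
    remaining = speed
    # Peel off full-size steps until what is left fits in a single atempo instance.
    while remaining > max_step:
        steps.append(max_step)
        remaining /= max_step
    steps.append(remaining)
    return ",".join(f"atempo={s:g}" for s in steps)


print(atempo_chain(4.0))  # filter string for 4x
print(atempo_chain(3.0))  # filter string for 3x
```

The resulting string would be passed to ffmpeg as the value of `-af`, in the same position as the `silenceremove` filter in the command further up the thread.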

replies(10): >>44379513 #>>44379536 #>>44379612 #>>44379810 #>>44379982 #>>44380594 #>>44380830 #>>44381970 #>>44384356 #>>44387197 #

1. cookingrobot ◴[25 Jun 25 18:44 UTC] No.44380594[source]▶

>>44379461 #

There are fonts designed to be legible at really small sizes. I wonder if there are voices that are especially understandable at extreme speeds.

Could use an “auctioneer” voice to play back text at 10x speed.

replies(1): >>44381442 #

2. bbatha ◴[25 Jun 25 20:17 UTC] No.44381442[source]▶

>>44380594 (TP) #

I'm also a fast listener. I find audio quality is the main differentiator in my ability to listen quickly or not. A podcast recorded at high quality I can listen to at 3-4x (with silence trimmed) comfortably; the second someone calls in from their phone, I'm getting every 4th word and often need to go down to 2x or less. Mumbly accents also affect how fast I can go, but not as much; then again, I rarely have trouble understanding difficult accents IRL and almost never use subtitles on TV shows/YouTube to better understand the speaker. Your mileage may vary.

I understand 4-6x speakers fairly well but don't enjoy listening at that pace. If I lose focus for a couple of seconds I effectively miss a paragraph of context and my brain can't fill in the missing details.
