    685 points georgemandis | 26 comments
    w-m ◴[] No.44378345[source]
    By transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. He's one of the people for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.

    In the spirit of making more of an OpenAI minute, don't send it any silence.

    E.g.

        ffmpeg -i video-audio.m4a \
          -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
                             stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
                             apad=pad_dur=0.02" \
          -c:a aac -b:a 128k output_minpause.m4a -y
    
    will cut the talk down from 39m31s to 31m34s by replacing any silence (at a -50dB threshold) longer than 20ms with a 20ms pause. And to keep with the spirit of your post, I measured only that the input file got shorter; I didn't look at all at the quality of the transcription you get from feeding it the shorter version.
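
    For intuition, the same idea can be sketched on raw samples in plain Python. This is a toy illustration of "cap every pause at a fixed length", not a description of how ffmpeg's silenceremove works internally. The names `shorten_silences` and `db_to_amplitude`, and the sample-count parameter, are assumptions made up for the example.

```python
def db_to_amplitude(db):
    """Convert a dBFS threshold such as -50 to a linear amplitude."""
    return 10 ** (db / 20)


def shorten_silences(samples, threshold, max_silence):
    """Cap every run of quiet samples at max_silence samples.

    samples: list of floats in [-1.0, 1.0]
    threshold: absolute amplitude below which a sample counts as silence
    max_silence: longest run of silent samples to keep
    """
    out = []
    run = 0  # length of the current silent run
    for s in samples:
        if abs(s) < threshold:
            run += 1
            if run <= max_silence:
                out.append(s)  # still within the allowed pause length
        else:
            run = 0
            out.append(s)
    return out


# Toy signal: two loud samples, a long pause, then two more loud samples.
signal = [0.5, -0.4] + [0.0] * 10 + [0.3, 0.2]
shortened = shorten_silences(signal, db_to_amplitude(-50), max_silence=3)
print(len(signal), "->", len(shortened))
```

    At a 44.1 kHz sample rate a 20ms pause would be 882 samples, so `max_silence=882` would be the rough equivalent of the command above.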
    replies(12): >>44378492 #>>44378769 #>>44378939 #>>44378971 #>>44380884 #>>44380906 #>>44381352 #>>44382788 #>>44382864 #>>44384720 #>>44388923 #>>44388970 #
    1. behnamoh ◴[] No.44378939[source]
    > His natural talking speed is already >=1.5x that of a normal human. He's one of the people for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.

    I wonder if there's a way to automatically detect how "fast" a person talks in an audio file. I know it's subjective and different people talk at different paces in an audio, but it'd be cool to kinda know when OP's trick fails (they mention x4 ruined the output; maybe for karpathy that would happen at x2).

    replies(7): >>44379087 #>>44379461 #>>44379539 #>>44380162 #>>44380831 #>>44383231 #>>44387266 #
    2. echelon ◴[] No.44379087[source]
    > I wonder if there's a way to automatically detect how "fast" a person talks in an audio file.

    Stupid heuristic: take a segment of video, transcribe text, count number of words per utterance duration. If you need speaker diarization, handle speaker utterance durations independently. You can further slice, such as syllable count, etc.
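
    A minimal sketch of that heuristic, assuming the transcription step already produced utterances as `(speaker, start_sec, end_sec, text)` tuples. That tuple format and the name `speaking_rates` are made up for the example; real transcription and diarization tools each have their own output shape.

```python
from collections import defaultdict


def speaking_rates(utterances):
    """Return words per second for each speaker.

    utterances: iterable of (speaker, start_sec, end_sec, text) tuples
    (an assumed format, not the output of any particular tool).
    """
    words = defaultdict(int)
    seconds = defaultdict(float)
    for speaker, start, end, text in utterances:
        words[speaker] += len(text.split())
        seconds[speaker] += end - start  # only time spent talking counts
    return {spk: words[spk] / seconds[spk] for spk in words if seconds[spk] > 0}


# Made-up example data, not taken from the talk.
utterances = [
    ("A", 0.0, 2.0, "so the idea is really simple"),
    ("B", 2.5, 4.5, "right okay"),
    ("A", 5.0, 7.0, "you just count the words"),
]
rates = speaking_rates(utterances)
print(rates)
```

    Pauses between utterances are excluded because only each utterance's own duration is summed, so a speaker who talks fast but pauses a lot still scores as fast.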

    replies(1): >>44379192 #
    3. nand4011 ◴[] No.44379192[source]
    https://www.science.org/doi/10.1126/sciadv.aaw2594

    Apparently human language conveys information at around 39 bits/s. You could use a technique similar to that paper's to determine the information rate of a speaker and then correct it to 39 bits/s by changing the speed of the video.
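
    The correction step is just a ratio. A rough sketch, assuming the information rate is estimated as syllables per second times bits per syllable; the function name and the per-syllable values below are placeholders, not figures from the paper.

```python
TARGET_BITS_PER_SEC = 39.0  # the figure quoted above


def playback_factor(syllables_per_sec, bits_per_syllable, target=TARGET_BITS_PER_SEC):
    """Speed multiplier that brings a speaker to the target information rate."""
    measured = syllables_per_sec * bits_per_syllable
    return target / measured


# Hypothetical speakers; the bits-per-syllable value is a placeholder.
on_target = playback_factor(6.5, 6.0)  # already at 39 bits/s
slow = playback_factor(5.0, 6.0)       # 30 bits/s, so it needs speeding up
print(on_target, slow)
```

    A factor above 1 means speed the video up, below 1 means slow it down.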

    4. varispeed ◴[] No.44379461[source]
    It's a shame platforms don't generally support speeds greater than 2x. One of my "superpowers", or curses, is that I cannot stand normal speaking pace. When I watch lectures, I always go for maximum speed and that is still too slow for me. I wish platforms included 4x, but done properly (with minimal artefacts).
    replies(10): >>44379513 #>>44379536 #>>44379612 #>>44379810 #>>44379982 #>>44380594 #>>44380830 #>>44381970 #>>44384356 #>>44387197 #
    5. lofaszvanitt ◴[] No.44379513[source]
    Robot in a human body identified :D.
    6. mrmuagi ◴[] No.44379536[source]
    All audiobooks are like this for me. I tried it for lectures, but if I'm taking handwritten notes, my writing can't keep up.

    I wonder if there are negative side effects of this, though. Do you notice that interacting with people who speak slower requires a greater deal of patience?

    replies(3): >>44379957 #>>44380513 #>>44383539 #
    7. btown ◴[] No.44379539[source]
    Even a last-decade transcription model could be used to detect a rough number of syllables per unit time, and the accuracy of that model could be used to guide speed-up and dead-time detection before sending to a more expensive model. As with all things, it's a question of whether the cost savings justify the engineering work.
    8. dpcx ◴[] No.44379612[source]
    https://github.com/codebicycle/videospeed has been a wonderful addition for me.
    9. ◴[] No.44379810[source]
    10. colechristensen ◴[] No.44379957{3}[source]
    No, but a little. I struggle with people who repeat every point of what they're saying to you several times, or who, when you say "you told me exactly this the last time we spoke", cannot be stopped from retelling the whole thing verbatim. Usually in those situations, though, there are some potential cognitive issues, so you can only be understanding.
    11. narratives1 ◴[] No.44379982[source]
    I use a Chrome extension that lets you take any video player (including embedded ones) to 10x speed. I turn most things up to 3-4x. It works on ads too.
    replies(1): >>44380424 #
    12. janalsncm ◴[] No.44380162[source]
    > I wonder if there's a way to automatically detect how "fast" a person talks in an audio file

    Transcribe it locally using whisper and output tokens/sec?

    replies(1): >>44381453 #
    13. munch117 ◴[] No.44380424{3}[source]
    I use a bookmarklet:

    javascript:void%20function(){document.querySelector(%22video,audio%22).playbackRate=parseFloat(prompt(%22Set%20the%20playback%20rate%22))}();

    14. hamburglar ◴[] No.44380513{3}[source]
    I once attended a live talk by Leslie Lamport and as he talked, I had the overwhelming feeling that something was wrong, and was thinking “did he have a stroke or something?” but then I realized I had just always watched his lectures online and had become accustomed to listening to him at 2x.
    15. cookingrobot ◴[] No.44380594[source]
    There are fonts designed to be legible at really small sizes. I wonder if there are voices that are especially understandable at extreme speeds.

    Could use an “auctioneer” voice to play back text at 10x speed.

    replies(1): >>44381442 #
    16. seabass ◴[] No.44380830[source]
    I made a super simplistic chrome extension for this. Doesn’t work on all websites, but YouTube and most online video courses are covered.

    https://github.com/sebastiansandqvist/video-speed-extension

    17. mrstone ◴[] No.44380831[source]
    > I wonder if there's a way to automatically detect how "fast" a person talks in an audio file.

    Hilbert transform and FFT to get phoneme rate would work.
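
    A rough sketch of the envelope-then-spectrum idea in plain Python. To stay dependency-free it swaps in simpler stand-ins: rectify-and-smooth instead of a Hilbert transform, and a naive DFT over a few low-frequency bins instead of an FFT. It finds the dominant amplitude-modulation rate, which is at best a crude proxy for syllable or phoneme rate. All names and parameters are assumptions for the example, and the input is a synthetic tone, not speech.

```python
import cmath
import math


def envelope(samples, window):
    """Rectify and smooth with a moving average (a crude amplitude envelope)."""
    rectified = [abs(s) for s in samples]
    out = []
    total = 0.0
    for i, r in enumerate(rectified):
        total += r
        if i >= window:
            total -= rectified[i - window]  # drop the sample leaving the window
        out.append(total / min(i + 1, window))
    return out


def dominant_rate(env, sample_rate, max_hz=20.0):
    """Return the strongest envelope modulation frequency in Hz, up to max_hz."""
    n = len(env)
    mean = sum(env) / n
    centered = [e - mean for e in env]  # remove DC so bin 0 does not dominate
    best_hz, best_mag = 0.0, 0.0
    k = 1
    while k * sample_rate / n <= max_hz:
        # One bin of a naive DFT; fine for a handful of low frequencies.
        acc = sum(c * cmath.exp(-2j * math.pi * k * i / n) for i, c in enumerate(centered))
        if abs(acc) > best_mag:
            best_hz, best_mag = k * sample_rate / n, abs(acc)
        k += 1
    return best_hz


# Synthetic stand-in for speech: a 100 Hz tone pulsed 4 times per second.
rate = 800
signal = [
    0.5 * (1 + math.cos(2 * math.pi * 4 * i / rate)) * math.sin(2 * math.pi * 100 * i / rate)
    for i in range(2 * rate)
]
pulses_per_sec = dominant_rate(envelope(signal, window=40), rate)
print(pulses_per_sec)
```

    Real speech is far messier than a pulsed tone, so a single spectral peak is unlikely to be this clean in practice.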

    18. bbatha ◴[] No.44381442{3}[source]
    I'm also a fast listener. I find audio quality is the main differentiator in my ability to listen quickly or not. A podcast recorded at high quality I can listen to at 3-4x (with silence trimmed) comfortably; the second someone calls in from their phone, I'm getting every 4th word and often need to go down to 2x or less. Mumbly accents are also a factor, but not as much; then again, I rarely have trouble understanding difficult accents IRL and almost never use subtitles on TV shows/YouTube to better understand the speaker. Your mileage may vary.

    I understand 4-6x speakers fairly well but don't enjoy listening at that pace. If I lose focus for a couple of seconds I effectively miss a paragraph of context and my brain can't fill in the missing details.

    19. maxall4 ◴[] No.44381453[source]
    Just count syllables per second by doing an FFT plus some basic analysis.
    replies(1): >>44386128 #
    20. JadeNB ◴[] No.44381970[source]
    Can't you use VLC to watch almost anything streamable, and then play at your desired speed?
    21. WalterSear ◴[] No.44383231[source]
    Better: just make everyone in the video speak at my comfortable speed.
    22. userbinator ◴[] No.44383539{3}[source]
    > I wonder if there are negative side effects of this, though. Do you notice that interacting with people who speak slower requires a greater deal of patience?

    You are basically training your brain to work faster, and I suspect that causes some changes in the structure of your memory; if someone speaks too slowly, I'll be more likely to forget what they said earlier, compared to if they quickly gave me the entire sentence.

    23. ars ◴[] No.44384356[source]
    I use this extension: https://mybrowseraddon.com/video-speed-control.html
    24. tucnak ◴[] No.44386128{3}[source]
    > FFT plus some basic analysis

    Yeah, totally easier than `len(transcribe(a))/len(a)`

    25. eitally ◴[] No.44387197[source]
    Recently, YT started supporting 4x playback for Premium subscribers, but only in the mobile app, not on the web.
    26. dTal ◴[] No.44387266[source]
    Compress it using a VBR speech codec and measure the compression ratio?
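
    A small sketch of the measurement side, assuming the clips were already encoded with the same VBR codec settings by some external tool. Whether the resulting bitrate actually tracks speaking speed is the untested conjecture here; the function names and byte counts are made up.

```python
def average_bitrate(encoded_bytes, duration_sec):
    """Average bits per second of an encoded audio file."""
    return encoded_bytes * 8 / duration_sec


def relative_density(bitrate, baseline_bitrate):
    """How much denser a recording is than a baseline speaker, as a ratio."""
    return bitrate / baseline_bitrate


# Made-up sizes for two 60-second clips encoded with identical settings.
slow_clip = average_bitrate(180_000, 60.0)
fast_clip = average_bitrate(240_000, 60.0)
density = relative_density(fast_clip, slow_clip)
print(slow_clip, fast_clip, density)
```

    Background noise and music also raise a VBR encoder's bitrate, so this would only be comparable across recordings of similar quality.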