695 points georgemandis | 86 comments
1. w-m ◴[] No.44378345[source]
With transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. He's one of the people for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.

In the spirit of making more of an OpenAI minute, don't send it any silence.

E.g.

    ffmpeg -i video-audio.m4a \
      -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
                         stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
                         apad=pad_dur=0.02" \
      -c:a aac -b:a 128k output_minpause.m4a -y
will cut the talk down from 39m31s to 31m34s by replacing any silence (at a -50dB threshold) longer than 20ms with a 20ms pause. And to keep with the spirit of your post, I only measured that the input file got shorter; I didn't look at the quality of the transcription from the shorter version.
replies(12): >>44378492 #>>44378769 #>>44378939 #>>44378971 #>>44380884 #>>44380906 #>>44381352 #>>44382788 #>>44382864 #>>44384720 #>>44388923 #>>44388970 #
2. georgemandis ◴[] No.44378492[source]
Oooh fun! I had a feeling there was more ffmpeg wizardry I could be leaning into here. I'll have to try this later—thanks for the idea!
replies(1): >>44378587 #
3. w-m ◴[] No.44378587[source]
In the meantime I realized that the apad part is nonsensical - it pads the end of the stream, not each silence-removed cut. I wanted to get angry at o3 for proposing this, but then I had a look at the silenceremove documentation myself: https://ffmpeg.org/ffmpeg-filters.html#silenceremove

Good god. You couldn't make that any more convoluted and hard-to-grasp if you wanted to. You gotta love ffmpeg!

I now think this might be a good solution:

    ffmpeg -i video-audio.m4a \
           -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
           -c:a aac -b:a 128k output.m4a -y
replies(3): >>44379478 #>>44380996 #>>44388001 #
4. pragmatic ◴[] No.44378769[source]
No not really? The talk where he babbles about OSes and everyone is somehow impressed?
5. behnamoh ◴[] No.44378939[source]
> His natural talking speed is already >=1.5x that of a normal human. He's one of the people for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.

I wonder if there's a way to automatically detect how "fast" a person talks in an audio file. I know it's subjective and different people talk at different paces, but it'd be cool to know when OP's trick fails (they mention 4x ruined the output; maybe for Karpathy that would happen at 2x).

replies(7): >>44379087 #>>44379461 #>>44379539 #>>44380162 #>>44380831 #>>44383231 #>>44387266 #
6. ◴[] No.44378971[source]
7. echelon ◴[] No.44379087[source]
> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file.

Stupid heuristic: take a segment of video, transcribe the text, and count words per utterance duration. If you need speaker diarization, handle each speaker's utterance durations independently. You can slice further, e.g. by syllable count.
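A minimal sketch of that heuristic, assuming a local Whisper install (the openai-whisper package plus ffmpeg) and treating its segment timestamps as utterance boundaries; the filename is a placeholder:

    # Rough words-per-second estimate from a local Whisper transcript.
    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("video-audio.m4a")  # placeholder input

    # Sum speech time from segment timestamps rather than file length,
    # so silent stretches don't drag the rate down.
    speech = sum(s["end"] - s["start"] for s in result["segments"])
    words = sum(len(s["text"].split()) for s in result["segments"])
    print(f"{words / speech:.2f} words/sec over {speech:.0f}s of speech")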

replies(1): >>44379192 #
8. nand4011 ◴[] No.44379192{3}[source]
https://www.science.org/doi/10.1126/sciadv.aaw2594

Apparently human language conveys information at around 39 bits/s. You could use a technique like that paper's to determine a speaker's information rate, then correct it to 39 bits/s by changing the speed of the video.
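As a toy illustration of that correction (the bits-per-word figure below is a made-up placeholder, not a number from the paper):

    # Playback factor that would normalize a speaker to ~39 bits/s.
    TARGET_BITS_PER_SECOND = 39.0
    ASSUMED_BITS_PER_WORD = 12.0  # hypothetical stand-in for a real estimate

    def playback_factor(words_per_second: float) -> float:
        measured = words_per_second * ASSUMED_BITS_PER_WORD
        return TARGET_BITS_PER_SECOND / measured

    print(playback_factor(2.5))  # ~1.3x speed-up for a 2.5 words/s speaker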

9. varispeed ◴[] No.44379461[source]
It's a shame platforms don't generally support speeds greater than 2x. One of my "superpowers" - or a curse - is that I cannot stand a normal speaking pace. When I watch lectures, I always go for maximum speed, and even that is too slow for me. I wish platforms included 4x, done properly (with minimal artefacts).
replies(10): >>44379513 #>>44379536 #>>44379612 #>>44379810 #>>44379982 #>>44380594 #>>44380830 #>>44381970 #>>44384356 #>>44387197 #
10. snickerdoodle12 ◴[] No.44379478{3}[source]
I love ffmpeg but the documentation is often close to incomprehensible.
11. lofaszvanitt ◴[] No.44379513{3}[source]
Robot in a human body identified :D.
12. mrmuagi ◴[] No.44379536{3}[source]
All audiobooks are like this for me. I tried it for lectures, but if I'm taking handwritten notes, my writing can't keep up.

I wonder if there are negative side effects to this, though. Do you notice that interacting with people who speak slower requires a greater deal of patience?

replies(3): >>44379957 #>>44380513 #>>44383539 #
13. btown ◴[] No.44379539[source]
Even a last-decade transcription model could be used to detect a rough number of syllables per unit time, and the accuracy of that model could be used to guide speed-up and dead-time detection before sending to a more expensive model. As with all things, it's a question of whether the cost savings justify the engineering work.
14. dpcx ◴[] No.44379612{3}[source]
https://github.com/codebicycle/videospeed has been a wonderful addition for me.
15. ◴[] No.44379810{3}[source]
16. colechristensen ◴[] No.44379957{4}[source]
No, but a little. I struggle with people who repeat every point they're making several times, or who, when you say "you told me exactly this the last time we spoke", cannot be stopped from retelling the whole thing verbatim. Usually in those situations there are potential cognitive issues at play, so you can only be understanding.
17. narratives1 ◴[] No.44379982{3}[source]
I use a Chrome extension that lets you take any video player (including embedded ones) to 10x speed. I turn most things to 3-4x. It works on ads too.
replies(1): >>44380424 #
18. janalsncm ◴[] No.44380162[source]
> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file

Transcribe it locally using whisper and output tokens/sec?

replies(1): >>44381453 #
19. munch117 ◴[] No.44380424{4}[source]
I use a bookmarklet:

javascript:void%20function(){document.querySelector(%22video,audio%22).playbackRate=parseFloat(prompt(%22Set%20the%20playback%20rate%22))}();

20. hamburglar ◴[] No.44380513{4}[source]
I once attended a live talk by Leslie Lamport and as he talked, I had the overwhelming feeling that something was wrong, and was thinking “did he have a stroke or something?” but then I realized I had just always watched his lectures online and had become accustomed to listening to him at 2x.
21. cookingrobot ◴[] No.44380594{3}[source]
There are fonts designed to be legible at really small sizes. I wonder if there are voices that are especially understandable at extreme speeds.

You could use an "auctioneer" voice to play back text at 10x speed.

replies(1): >>44381442 #
22. seabass ◴[] No.44380830{3}[source]
I made a super simplistic chrome extension for this. Doesn’t work on all websites, but YouTube and most online video courses are covered.

https://github.com/sebastiansandqvist/video-speed-extension

23. mrstone ◴[] No.44380831[source]
> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file.

A Hilbert transform plus an FFT to get the phoneme rate would work.
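A sketch of that idea, assuming scipy and a mono WAV input; syllable-rate modulation in speech typically sits around 2-10 Hz, so the strongest peak of the envelope spectrum in that band is a crude rate estimate:

    # Estimate syllable rate from the amplitude envelope of speech.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import hilbert

    rate, audio = wavfile.read("speech.wav")  # assumed mono
    audio = audio.astype(np.float64)

    # Envelope via the analytic signal (Hilbert transform).
    envelope = np.abs(hilbert(audio))

    # FFT of the mean-removed envelope; pick the strongest 2-10 Hz component.
    spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / rate)
    band = (freqs >= 2) & (freqs <= 10)
    print(f"~{freqs[band][spectrum[band].argmax()]:.1f} syllables/sec")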

24. brunoborges ◴[] No.44380884[source]
The interesting thing here is that OpenAI likely has a layer that trims videos down exactly as you suggest, so they can charge for the full length while it costs them less to actually process the content.
25. cbsmith ◴[] No.44380906[source]
That's an amusing perspective. I really struggle with watching any video at double speed, but I've never had trouble listening to any of his talks at 1x. To me, he seems to speak at a perfectly reasonable pace.
26. squigz ◴[] No.44380996{3}[source]
Out of curiosity, how might you improve those docs? They seem fairly reasonable to me
replies(1): >>44381944 #
27. swyx ◴[] No.44381352[source]
> I didn't look at the quality of the transcription from the shorter version.

guys how hard is it to toss both versions into like diffchecker or something haha you're just comparing text

replies(1): >>44382070 #
28. bbatha ◴[] No.44381442{4}[source]
I'm also a fast listener. I find audio quality is the main differentiator in how quickly I can listen. A podcast recorded at high quality I can comfortably listen to at 3-4x (with silence trimmed); the second someone calls in from their phone, I'm getting every 4th word and often need to go down to 2x or less. Mumbly accents are also a factor, but not as much; then again, I rarely have trouble understanding difficult accents IRL and almost never use subtitles on TV shows/YouTube to better understand the speaker. Your mileage may vary.

I understand 4-6x speakers fairly well but don't enjoy listening at that pace. If I lose focus for a couple of seconds I effectively miss a paragraph of context and my brain can't fill in the missing details.

29. maxall4 ◴[] No.44381453{3}[source]
Just count syllables per second by doing an FFT plus some basic analysis.
replies(1): >>44386128 #
30. w-m ◴[] No.44381944{4}[source]
The documentation reads like it was written by a programmer who documented the different parameters of their implementation of a specific algorithm. Now when you as the user come along and want to use silenceremove, you have to carefully read through this, build your own mental model of that algorithm, and only then can you set these parameters accordingly. That takes a lot of time and energy - in this case multiple read-throughs and, I'd say, more than 5 minutes.

Good documentation should do this work for you. It should explain somewhat atomic concepts that you can immediately adopt and compose. Where this already works is the "detection" and "window" parameters, which are straightforward. But the actions of trimming at the start/middle/end, how to configure how long the silence lasts before trimming, whether to ignore short bursts of noise, whether to skip every nth silence period - these are all ideas and concepts that get mushed together into 10 parameters called start/stop-duration/threshold/silence/mode/periods.

If you want to apply this filter, it takes a long time to build mental models for these 10 parameters. You do have some example calls, which is great, but that doesn't help if you need to adjust any of them - then you probably need to understand them all.

Some stuff I stumbled over when reading it:

"To remove silence from the middle of a file, specify a stop_periods that is negative. This value is then treated as a positive value [...]" - what? Why is this parameter so heavily overloaded?

"start_duration: Specify the amount of time that non-silence must be detected before it stops trimming audio" - parameter is named start_something, but it's about stopping? Why?

"start_periods: [...] Normally, [...] start_periods will be 1 [...]. Default value is 0."

"start_mode: Specify mode of detection of silence end at start": start_mode end at start?

It's very clunky. Every parameter has multiple modes of operation. Why is it start and stop for beginning and end, and why is "do stuff in the middle" part of the end? Why is there no global mode?

You could nitpick this stuff to death. In the end, naming things is famously one of the two hard problems in computer science (the others being cache invalidation and off-by-one errors). And writing good documentation is also very, very hard work. Just exposing the internals of the algorithm is often not great UX, because then every user has to learn how the thing works internally before they can start using it (hey, looking at you, git).

So while it's easy to point out where these docs fail, it would be a lot of work to rewrite this documentation from the top down, explaining the concepts first. Or even rewriting the interface to make this more approachable, and the parameters less overloaded. But since it's hard work, and not sexy to programmers, it won't get done, and many people will come after, having to spend time on reading and re-reading this current mess.

replies(3): >>44386272 #>>44386483 #>>44388611 #
31. JadeNB ◴[] No.44381970{3}[source]
Can't you use VLC to watch almost anything streamable, and then play at your desired speed?
32. TimorousBestie ◴[] No.44382070[source]
Why use diffchecker when there’s a perfectly good LLM you could ask right there? lol
replies(2): >>44382758 #>>44383232 #
33. serf ◴[] No.44382758{3}[source]
because a lot of LLMs will just eat tokens to call a diff checker anyway.

Really, it becomes a question of whether the friction of invoking the command or the cost of the tokens is greater.

As I get older and more RSI'd, the tokens seem cheaper.

34. QuantumGood ◴[] No.44382788[source]
I wish there were a 2.25x YouTube option for "normal" humans. I already use every shortcut and listen at 2x 90% of the time. But Andrej I can't take faster than 1.25x.
replies(4): >>44383398 #>>44384354 #>>44386815 #>>44388627 #
35. nickjj ◴[] No.44382864[source]
Andrej's talk seemed normal to listen to at 2x, but I've also listened to everything at 2x for a long time.

Unfortunately, a byproduct of listening to everything at 2x is that a number of folks have said they have to watch my videos at 0.75x, but even when I play back my own videos they feel painfully slow unless they're at 2x.

For reference, I've always found John Carmack's pacing perfect / natural and watchable at 2x too.

A recent video of mine is https://www.youtube.com/watch?v=pL-qft1ykek. It was posted on HN by someone else the other day, so I'm not trying to do any self-promotion here; it's just an example of a recent video I put up, and I'm genuinely curious whether anyone finds it too fast or normal. It's a regular unscripted video where I have a rough idea of what I want to cover, then turn on the mic, start recording and let it pan out organically. If I had to guess, I'd say the last ~250-300 videos were recorded this way.

replies(10): >>44383021 #>>44383169 #>>44383237 #>>44383507 #>>44383753 #>>44383906 #>>44385284 #>>44386182 #>>44387311 #>>44388274 #
36. noahjk ◴[] No.44383021[source]
To me you talk at what I would consider "1.2x" of podcast speed (which to me is a decent average measure of spoken-word speed; I usually do 1.5x on all podcasts). You're definitely still in the normal distribution for tech YouTubers, in my experience; in fact it feels like a lot of tech YouTubers talk like they've had a bit too much Adderall, but you don't come off that way. Naturally, people may choose to slow down tutorials, because the person giving the tutorial can never truly know what someone learning would or wouldn't understand. So overall I think your speed is totally fine! Also, very timely video; I was interested in this exact topic, so I'm happy I found this.
replies(1): >>44383355 #
37. SavioMak ◴[] No.44383169[source]
Yeah, you sound around 1.25-1.5x the speed of the average video I watch
38. WalterSear ◴[] No.44383231[source]
Better: just make everyone in the video speak at my comfortable speed.
39. trashchomper ◴[] No.44383232{3}[source]
Assuming sarcasm but if not, because deterministic vs. nondeterministic output?
replies(2): >>44383836 #>>44388267 #
40. viraptor ◴[] No.44383237[source]
> Andrej's talk seemed normal to listen at 2x but I've also listened to everything at 2x for a long time.

We get used to higher speeds when we consume a lot of content that way. Have you heard the systems used by experienced blind people? I cannot even understand the words in them, but months of training would probably fix that.

replies(1): >>44383518 #
41. eru ◴[] No.44383355{3}[source]
> "[I]n fact it feels like a lot of tech YouTube talks like they've had a bit too much adderall, [...]"

Funnily enough, if you actually have ADHD, then stimulants like adderall or even nicotine, will calm you down.

> Naturally people may choose to slow down tutorials, [...]

For me it also depends on what mood I'm in and whether I'm doing anything else at the same time. If I'm fully concentrating on a video, 2x is often fine. If I'm doing some physical task at the same time, I need it slower than that.

If I'm doing a mental task at the same time, I can forget about getting anything out of the video. At least if the mental task involves any words. So e.g. I could probably still follow a technical discussion at roughly 1x speed while playing Tetris, but not while coding.

replies(1): >>44383644 #
42. zamadatix ◴[] No.44383398[source]
YouTube ran an experiment with up to 4x playback on mobile (???), but it went away in February. I get that a lot of the experiments they run are just experiments, but why simply letting the slider go farther is such a back-and-forth hoopla is beyond me. It's one of the oft-touted features of 3rd-party apps and extensions, with nearly zero UI impact on those who don't want to use it (just don't slide the slider past 2x if you don't want to go past 2x).

https://www.theverge.com/news/603581/youtube-premium-experim...

replies(2): >>44383784 #>>44385490 #
43. userbinator ◴[] No.44383507[source]
> but even when I play back my own videos they feel painfully slow unless they're at 2x.

Watching your video at 1x still feels too slow, and it's just right for me at 2x speed (that's approximately how fast I normally talk if others don't tell me to slow down), although my usual YouTube watching speed is closer to 2.5-3x. That is to say, you're still faster than a lot of others.

I think it just takes practice --- I started at around 1.25x for videos, and slowly moved up from there. As you have noticed, once you've consumed enough sped-up content, your own speaking speed will also naturally increase.

44. userbinator ◴[] No.44383518{3}[source]
You can achieve a similar, less permanent effect by closing your eyes; I often do it when I'm on a call and the person on the other end is extremely difficult to understand.
45. userbinator ◴[] No.44383539{4}[source]
> I wonder if there are negative side effects to this, though. Do you notice that interacting with people who speak slower requires a greater deal of patience?

You are basically training your brain to work faster, and I suspect that causes some changes in the structure of your memory; if someone speaks too slowly, I'm more likely to forget what they said earlier than if they'd quickly given me the entire sentence.

46. Tyr42 ◴[] No.44383644{4}[source]
Driving is a hard 1.0 for me. But otherwise 2.0 is good.
47. fuzztester ◴[] No.44383753[source]
James Goodnight of SAS Institute:

https://en.m.wikipedia.org/wiki/James_Goodnight

I have watched one or two videos of his, and he spoke slowly, compared to the average person. I liked that. It sounded good.

48. K2L8M11N2 ◴[] No.44383784{3}[source]
As a premium subscriber I currently have 4x available on Android and they recently (in the last month) added it to web too
49. TimorousBestie ◴[] No.44383836{4}[source]
Not sarcasm, just a little joke. I thought the emote at the end would prevent it from being taken seriously...
50. makeitdouble ◴[] No.44383906[source]
Your video sounded a tad fast at 2x and pretty fine at 1.5x.

Now, I think the right speed adjustment comes less from the person's natural speaking pace than from the subject matter.

I'm thinking of a channel like Accented Cinema (https://youtu.be/hfruMPONaYg): it has a slowish talking pace, but since there's a visual component going on at all times, it doesn't actually feel slow to my ear.

I felt the same about videos explaining concepts I have no familiarity with, so I see it as how fast the brain can process the info, more than the talking speed per se.

51. ars ◴[] No.44384354[source]
Install this: https://mybrowseraddon.com/video-speed-control.html

I listen to a lot of videos at 3x or even 4x.

52. ars ◴[] No.44384356{3}[source]
I use this extension: https://mybrowseraddon.com/video-speed-control.html
53. jwrallie ◴[] No.44384720[source]
From my own experience with whisper.cpp, normalizing the audio and removing silence not only shortens processing time significantly, it also improves transcription quality a lot, since silence can mean hallucinations. You can do this graphically with Audacity too, if you don't want to deal with the command line. You also don't need any special hardware to run whisper.cpp; with the small model, literally any computer should be able to do it if you can wait a bit (less than the audio length).

One half-interesting / half-depressing observation I made: at my workplace, any meeting recording I transcribed this way had its length reduced to almost 2/3 when cutting out the silence. Makes you think about the efficiency (or lack thereof) of holding long(ish) meetings.

replies(3): >>44384975 #>>44385016 #>>44388493 #
54. d1sxeyes ◴[] No.44384975[source]
1/3 of the meeting is silence? That’s a good thing. It’s allowing people time to think over what they’re hearing, there are pauses to allow people to contribute or participate. What do you think a better percentage of silent time would be?
replies(1): >>44386518 #
55. sudhirj ◴[] No.44385016[source]
If a human meeting had a lot of silence (assuming it's between words and not before/after), I would consider it a very efficient meeting, where just enough information was exchanged with adequate absorption, processing and response time.
56. retsibsi ◴[] No.44385284[source]
Your speaking speed is noticeably faster than usual, but I think it's good for this kind of video. When the content is really dense and every word is chosen for maximum information value, a slower speed would be good, but for relatively natural speech with a normal amount of redundancy I think it's fine to go at this speed.
57. zelphirkalt ◴[] No.44385490{3}[source]
Probably because they are "A/B testing" things that don't really show much effect, or that depend on more circumstances than they care to eliminate, and then they overinterpret the results. Like almost all corporate A/B testing.
58. tucnak ◴[] No.44386128{4}[source]
> FFT plus some basic analysis

Yeah, totally easier than `len(transcribe(a))/len(a)`

replies(1): >>44392295 #
59. quietbritishjim ◴[] No.44386182[source]
Your actual speed of talking sounds a little faster than average but not notably so.

But it feels (very subjectively) faster to me than usual because you don't really seem to take any pauses. It's like the whole video is a single run-on sentence that I keep buffering, but I never get a chance to process it and flush the buffer.

60. phito ◴[] No.44386272{5}[source]
> naming things is famously one of the two hard problems in computer science

Isn't ffmpeg made by a French person? As a francophone myself, I can tell you one of the biggest weaknesses of francophone programmers is naming things, and it's even worse in English. Maybe that's what's at play here.

61. ada1981 ◴[] No.44386483{5}[source]
Curious if this is helpful.

https://claude.ai/public/artifacts/96ea8227-48c3-484d-b30b-6...

I had Claude rewrite the documentation for silenceremove based on your feedback.

62. jwrallie ◴[] No.44386518{3}[source]
Good point. Somehow, if I think of a 30-minute meeting, 10 minutes of silence sounds great, but seeing a 1-hour block disappear from a 3-hour recording makes me want to use that "free" hour for something else.

Well, I don't think silence is the real problem with a 3-hour meeting!

replies(1): >>44386995 #
63. david_allison ◴[] No.44386815[source]
I have up to 4x (in steps of 0.05) with YouTube Premium on Android
64. literalAardvark ◴[] No.44386995{4}[source]
If people could speak continuously for an entire meeting, then that meeting would be better off as an email. Meetings are for bouncing half-formed ideas around and coagulating them into something greater.

There MUST be time to think.

65. eitally ◴[] No.44387197{3}[source]
Recently, YT started supporting 4x playback for Premium subscribers, but only in the mobile app, not on the web.
66. dTal ◴[] No.44387266[source]
Compress it using a VBR speech codec and measure the compression ratio?
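One rough way to try that, assuming ffmpeg/ffprobe with libopus on PATH - denser speech should come out at a higher achieved VBR bitrate:

    # Encode with a VBR speech codec and report the achieved bitrate.
    import os
    import subprocess

    src, probe = "video-audio.m4a", "probe.opus"  # placeholder filenames
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:a", "libopus",
         "-b:a", "24k", "-vbr", "on", probe],
        check=True, capture_output=True,
    )
    duration = float(subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=nw=1:nk=1", probe],
        check=True, capture_output=True, text=True,
    ).stdout)
    print(f"{os.path.getsize(probe) * 8 / duration / 1000:.1f} kbit/s achieved")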
67. fortran77 ◴[] No.44387311[source]
I always listen to YouTube and podcasts at 1.5x. And when I meet a YouTuber/podcaster IRL, I'm always annoyed at how slowly they speak.
68. dylan604 ◴[] No.44388001{3}[source]
If you did it in 2 passes, you could find the cut points using silencedetect, use a bunch of -ss/-t/-i arguments based on those segments, and apad each segment with a -filter_complex chain that ends in a concat. It would be a wonderfully gnarly command for very little benefit, but it could be done.
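A sketch of pass 1, assuming ffmpeg's silencedetect filter (which logs silence_start/silence_end to stderr); the apad/concat second pass is left out:

    # Pass 1: find cut points with silencedetect, emit -ss/-t pairs for pass 2.
    import re
    import subprocess

    log = subprocess.run(
        ["ffmpeg", "-i", "video-audio.m4a",
         "-af", "silencedetect=noise=-50dB:d=0.5", "-f", "null", "-"],
        capture_output=True, text=True,
    ).stderr

    starts = [float(x) for x in re.findall(r"silence_start: ([\d.]+)", log)]
    ends = [float(x) for x in re.findall(r"silence_end: ([\d.]+)", log)]

    # Speech segments are the gaps between detected silences.
    cursor = 0.0
    for s, e in zip(starts, ends):
        if s > cursor:
            print(f"-ss {cursor:.2f} -t {s - cursor:.2f}")
        cursor = e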
69. Der_Einzige ◴[] No.44388267{4}[source]
Make it semi deterministic with structured/constrained generation!
70. Der_Einzige ◴[] No.44388274[source]
This, btw, is also why spreading (speed reading) happens in American competitive debate. It gets ridiculed online, but that's exactly why it happens.

https://en.wikipedia.org/wiki/Spreading_(debate)

replies(1): >>44389552 #
71. dogprez ◴[] No.44388493[source]
Others pointed out the value of silence, but I just wanted to say it saddens me when humanity is misclassified as inefficiency. The other day Sam Altman made a jest about how much energy is wasted by people saying "thanks" to chatgpt. The corollary is how much human energy is wasted on humans saying thanks to each other. When making a judgement about inefficiency one is making a judgement on what is valuable, a very biased judgement that isn't necessarily aligned with what makes us thrive. =) (<-- a wasteful smiley)
replies(3): >>44389094 #>>44389488 #>>44390288 #
72. zahlman ◴[] No.44388611{5}[source]
> "start_mode: Specify mode of detection of silence end at start": start_mode end at start?

In "start_mode", "start" means "initial", and "mode" means "method". But specifically, it's a method of figuring out where the silence ends.

> In the end, naming things is famously one of the two hard problems in computer science

It's also one of the hard problems in English.

73. zahlman ◴[] No.44388627[source]
Meanwhile, I've found that just reading the transcript is often good enough.
74. vayup ◴[] No.44388923[source]
Gemini charges by tokens rather than minutes. I used VAD (voice activity detection) to trim silence, hoping the token count would go down. I noticed the token count wasn't much different (e.g. 30 seconds of background noise had about the same count as 2 seconds). Either the Gemini API trims silence under the hood, or tokenization depends on the speech content rather than the length. Not sure which.

In either case, I bet OpenAI is doing the same optimization under the hood and keeping the savings for themselves.
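For reference, a minimal version of that kind of VAD trim, assuming the webrtcvad package and a 16 kHz 16-bit mono WAV (filenames are placeholders):

    # Keep only the frames the VAD classifies as speech.
    import wave
    import webrtcvad

    vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) to 3 (strict)

    with wave.open("input.wav", "rb") as f:
        assert f.getframerate() == 16000 and f.getnchannels() == 1
        pcm = f.readframes(f.getnframes())

    frame = int(16000 * 0.03) * 2  # 30 ms of 16-bit mono samples
    voiced = b"".join(
        pcm[i:i + frame]
        for i in range(0, len(pcm) - frame, frame)
        if vad.is_speech(pcm[i:i + frame], 16000)
    )

    with wave.open("trimmed.wav", "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(voiced)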

75. CSMastermind ◴[] No.44388970[source]
> to set your YouTube speed back down to 1x

Is it common for people to watch Youtube sped up?

I've heard of people doing this for podcasts and audiobooks and never understood it all that much there. Just feels like 'skimming' a real book instead of actually reading it.

replies(4): >>44389007 #>>44389010 #>>44389033 #>>44389268 #
76. Feathercrown ◴[] No.44389007[source]
Some people talk slower than your natural listening speed. It's less like skimming and more like if some books used 36pt font and you normalized the size back down to a comfortable information-dense size.
77. Eezee ◴[] No.44389010[source]
That's completely different. Imagine you are reading a book and the words are only revealed to you at 1 word per second. You would get annoyed if your natural reading speed were higher than that.

Same with a video. A lot of people speak considerably slower than you can process the information they are conveying, so you speed it up. You still get the same content and aren't skipping parts as you would when skimming a book.

78. keithxm23 ◴[] No.44389033[source]
Often, I'll come across speakers who just speak slowly and listening at 1.5x or 2x barely feels sped-up.

Additionally, the brain tends to adjust to a faster talking speed very quickly. If I'm watching an average-paced person talk and speed them up by 2x, the first couple minutes of listening might be difficult and will require more intent-listening. However, the brain starts processing it as the new normal and it does not feel sped-up anymore. To the extent that if I go back to 1x, it feels like the speaker is way too slow.

79. kristianbrigman ◴[] No.44389094{3}[source]
I’ll remember that you told me thanks. Will chatgpt? (Honestly curious… it’s possible)
replies(2): >>44389146 #>>44389542 #
80. Salgat ◴[] No.44389146{4}[source]
I say thanks for my own well-being too.
81. 83 ◴[] No.44389268[source]
> Just feels like 'skimming' a real book instead of actually reading it.

That's the goal for me lately. I primarily use YouTube for technical assistance (where are the screws to adjust this carburetor? how do I remove this brake hub? etc.). There used to be short 1-2 minute videos on this kind of stuff, but nowadays I have to suffer through a 10-15 minute video with multiple ad breaks.

So now I always watch YouTube at 2x speed while rapidly jumping the slider forward to find the relevant portions.

82. Philip-J-Fry ◴[] No.44389488{3}[source]
Well, humans saying thanks to each other isn't wasted energy. It has a real effect on our relationships.

People say thank you to AI because it's portrayed as a human-like chat bot, but in reality it has almost no effect on how well it responds to our queries.

Saying thank you to ChatGPT is no less wasteful than saying thank you to Windows for opening the calculator.

I don't think anyone is trying to draw a parallel between that inefficiency and real humans saying thank you.

83. rz2k ◴[] No.44389542{4}[source]
I get the impression that it sets a tone that encourages creative, more open ended responses.

I think this is the reverse of confrontation with the LLM. Typically if you get a really dumb response, it is better to hang up the conversation and completely start over than it is to tell the LLM why it is wrong. Once you start arguing, they start getting stupider and respond with even faultier logic as they try to appease you.

I suppose it makes sense if the training involves alternate modes of discourse, resembling either two educated people in a forum with shared intellectual curiosity and a common goal, or two people having a ridiculous internet argument.

84. hooverd ◴[] No.44389552{3}[source]
They should put an upper WPM limit on competitive debate, like F1 does with certain car parts.
85. mulmen ◴[] No.44390288{3}[source]
Humans are inefficient. The mistake is making a moral judgement about that.
86. janalsncm ◴[] No.44392295{5}[source]
Maybe not as quick to code up, but way faster to run.

The tokens/second can be used as ground-truth labels for an FFT -> small neural net model.
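A hedged sketch of that setup - the features and labels below are synthetic stand-ins; in practice X would come from real audio and y from a one-time Whisper pass:

    # Tiny regressor: cheap FFT-band features in, tokens/sec out.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def envelope_bands(audio: np.ndarray, bands: int = 32) -> np.ndarray:
        # Log-pooled FFT magnitudes of the amplitude envelope.
        env = np.abs(audio)
        spec = np.abs(np.fft.rfft(env - env.mean()))
        return np.log1p(np.array([c.sum() for c in np.array_split(spec, bands)]))

    rng = np.random.default_rng(0)
    X = np.stack([envelope_bands(rng.normal(size=16000)) for _ in range(200)])
    y = rng.uniform(1.5, 4.0, size=200)  # stand-in tokens/sec labels

    model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    model.fit(X, y)
    print(model.predict(X[:3]))  # rate estimates without running Whisper again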