Most active commenters

georgemandis(10)
rob(4)
colechristensen(4)
bravesoul2(3)
w-m(3)
fallinditch(3)
userbinator(3)

Popular/hot comments

>>44378345 #
>>44379461 #
>>44382864 #
>>44378250 #
>>44378939 #
>>44379019 #
>>44378391 #
>>44378167 #
>>44379183 #
>>44378501 #
>>44379536 #
>>44379755 #
>>44382788 #

OpenAI charges by the minute, so speed up your audio

(george.mand.is)

1. georgemandis ◴[25 Jun 25 13:17 UTC] No.44376990[source]▶

I was trying to summarize a 40-minute talk with OpenAI’s transcription API, but it was too long. So I sped it up with ffmpeg to fit within the 25-minute cap. It worked quite well (Up to 3x speeds) and was cheaper and faster, so I wrote about it.

Felt like a fun trick worth sharing. There’s a full script and cost breakdown.

replies(1): >>44378167 #

2. ada1981 ◴[25 Jun 25 14:40 UTC] No.44377893[source]▶

>>44376989 (OP) #

We discovered this last month.

There is also prob a way to send a smaller sampler of audio at diff speeds and compare them to get a speed optimization with no quality loss unique for each clip.

replies(2): >>44378355 #>>44378780 #

3. brendanfinan ◴[25 Jun 25 14:41 UTC] No.44377908[source]▶

>>44376989 (OP) #

would this also work for my video consisting of 10,000 PDFs?

https://news.ycombinator.com/item?id=44125598

replies(2): >>44378308 #>>44384623 #

4. mcc1ane ◴[25 Jun 25 15:00 UTC] No.44378132[source]▶

>>44376989 (OP) #

Longer*

replies(1): >>44383424 #

5. simonw ◴[25 Jun 25 15:02 UTC] No.44378164[source]▶

>>44376989 (OP) #

There was a similar trick which worked with Gemini versions prior to Gemini 2.0: they charged a flat rate of 258 tokens for an image, and it turns out you could fit more than 258 tokens of text in an image of text and use that for a discount!

replies(1): >>44386602 #

6. bravesoul2 ◴[25 Jun 25 15:03 UTC] No.44378167[source]▶

>>44376990 #

You could have kept quiet and started a cheaper than openai transcription business :)

replies(4): >>44378890 #>>44379081 #>>44379840 #>>44380550 #

7. heeton ◴[25 Jun 25 15:10 UTC] No.44378250[source]▶

>>44376989 (OP) #

A point on skimming vs taking the time to read something properly.

I read a transcript + summary of that exact talk. I thought it was fine, but uninteresting, I moved on.

Later I saw it had been put on youtube and I was on the train, so I watched the whole thing at normal speed. I had a huge number of different ideas, thoughts and decisions, sparked by watching the whole thing.

This happens to me in other areas too. Watching a conference talk in person is far more useful to me than watching it online with other distractions. Watching it online is more useful again than reading a summary.

Going for a walk to think about something deeply beats a 10 minute session to "solve" the problem and forget it.

Slower is usually better for thinking.

replies(6): >>44378391 #>>44378560 #>>44379201 #>>44379324 #>>44379750 #>>44380419 #

8. jasonjmcghee ◴[25 Jun 25 15:16 UTC] No.44378308[source]▶

>>44377908 #

I can't tell if this is a meme or not.

And if someone had this idea and pitched it to Claude (the model this project was vibe coded with) it would be like "what a great idea!"

9. b0a04gl ◴[25 Jun 25 15:17 UTC] No.44378320[source]▶

>>44376989 (OP) #

it's still decoding every frame and matching phonemes either way, but speeding it up reduces how many seconds they bill you for. so you may hack their billing logic more than the model itself.

also means the longer you talk, the more you pay even if the actual info density is the same. so if your voice has longer pauses or you speak slow, you maybe subsidizing inefficiency.

makes me think maybe the next big compression is in delivery cadence. just auto-optimize voice tone and pacing before sending it to LLM. feed it synthetic fast speech with no emotion, just high density words. you lose human warmth but gain 40% cost savings

10. timerol ◴[25 Jun 25 15:19 UTC] No.44378334[source]▶

>>44376989 (OP) #

> Is It Accurate?

> I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle-up for this future we're hurtling toward. Boy, howdy.

This is a great bit of work, and the author accurately summarizes my discomfort

replies(2): >>44381178 #>>44384572 #

11. jasonjmcghee ◴[25 Jun 25 15:20 UTC] No.44378337[source]▶

>>44376989 (OP) #

Heads up, the token cost breakdown tables look white on white to me. I'm in dark mode on iOS using Brave.

replies(1): >>44378604 #

12. w-m ◴[25 Jun 25 15:21 UTC] No.44378345[source]▶

>>44376989 (OP) #

With transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. One of the people you absolutely have to set your YouTube speed back down to 1x when listening to follow what's going on.

In the idea of making more of an OpenAI minute, don't send it any silence.

E.g.

    ffmpeg -i video-audio.m4a \
      -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
                         stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
                         apad=pad_dur=0.02" \
      -c:a aac -b:a 128k output_minpause.m4a -y

will cut the talk down from 39m31s to 31m34s, by replacing any silence (with a -50dB threshold) longer than 20ms by a 20ms pause. And to keep with the spirit of your post, I measured only that the input file got shorter, I didn't look at all at the quality of the transcription by feeding it the shorter version.

replies(10): >>44378492 #>>44378769 #>>44378939 #>>44378971 #>>44380884 #>>44380906 #>>44381352 #>>44382788 #>>44382864 #>>44384720 #

13. moralestapia ◴[25 Jun 25 15:22 UTC] No.44378355[source]▶

>>44377893 #

>We discovered this last month.

Nice. Any blog post, twitter comment or anything pointing to that?

14. pluc ◴[25 Jun 25 15:25 UTC] No.44378391[source]▶

>>44378250 #

Seriously this is bonkers to me. I, like many hackers, hated school because they just threw one-size-fits-all knowledge at you and here we are, paying for the privilege to have that in every facet of our lives.

Reading is a pleasure. Watching a lecture or a talk and feeling the pieces fall into place is great. Having your brain work out the meaning of things is surely something that defines us as a species. We're willingly heading for such stupidity, I don't get it. I don't get how we can all be so blind at what this is going to create.

replies(4): >>44378472 #>>44379429 #>>44380070 #>>44380311 #

15. babuloseo ◴[25 Jun 25 15:26 UTC] No.44378400[source]▶

>>44376989 (OP) #

I use the youtube trick, will share it here, but upload to youtube and use their built in transcription service to translate to text for you, and than use gemini pro 2.5 to rebuild the transcript.

ffmpeg \ -f lavfi \ -i color=c=black:s=1920x1080:r=5 \ -i file_you_want_transcripted.wav \ -c:v libx264 \ -preset medium \ -tune stillimage \ -crf 28 \ -c:a aac \ -b:a 192k \ -pix_fmt yuv420p \ -shortest \ file_you_upload_to_youtube_for_free_transcripts.mp4

This works VERY well for my needs.

16. KTibow ◴[25 Jun 25 15:28 UTC] No.44378430[source]▶

>>44376989 (OP) #

This is really interesting, although the cheapest route is still to use an alternative audio-compatible LLM (Gemini 2.0 Flash Lite, Phi 4 Multimodal) or an alternative host for Whisper (Deepinfra, Fal).

17. hooverd ◴[25 Jun 25 15:32 UTC] No.44378472{3}[source]▶

>>44378391 #

If you're not listening to summaries of different audiobooks at 2x speed in each ear you're not contentmaxing.

replies(2): >>44382161 #>>44383148 #

18. georgemandis ◴[25 Jun 25 15:33 UTC] No.44378492[source]▶

>>44378345 #

Oooh fun! I had a feeling there was more ffmpeg wizardry I could be leaning into here. I'll have to try this later—thanks for the idea!

replies(1): >>44378587 #

19. fallinditch ◴[25 Jun 25 15:34 UTC] No.44378501[source]▶

>>44376989 (OP) #

When extracting transcripts from YouTube videos, can anyone give advice on the best (cost effective, quick, accurate) way to do this?

I'm confused because I read in various places that the YouTube API doesn't provide access to transcripts ... so how do all these YouTube transcript extractor services do it?

I want to build my own YouTube summarizer app. Any advice and info on this topic greatly appreciated!

replies(3): >>44378546 #>>44379137 #>>44381640 #

20. vjerancrnjak ◴[25 Jun 25 15:39 UTC] No.44378546[source]▶

>>44378501 #

If YouTube placed autogenerated captions you can download them free of charge with yt-dlp.

21. georgemandis ◴[25 Jun 25 15:40 UTC] No.44378560[source]▶

>>44378250 #

For what it's worth, I completely agree with you, for all the reasons you're saying. With talks in particular I think it's seldom about the raw content and ideas presented and more about the ancillary ideas they provoke and inspire, like you're describing.

There is just so much content out there. And context is everything. If the person sharing it had led with some specific ideas or thoughts I might have taken the time to watch and looked for those ideas. But in the context it was received—a quick link with no additional context—I really just wanted the "gist" to know what I was even potentially responding to.

In this case, for me, it was worth it. I can go back and decide if I want to watch it. Your comment has intrigued me so I very well might!

++ to "Slower is usually better for thinking"

22. w-m ◴[25 Jun 25 15:43 UTC] No.44378587{3}[source]▶

>>44378492 #

In the meantime I realized that the apad part is nonsensical - it pads the end of the stream, not at each silence-removed cut. I wanted to get angry at o3 for proposing this, but then I had a look at the silenceremove= documentation myself: https://ffmpeg.org/ffmpeg-filters.html#silenceremove

Good god. You couldn't make that any more convoluted and hard-to-grasp if you wanted to. You gotta love ffmpeg!

I now think this might be a good solution:

    ffmpeg -i video-audio.m4a \
           -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
           -c:a aac -b:a 128k output.m4a -y

replies(2): >>44379478 #>>44380996 #

23. georgemandis ◴[25 Jun 25 15:44 UTC] No.44378604[source]▶

>>44378337 #

Should be fixed now. Thank you!

24. topaz0 ◴[25 Jun 25 15:47 UTC] No.44378632[source]▶

>>44376989 (OP) #

I have a way that is (all but) free -- just watch the video if you care about it, or decide not to if you don't, and move on with your life.

25. pragmatic ◴[25 Jun 25 16:00 UTC] No.44378769[source]▶

>>44378345 #

No not really? The talk where he babbles about OSes and everyone is somehow impressed?

26. babuloseo ◴[25 Jun 25 16:01 UTC] No.44378780[source]▶

>>44377893 #

source?

27. Tepix ◴[25 Jun 25 16:08 UTC] No.44378872[source]▶

>>44376989 (OP) #

Why would you give up your privacy by sending what interests you to OpenAI when whisper doesn't need that much computer in the first place?

With faster-whisper (int8, batch=8) you can transcripe 13 minutes of audio in 51 seconds on CPU.

replies(2): >>44379759 #>>44384272 #

28. behnamoh ◴[25 Jun 25 16:09 UTC] No.44378890{3}[source]▶

>>44378167 #

Sure, but now the world is a better place because he shared something useful!

29. behnamoh ◴[25 Jun 25 16:13 UTC] No.44378939[source]▶

>>44378345 #

> His natural talking speed is already >=1.5x that of a normal human. One of the people you absolutely have to set your YouTube speed back down to 1x when listening to follow what's going on.

I wonder if there's a way to automatically detect how "fast" a person talks in an audio file. I know it's subjective and different people talk at different paces in an audio, but it'd be cool to kinda know when OP's trick fails (they mention x4 ruined the output; maybe for karpathy that would happen at x2).

replies(6): >>44379087 #>>44379461 #>>44379539 #>>44380162 #>>44380831 #>>44383231 #

30. ◴[25 Jun 25 16:15 UTC] No.44378971[source]▶

>>44378345 #

31. pimlottc ◴[25 Jun 25 16:16 UTC] No.44378984[source]▶

>>44376989 (OP) #

Appreciated the concise summary + code snippet upfront, followed by more detail and background for those interested. More articles should be written this way!

32. rob ◴[25 Jun 25 16:20 UTC] No.44379019[source]▶

>>44376989 (OP) #

For anybody trying to do this in bulk, instead of using OpenAI's whisper via their API, you can also use Groq [0] which is much cheaper:

[0] https://groq.com/pricing/

Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with whisper-large-v3-turbo. I believe OpenAI comes out to like ~$0.36/hr.

We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube. It uses Groq by default, but I also added support for Replicate and Deepgram as backups because sometimes Groq errors out.

replies(5): >>44379183 #>>44380152 #>>44380182 #>>44381963 #>>44384523 #

33. stogot ◴[25 Jun 25 16:23 UTC] No.44379057[source]▶

>>44376989 (OP) #

Love this idea but the accuracy section is lacking. Couldnt you do a simple diff of the outputs and see how many differences there are? .5% or 5%?

replies(1): >>44379143 #

34. 4b11b4 ◴[25 Jun 25 16:25 UTC] No.44379081{3}[source]▶

>>44378167 #

Pre-processing of the audio still a valid biz, multiple types of pre-processing might be valid

35. echelon ◴[25 Jun 25 16:25 UTC] No.44379087{3}[source]▶

>>44378939 #

> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file.

Stupid heuristic: take a segment of video, transcribe text, count number of words per utterance duration. If you need speaker diarization, handle speaker utterance durations independently. You can further slice, such as syllable count, etc.

replies(1): >>44379192 #

36. rob ◴[25 Jun 25 16:31 UTC] No.44379137[source]▶

>>44378501 #

There's a tool that uses YouTube's unofficial APIs to get them if they're available:

https://github.com/jdepoix/youtube-transcript-api

For our internal tool that transcribes local city council meetings on YouTube (often 1-3 hours long), we found that these automatic ones were never available though.

(Our tool usually 'processes' the videos within ~5-30 mins of being uploaded, so that's also why none are probably available 'officially' yet.)

So we use yt-dlp to download the highest quality audio and then process them with whisper via Groq, which is way cheaper (~$0.02-0.04/hr with Groq compared to $0.36/hr via OpenAI's API.) Sometimes groq errors out so there's built-in support for Replicate and Deepgram as well.

We run yt-dlp on our remote Linode server and I have a Python script I created that will automatically login to YouTube with a "clean" account and extract the proper cookies.txt file, and we also generate a 'po token' using another tool:

https://github.com/iv-org/youtube-trusted-session-generator

Both cookies.txt and the "po token" get passed to yt-dlp when running on the Linode server and I haven't had to re-generate anything in over a month. Runs smoothly every day.

(Note that I don't use cookies/po_token when running locally at home, it usually works fine there.)

replies(1): >>44379768 #

37. georgemandis ◴[25 Jun 25 16:32 UTC] No.44379143[source]▶

>>44379057 #

Yeah, I'd like to do a more formal analysis of the outputs if I can carve out the time.

I don't think a simple diff is the way to go, at least for what I'm interested in. What I care about more is the overall accuracy of the summary—not the word-for-word transcription.

The test I want to setup is using LLMs to evaluate the summarized output and see if the primary themes/topics persist. That's more interesting and useful to me for this exercise.

38. georgemandis ◴[25 Jun 25 16:35 UTC] No.44379183[source]▶

>>44379019 #

Interesting! At $0.02 to $0.04 an hour I don't suspect you've been hunting for optimizations, but I wonder if this "speed up the audio" trick would save you even more.

> We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube

Doesn't YouTube do this for you automatically these days within a day or so?

replies(3): >>44379336 #>>44380033 #>>44380071 #

39. nand4011 ◴[25 Jun 25 16:35 UTC] No.44379192{4}[source]▶

>>44379087 #

https://www.science.org/doi/10.1126/sciadv.aaw2594

Apparently human language conveys information at around 39 bits/s. You could use a similar technique as that paper to determine the information rate of a speaker and then correct it to 39 bits/s by changing the speed of the video.

40. mutagen ◴[25 Jun 25 16:36 UTC] No.44379201[source]▶

>>44378250 #

Not to discount slower speeds for thinking but I wonder if there is also value in dipping into a talk or a subject and then revisiting (re-watching) with the time to ponder on the thoughts a little more deeply.

replies(1): >>44379383 #

41. tmaly ◴[25 Jun 25 16:43 UTC] No.44379291[source]▶

>>44376989 (OP) #

The whisper model weights are free. You could save even more by just using them locally.

replies(1): >>44380262 #

42. conradev ◴[25 Jun 25 16:45 UTC] No.44379324[source]▶

>>44378250 #

Was it the speed or the additional information vended by the audio and video? If someone is a compelling speaker, the same message will be way more effective in an audiovisual format. The audio has emphasis on certain parts of the content, for example, which is missing from the transcript or summary entirely. Video has gestural and facial cues, also often utilized to make a point.

43. rob ◴[25 Jun 25 16:46 UTC] No.44379336{3}[source]▶

>>44379183 #

> Doesn't YouTube do this for you automatically these days within a day or so?

Oh yeah, we do a check first and use youtube-transcript-api if there's an automatic one available:

https://github.com/jdepoix/youtube-transcript-api

The tool usually detects them within like ~5 mins of being uploaded though, so usually none are available yet. Then it'll send the summaries to our internal Slack channel for our editors, in case there's anything interesting to 'follow up on' from the meeting.

Probably would be a good idea to add a delay to it and wait for the automatic ones though :)

44. tass ◴[25 Jun 25 16:50 UTC] No.44379383{3}[source]▶

>>44379201 #

This is similar to strategies in “how to read a book” (Adler).

By understanding the outline and themes of a book (or lecture, I suppose), it makes it easier to piece together thoughts as you delve deeper into the full content.

45. 55555 ◴[25 Jun 25 16:53 UTC] No.44379409[source]▶

>>44376989 (OP) #

This seems like a good place for me to complain about the fact that the automatically generated subtitle files Youtube creates are horribly malformed. Every sentence is repeated twice. In many subtitle files, the subtitle timestamp ranges overlap one another while also repeating every sentence twice in two different ranges. It's absolutely bizarre and has been like this for years or possibly forever. Here's an example - I apologize that it's not in English. I don't know if this issue affects English. https://pastebin.com/raw/LTBps80F

replies(1): >>44383686 #

46. isaacremuant ◴[25 Jun 25 16:54 UTC] No.44379429{3}[source]▶

>>44378391 #

> We're willingly heading for such stupidity, I don't get it. I don't get how we can all be so blind at what this is going to create.

Your doomerism and superiority doesn't follow from your initial "I like many hackers don't like one size fits all".

This is literally offering you MANY sizes and you have the freedom to choose. Somehow you're pretending pushed down uniformity.

Consume it however you want and come up with actual criticisms next time?

47. varispeed ◴[25 Jun 25 16:57 UTC] No.44379461{3}[source]▶

>>44378939 #

It's a shame platforms don't generally support speeds greater than 2x. One of my "superpowers" or a curse is that I cannot stand normal speaking pace. When I watch lectures, I always go for maximum speed and that still is too slow for me. I wish platforms have included 4x but done properly (with minimal artefacts).

replies(9): >>44379513 #>>44379536 #>>44379612 #>>44379810 #>>44379982 #>>44380594 #>>44380830 #>>44381970 #>>44384356 #

48. snickerdoodle12 ◴[25 Jun 25 16:58 UTC] No.44379478{4}[source]▶

>>44378587 #

I love ffmpeg but the documentation is often close to incomprehensible.

49. lofaszvanitt ◴[25 Jun 25 17:00 UTC] No.44379513{4}[source]▶

>>44379461 #

Robot in a human body identified :D.

50. amelius ◴[25 Jun 25 17:02 UTC] No.44379528[source]▶

>>44376989 (OP) #

Solution: charge by number of characters generated.

51. mrmuagi ◴[25 Jun 25 17:02 UTC] No.44379536{4}[source]▶

>>44379461 #

All audiobooks are like this for me. I tried it for lectures but if I'm taking handwritten notes, I can't keep up my writing.

I wonder if there is negative side effects of this though, do you notice when interacting with people who speak slower require a greater deal of patience?

replies(3): >>44379957 #>>44380513 #>>44383539 #

52. btown ◴[25 Jun 25 17:02 UTC] No.44379539{3}[source]▶

>>44378939 #

Even a last-decade transcription model could be used to detect a rough number of syllables per unit time, and the accuracy of that model could be used to guide speed-up and dead-time detection before sending to a more expensive model. As with all things, it's a question of whether the cost savings justify the engineering work.

53. dpcx ◴[25 Jun 25 17:09 UTC] No.44379612{4}[source]▶

>>44379461 #

https://github.com/codebicycle/videospeed has been a wonderful addition for me.

54. dataviz1000 ◴[25 Jun 25 17:10 UTC] No.44379634[source]▶

>>44376989 (OP) #

I built a Chrome extension with one feature that transcribes audio to text in the browser using huggingface/transformers.js running the OpenAI Whisper model with WebGPU. It works perfect! Here is a list of examples of all the things you can do in the browser with webgpu for free. [0]

The last thing in the world I want to do is listen or watch presidential social media posts, but, on the other hand, sometimes enormously stupid things are said which move the SP500 up or down $60 in a session. So this feature queries for new posts every minute, does ORC image to text and transcribe video audio to text locally, sends the post with text for analysis, all in the background inside a Chrome extension before notify me of anything economically significant.

[0] https://github.com/huggingface/transformers.js/tree/main/exa...

[1] https://github.com/adam-s/doomberg-terminal

replies(1): >>44380010 #

55. bongodongobob ◴[25 Jun 25 17:22 UTC] No.44379750[source]▶

>>44378250 #

You'd love where I work. Everything is needlessly long bloviating power point meetings that could easily be ingested in a 5 minute email.

56. karpathy ◴[25 Jun 25 17:23 UTC] No.44379755[source]▶

>>44376989 (OP) #

Omg long post. TLDR from an LLM for anyone interested

Speed your audio up 2–3× with ffmpeg before sending it to OpenAI’s gpt-4o-transcribe: the shorter file uses fewer input-tokens, cuts costs by roughly a third, and processes faster with little quality loss (4× is too fast). A sample yt-dlp → ffmpeg → curl script shows the workflow.

;)

replies(3): >>44379800 #>>44379806 #>>44386599 #

57. anigbrowl ◴[25 Jun 25 17:23 UTC] No.44379759[source]▶

>>44378872 #

I came here to ask the same question. This is a well-solved problem, red queen racing it seems utterly pointless, a symptom of reflexive adversarialism.

replies(1): >>44381865 #

58. fallinditch ◴[25 Jun 25 17:24 UTC] No.44379768{3}[source]▶

>>44379137 #

Very useful, thanks. So does this mean that every month or so you have to create a new 'clean' YouTube account and use that to create new po_token/cookies?

It's frustrating to have to jump through all these hoops just to extract transcripts when the YouTube Data API already gives reasonable limits to free API calls ... would be nice if they allowed transcripts too.

Do you think the various YouTube transcript extractor services all follow a similar method as yours?

59. bravesoul2 ◴[25 Jun 25 17:29 UTC] No.44379800[source]▶

>>44379755 #

This is the sort of content I want to see in Tweets and LinkedIn posts.

I have been thinking for a while how do you make good use of the short space in those places.

LLM did well here.

60. georgemandis ◴[25 Jun 25 17:29 UTC] No.44379806[source]▶

>>44379755 #

Hahaha. Okay, okay... I will watch it now ;)

(Thanks for your good sense of humor)

replies(1): >>44379953 #

61. ◴[25 Jun 25 17:29 UTC] No.44379810{4}[source]▶

>>44379461 #

62. hn8726 ◴[25 Jun 25 17:33 UTC] No.44379840{3}[source]▶

>>44378167 #

Or openai will do it themselves for transcription tasks

63. xg15 ◴[25 Jun 25 17:41 UTC] No.44379919[source]▶

>>44376989 (OP) #

That's really cool! Also, isn't this effectively the same as supplying audio with a sampling rate of 8kHz instead of the 16kHz that the model is supposed to work with?

64. karpathy ◴[25 Jun 25 17:44 UTC] No.44379953{3}[source]▶

>>44379806 #

I like that your post deliberately gets to the point first and then (optionally) expands later, I think it's a good and generally underutilized format. I often advise people to structure their emails in the same way, e.g. first just cutting to the chase with the specific ask, then giving more context optionally below.

It's not my intention to bloat information or delivery but I also don't super know how to follow this format especially in this kind of talk. Because it's not so much about relaying specific information (like your final script here), but more as a collection of prompts back to the audience as things to think about.

My companion tweet to this video on X had a brief TLDR/Summary included where I tried, but I didn't super think it was very reflective of the talk, it was more about topics covered.

Anyway, I am overall a big fan of doing more compute at the "creation time" to compress other people's time during "consumption time" and I think it's the respectful and kind thing to do.

replies(2): >>44380436 #>>44382911 #

65. colechristensen ◴[25 Jun 25 17:44 UTC] No.44379957{5}[source]▶

>>44379536 #

No but a little. I struggle with people who repeat every point of what they're saying to you several times or when you say "you told me exactly this the last time we spoke" they cannot be stopped from retelling the whole thing verbatim. Usually in those situations though there's some potential cognitive issues so you can only be understanding.

66. narratives1 ◴[25 Jun 25 17:48 UTC] No.44379982{4}[source]▶

>>44379461 #

I use a Chrome extension that lets you take any video player (including embedded) to 10x speed. Turn most things to 3-4x. It works on ads too

replies(1): >>44380424 #

67. anshumankmr ◴[25 Jun 25 17:49 UTC] No.44379999[source]▶

>>44376989 (OP) #

Someone should try transcribing Eminem's Rap god with this trick.

68. alok-g ◴[25 Jun 25 17:50 UTC] No.44380004[source]▶

>>44376989 (OP) #

>> by jumping straight to the point ...

Love this! I wish more authors follow this approach. So many articles keep going all over the place before 'the point' appears.

If trying, perhaps some 50% of the authors may realize that they don't _have_ a point.

69. kgc ◴[25 Jun 25 17:51 UTC] No.44380010[source]▶

>>44379634 #

Impressive

70. jerjerjer ◴[25 Jun 25 17:53 UTC] No.44380033{3}[source]▶

>>44379183 #

> I wonder if this "speed up the audio" trick would save you even more.

At this point you'll need to at least check how much running ffmpeg costs. Probably less than $0.01 per hour of audio (approximate savings) but still.

71. colechristensen ◴[25 Jun 25 17:56 UTC] No.44380070{3}[source]▶

>>44378391 #

University didn't agree with me mostly because I can't pay attention to the average lecturer. Getting bored in between words or while waiting for them to write means I absorbed very little and had to teach myself nearly everything.

Audiobooks before speed tools were the worst (are they trying to speak extra slow?) But when I can speed things up comprehension is just fine.

72. ks2048 ◴[25 Jun 25 17:56 UTC] No.44380071{3}[source]▶

>>44379183 #

> Doesn't YouTube do this for you automatically these days within a day or so?

Last time I checked, I think the Google auto-captions were noticeably worse quality than whisper, but maybe that has changed.

73. pknerd ◴[25 Jun 25 18:00 UTC] No.44380109[source]▶

>>44376989 (OP) #

I guess it'd work even if you make it 2.5 or evebn 3x.

74. colechristensen ◴[25 Jun 25 18:03 UTC] No.44380152[source]▶

>>44379019 #

If you have a recent macbook you can run the same whisper model locally for free. People are really sleeping on how cheap the compute you own hardware for already is.

replies(2): >>44380229 #>>44384418 #

75. janalsncm ◴[25 Jun 25 18:04 UTC] No.44380162{3}[source]▶

>>44378939 #

> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file

Transcribe it locally using whisper and output tokens/sec?

replies(1): >>44381453 #

76. pzo ◴[25 Jun 25 18:06 UTC] No.44380182[source]▶

>>44379019 #

there is also cloudflare workers ai where you can have whisper-large-v3-turbo for around $0.03 per hour:

https://developers.cloudflare.com/workers-ai/models/whisper-...

77. rob ◴[25 Jun 25 18:10 UTC] No.44380229{3}[source]▶

>>44380152 #

I don't. I have a MacBook Pro from 2019 with an Intel chip and 16 GB of memory. Pretty sure when I tried the large whisper model it took like 30 minutes to an hour to do something that took hardly any time via Groq. It's been a while though so maybe my times are off.

replies(2): >>44380449 #>>44380467 #

78. pzo ◴[25 Jun 25 18:13 UTC] No.44380262[source]▶

>>44379291 #

but this is still great trick if you want to reduce latency or inference speed even with local models e.g. in realtime chatbot

79. bisby ◴[25 Jun 25 18:17 UTC] No.44380311{3}[source]▶

>>44378391 #

> I, like many hackers, hated school because they just threw one-size-fits-all knowledge at you

"This specific knowledge format doesnt work for me, so I'm asking OpenAI to convert this knowledge into a format that is easier for me to digest" is exactly what this is about.

I'm not quite sure what you're upset about? Unless you're referring to "one size fits all knowledge" as simplified topics, so you can tackle things at a surface level? I love having surface level knowledge about a LOT of things. I certainly don't have time to have go deep on every topic out there. But if this is a topic I find I am interested in, the full talk is still available.

Breadth and depth are both important, and well summarized talks are important for breadth, but not helpful at all for depth, and that's ok.

80. itsoktocry ◴[25 Jun 25 18:27 UTC] No.44380419[source]▶

>>44378250 #

>Slower is usually better for thinking.

Yeah, I see people talking about listening to podcasts or audiobooks on 2x or 3x.

Sometimes I set mine to 0.8x. I find you get time to absorb and think. Am I an outlier?

replies(1): >>44383195 #

81. munch117 ◴[25 Jun 25 18:27 UTC] No.44380424{5}[source]▶

>>44379982 #

I use a bookmarklet:

javascript:void%20function(){document.querySelector(%22video,audio%22).playbackRate=parseFloat(prompt(%22Set%20the%20playback rate%22))}();

82. georgemandis ◴[25 Jun 25 18:29 UTC] No.44380436{4}[source]▶

>>44379953 #

I watched your talk. There are so many more interesting ideas in there that resonated with me that the summary (unsurprisingly) skipped over. I'm glad I watched it!

LLMs as the operating system, the way you interface with vibe-coding (smaller chunks) and the idea that maybe we haven't found the "GUI for AI" yet are all things I've pondered and discussed with people. You articulated them well.

I think some formats, like a talk, don't lend themselves easily to meaningful summaries. It's about giving the audience things to think about, to your point. It's the sum of storytelling that's more than the whole and why we still do it.

My post is, at the end of the day, really more about a neat trick to optimize transcriptions. This particular video might be a great example of why you may not always want to do that :)

Anyway, thanks for the time and thanks for the talk!

83. colechristensen ◴[25 Jun 25 18:30 UTC] No.44380449{4}[source]▶

>>44380229 #

Ah, no, Apple silicon Mac required with a decent amount of memory. But this kind of machine has been very common (a mid to high range recent macbook) at all of my employers for a long time.

84. fragmede ◴[25 Jun 25 18:32 UTC] No.44380467{4}[source]▶

>>44380229 #

It's been roughly six years since that MacBook was top of the line, so your times are definitely off.

85. hamburglar ◴[25 Jun 25 18:36 UTC] No.44380513{5}[source]▶

>>44379536 #

I once attended a live talk by Leslie Lamport and as he talked, I had the overwhelming feeling that something was wrong, and was thinking “did he have a stroke or something?” but then I realized I had just always watched his lectures online and had become accustomed to listening to him at 2x.

86. ilyakaminsky ◴[25 Jun 25 18:39 UTC] No.44380550{3}[source]▶

>>44378167 #

I've already done that [1]. A fraction of the price, 24-hour limit per file, and speedup tricks like the OP's are welcome. :)

[1] https://speechischeap.com

replies(1): >>44382158 #

87. cookingrobot ◴[25 Jun 25 18:44 UTC] No.44380594{4}[source]▶

>>44379461 #

There are fonts designed to be legibly at really small size. I wonder if there are voices that are especially understandable at extreme speeds.

Could use an “auctioneer” voice to playback text at 10x speed.

replies(1): >>44381442 #

88. seabass ◴[25 Jun 25 19:06 UTC] No.44380830{4}[source]▶

>>44379461 #

I made a super simplistic chrome extension for this. Doesn’t work on all websites, but YouTube and most online video courses are covered.

https://github.com/sebastiansandqvist/video-speed-extension

89. mrstone ◴[25 Jun 25 19:06 UTC] No.44380831{3}[source]▶

>>44378939 #

> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file.

Hilbert transform and FFT to get phoneme rate would work.

90. brunoborges ◴[25 Jun 25 19:12 UTC] No.44380884[source]▶

>>44378345 #

The interesting thing here is that OpenAI likely has a layer that trims down videos exactly how you suggest, so they can still charge by the full length while costing less for them to actually process the content.

91. cbsmith ◴[25 Jun 25 19:14 UTC] No.44380906[source]▶

>>44378345 #

That's an amusing perspective. I really struggle with watching any video at double speed, but I've never had trouble listening to any of his talks at 1x. To me, he seems to speak at a perfectly reasonable pace.

92. squigz ◴[25 Jun 25 19:26 UTC] No.44380996{4}[source]▶

>>44378587 #

Out of curiosity, how might you improve those docs? They seem fairly reasonable to me

replies(1): >>44381944 #

93. BHSPitMonkey ◴[25 Jun 25 19:44 UTC] No.44381178[source]▶

>>44378334 #

As if human-generated transcriptions of audio ever came with guarantees of accuracy?

This kind of transformation has always come with flaws, and I think that will continue to be expected implicitly. Far more worrying is the public's trust in _interpretations_ and claims of _fact_ produced by gen AI services, or at least the popular idea that "AI" is more trustworthy/unbiased than humans, journalists, experts, etc.

replies(1): >>44383420 #

94. donkey_brains ◴[25 Jun 25 19:58 UTC] No.44381276[source]▶

>>44376989 (OP) #

Hmm…doesn’t this technique effectively make the minute longer, not shorter? Because you can pack more speech into a minute of recording? Seems like making a minute shorter would be counterproductive.

replies(1): >>44381538 #

95. swyx ◴[25 Jun 25 20:07 UTC] No.44381352[source]▶

>>44378345 #

> I didn't look at all at the quality of the transcription by feeding it the shorter version.

guys how hard is it to toss both versions into like diffchecker or something haha youre just comparing text

replies(1): >>44382070 #

96. bbatha ◴[25 Jun 25 20:17 UTC] No.44381442{5}[source]▶

>>44380594 #

I'm also a fast listener. I find audio quality is the main differentiator in my ability to listen quickly or not. A podcast recorded at high quality I can listen to at 3-4x (with silence trimmed) comfortably, the second someone calls in from their phone I'm getting every 4th word and often need to go down to 2x or less. Mumbly accents are also a driver of quality but not as much, then again I rarely have trouble understanding difficult accents IRL and almost never use subtitles on TV shows/youtube to better understand the speaker. Your mileage may vary.

I understand 4-6x speakers fairly well but don't enjoy listening at that pace. If I lose focus for a couple of seconds I effectively miss a paragraph of context and my brain can't fill in the missing details.

97. maxall4 ◴[25 Jun 25 20:19 UTC] No.44381453{4}[source]▶

>>44380162 #

Just count syllables per second by doing an FFT plus some basic analysis.

replies(1): >>44386128 #

98. StochasticLi ◴[25 Jun 25 20:29 UTC] No.44381538[source]▶

>>44381276 #

No. You're paying for a minute of audio, which will be more packed with speech, not for how long it's being computed.

99. banana_giraffe ◴[25 Jun 25 20:44 UTC] No.44381640[source]▶

>>44378501 #

You can use yt-dlp to get the transcripts. For instance, to grab just the transcript of a video:

    ./yt-dlp --skip-download --write-sub --write-auto-sub --sub-lang en --sub-format json3 <youtube video URL>

You can also feed the same command a playlist or channel URL and it'll run through and grab all the transcripts for each video in the playlist or channel.

replies(1): >>44382282 #

100. impossiblefork ◴[25 Jun 25 21:03 UTC] No.44381816[source]▶

>>44376989 (OP) #

Make the minutes longer, you mean.

101. poly2it ◴[25 Jun 25 21:09 UTC] No.44381865{3}[source]▶

>>44379759 #

> symptom of reflexive adversarialism

Is there a definition for this expression? I don't catch you.

> ... using corporate technology for the solved problem is a symptom of self-directed skepticism by the user against the corporate institutions ...

Eh?

102. w-m ◴[25 Jun 25 21:21 UTC] No.44381944{5}[source]▶

>>44380996 #

The documentation reads like it was written by a programmer who documented the different parameters to their implementation of a specific algorithm. Now when you as the user come along and want to use silenceremove, you'll have to carefully read through this, and build your own mental model of that algorithm, and then you'll be able to set these parameters accordingly. That takes a lot of time and energy, in this case multiple read-throughs and I'd say > 5 minutes.

Good documentation should do this work for you. It should explain somewhat atomic concepts to you, that you can immediately adapt, and compose. Where it already works is for the "detection" and "window" parameters, which are straightforward. But the actions of trimming in the start/middle/end, and how to configure how long the silence lasts before trimming, whether to ignore short bursts of noise, whether to skip every nth silence period, these are all ideas and concepts that get mushed together in 10 parameters which are called start/stop-duration/threshold/silence/mode/periods.

If you want to apply this filter, it takes a long time to build mental models for these 10 parameters. You do have some example calls, which is great, but which doesn't help if you need to adjust any of these - then you probably need to understand them all.

Some stuff I stumbled over when reading it:

"To remove silence from the middle of a file, specify a stop_periods that is negative. This value is then treated as a positive value [...]" - what? Why is this parameter so heavily overloaded?

"start_duration: Specify the amount of time that non-silence must be detected before it stops trimming audio" - parameter is named start_something, but it's about stopping? Why?

"start_periods: [...] Normally, [...] start_periods will be 1 [...]. Default value is 0."

"start_mode: Specify mode of detection of silence end at start": start_mode end at start?

It's very clunky. Every parameter has multiple modes of operation. Why is it start and stop for beginning and end, and why is "do stuff in the middle" part of the end? Why is there no global mode?

You could nitpick this stuff to death. In the end, naming things is famously one of the two hard problems in computer science (the others being cache invalidation and off-by-one errors). And writing good documentation is also very, very hard work. Just exposing the internals of the algorithm is often not great UX, because then every user has to learn how the thing works internally before they can start using it (hey, looking at you, git).

So while it's easy to point out where these docs fail, it would be a lot of work to rewrite this documentation from the top down, explaining the concepts first. Or even rewriting the interface to make this more approachable, and the parameters less overloaded. But since it's hard work, and not sexy to programmers, it won't get done, and many people will come after, having to spend time on reading and re-reading this current mess.

replies(2): >>44386272 #>>44386483 #

103. abidlabs ◴[25 Jun 25 21:24 UTC] No.44381963[source]▶

>>44379019 #

You could use Hugging Face's Inference API (which supports all of these API providers) directly making it easier to switch between them, e.g. look at the panel on the right on: https://huggingface.co/openai/whisper-large-v3

104. JadeNB ◴[25 Jun 25 21:25 UTC] No.44381970{4}[source]▶

>>44379461 #

Can't you use VLC to watch almost anything streamable, and then play at your desired speed?

105. TimorousBestie ◴[25 Jun 25 21:41 UTC] No.44382070{3}[source]▶

>>44381352 #

Why use diffchecker when there’s a perfectly good LLM you could ask right there? lol

replies(2): >>44382758 #>>44383232 #

106. pbbakkum ◴[25 Jun 25 21:55 UTC] No.44382153[source]▶

>>44376989 (OP) #

This is great, thank you for sharing. I work on these APIs at OpenAI, it's a surprise to me that it still works reasonably well at 2/3x speed, but on the other hand for phone channels we get 8khz audio that is upsampled to 24khz for the model and it still works well. Note there's probably a measurable decrease in transcription accuracy that worsens as you deviate from 1x speed. Also we really need to support bigger/longer file uploads :)

replies(2): >>44382203 #>>44384158 #

107. bravesoul2 ◴[25 Jun 25 21:55 UTC] No.44382158{4}[source]▶

>>44380550 #

Nice. Don't expect you to spill the beans but is it doing OK (some customers?)

Just wondering if I cam build a retirement out of APIs :)

replies(1): >>44384932 #

108. lovestory ◴[25 Jun 25 21:56 UTC] No.44382161{4}[source]▶

>>44378472 #

Or just use notebookLM to convert your books into an hour long podcasts /s

replies(1): >>44382862 #

109. nerder92 ◴[25 Jun 25 22:02 UTC] No.44382203[source]▶

>>44382153 #

Quick Feedback: Would it be cool to research this internally and maybe find a sweet spot in speed multiplier where the loss is minimal. This pre-processing is quite cheap and could bring down the API price eventually.

110. fallinditch ◴[25 Jun 25 22:14 UTC] No.44382282{3}[source]▶

>>44381640 #

That's cool, thanks for the info. But do you also have to use a rotating proxy to prevent YouTube from blocking your IP address?

replies(1): >>44382408 #

111. celltalk ◴[25 Jun 25 22:24 UTC] No.44382331[source]▶

>>44376989 (OP) #

With this logic, you should also be able to trim the parts that doesn’t have words. Just add a cut-off for db, and trim the video before transcription.

Possibly another 10-20% gain?

112. banana_giraffe ◴[25 Jun 25 22:37 UTC] No.44382408{4}[source]▶

>>44382282 #

Last time I ran this at scale was a couple of months ago, so my information is no doubt out of date, but in my experience, YouTube seems less concerned about this than they are when you're grabbing lots of videos.

But that was a few months ago, so for all I know they've tightened down more hatches since then.

113. isubkhankulov ◴[25 Jun 25 22:50 UTC] No.44382511[source]▶

>>44376989 (OP) #

Transcripts get much more valuable when one diarizes the audio beforehand to determine which speaker said what.

I use this free tool to extract those and dump the transcripts into a LLM with basic prompts: https://contentflow.megalabs.co

114. mt_ ◴[25 Jun 25 23:05 UTC] No.44382623[source]▶

>>44376989 (OP) #

You can just dump the youtube link video in Google AI studio and ask it to transcribe the video with speaker labels and even ask it it to add useful visual clues, because the model is multimodal for video too.

replies(1): >>44383325 #

115. serf ◴[25 Jun 25 23:28 UTC] No.44382758{4}[source]▶

>>44382070 #

because a lot of LLMs will just eat tokens to call a diffchecker.

really it becomes a question of whether or not the friction of invoking the command or the cost of tokens is greater.

as I get older and more rsi'd the tokens seem cheaper.

116. QuantumGood ◴[25 Jun 25 23:32 UTC] No.44382788[source]▶

>>44378345 #

I wish there was a 2.25x YouTube option for "normal" humans. I already use every shortcut, and listen at 2x 90% of the time. But Andrej I can't take faster than 1.25x

replies(3): >>44383398 #>>44384354 #>>44386815 #

117. 0cf8612b2e1e ◴[25 Jun 25 23:47 UTC] No.44382862{5}[source]▶

>>44382161 #

I am genuinely curious how well this would go. There are so many books I “should” read, but will never get around to doing it. A one hour podcast would be more engaging than reading a Wikipedia summary.

On the gripping hand, there are probably already excellent 10/30/60 minute book summaries on YouTube or wherever which are not going to hallucinate plot points.

118. nickjj ◴[25 Jun 25 23:48 UTC] No.44382864[source]▶

>>44378345 #

Andrej's talk seemed normal to listen at 2x but I've also listened to everything at 2x for a long time.

Unfortunately a byproduct of listening to everything at 2x is I've had a number of folks say they have to watch my videos at 0.75x but even when I play back my own videos it feels painfully slow unless it's 2x.

For reference I've always found John Carmack's pacing perfect / natural and watchable at 2x too.

A recent video of mine is https://www.youtube.com/watch?v=pL-qft1ykek. It was posted on HN by someone else the other day so I'm not trying to do any self promotion here, it's just an example of a recent video I put up and am generally curious if anyone finds that too fast or it's normal. It's a regular unscripted video where I have a rough idea of what I want to cover and then turn on the mic, start recording and let it pan out organically. If I had to guess I'd say the last ~250-300 videos were recorded this way.

replies(8): >>44383021 #>>44383169 #>>44383237 #>>44383507 #>>44383753 #>>44383906 #>>44385284 #>>44386182 #

119. mh- ◴[25 Jun 25 23:53 UTC] No.44382911{4}[source]▶

>>44379953 #

> I often advise people to structure their emails [..]

I frequently do the same, and eventually someone sent me this HBR article summarizing the concept nicely as "bottom line up front". It's a good primer for those interested.

https://hbr.org/2016/11/how-to-write-email-with-military-pre...

120. noahjk ◴[26 Jun 25 00:11 UTC] No.44383021{3}[source]▶

>>44382864 #

To me you talk at what I would consider "1.2x" of podcast speed (which to me is a decent average measure of spoken word speed - I usually do 1.5x on all podcasts). You're definitely still in the normal distribution for tech YouTubers, in my experience - in fact it feels like a lot of tech YouTube talks like they've had a bit too much adderall, but you don't come off that way. Naturally people may choose to slow down tutorials, because the person giving the tutorial can never truly understand what someone learning would or wouldn't understand. So overall I think your speed is totally fine! Also, very timely video, I was interested in the exact topic, so I'm happy I found this.

replies(1): >>44383355 #

121. cprayingmantis ◴[26 Jun 25 00:33 UTC] No.44383129[source]▶

>>44376989 (OP) #

I noticed something similar with images as inputs to Claude, you can scale down the images and still get good outputs. There is an accuracy drop off at a certain point but the token savings are worth doing a little tuning there.

122. LanceH ◴[26 Jun 25 00:38 UTC] No.44383148{4}[source]▶

>>44378472 #

Read the title and go.

123. SavioMak ◴[26 Jun 25 00:42 UTC] No.44383169{3}[source]▶

>>44382864 #

Yeah, you sound around 1.25-1.5x than the average videos I watch

124. meerab ◴[26 Jun 25 00:43 UTC] No.44383178[source]▶

>>44376989 (OP) #

Interesting approach to transcript generation!

I'm implementing a similar workflow for VideoToBe.com

My Current Pipeline:

Media Extraction - yt-dlp for reliable video/audio downloads Local Transcription - OpenAI Whisper running on my own hardware (no API costs) Storage & UI - Transcripts stored in S3 with a custom web interface for viewing

Y Combinator playlist https://videotobe.com/play/playlist/ycombinator

and Andrej's talk is https://videotobe.com/play/youtube/LCEmiRjPEtQ

After reading your blog post, I will be testing effect on speeding audio for locally-hosted Whisper models. Running Whisper locally eliminates the ongoing cost concerns since my infrastructure is already a sunk cost. Speeding audio could be an interesting performance enhancement to explore!

125. LanceH ◴[26 Jun 25 00:46 UTC] No.44383195{3}[source]▶

>>44380419 #

Depends on what you're listening to. If it's a recap of something and you're just looking for the answer to "what happened?", that can be fine for 2x. If you're getting into the "why?" maybe slower is better. Or if there are a lot of players involved.

I'm trying to imagine listening to War and Peace faster. On the one hand, there are a lot of threads and people to keep track of (I had a notepad of who is who). On the other hand, having the stories compressed in time might help remember what was going on with a character when finally returning to them.

Listening to something like Dune quickly, someone might come out only thinking of the main political thrusts, and the action, without building that same world in their mind they would if read slower.

126. WalterSear ◴[26 Jun 25 00:54 UTC] No.44383231{3}[source]▶

>>44378939 #

Better: just make everyone in the video speak at my comfortable speed.

127. trashchomper ◴[26 Jun 25 00:54 UTC] No.44383232{4}[source]▶

>>44382070 #

Assuming sarcasm but if not, because deterministic vs. nondeterministic output?

replies(1): >>44383836 #

128. viraptor ◴[26 Jun 25 00:55 UTC] No.44383237{3}[source]▶

>>44382864 #

> Andrej's talk seemed normal to listen at 2x but I've also listened to everything at 2x for a long time.

We get used to higher speeds when we consume a lot of content that way. Have you heard the systems used by experienced blind people? I cannot even understand the words in them, but months of training would probably fix that.

replies(1): >>44383518 #

129. MaxDPS ◴[26 Jun 25 01:09 UTC] No.44383325[source]▶

>>44382623 #

Can I ask what you mean by “useful visual clues”?

replies(1): >>44384694 #

130. eru ◴[26 Jun 25 01:16 UTC] No.44383355{4}[source]▶

>>44383021 #

> "[I]n fact it feels like a lot of tech YouTube talks like they've had a bit too much adderall, [...]"

Funnily enough, if you actually have ADHD, then stimulants like adderall or even nicotine, will calm you down.

> Naturally people may choose to slow down tutorials, [...]

For me it also depends on what mood I'm in and whether I'm doing anything else at the same time. If I'm fully concentrating on a video, 2x is often fine. If I'm doing some physical task at the same time, I need it slower than that.

If I'm doing a mental task at the same, I can forget about getting anything out of the video. At least, if the mental task involves any words. So eg I could probably still follow along a technical discussion at roughly 1x speed while playing Tetris, but not while coding.

replies(1): >>44383644 #

131. zamadatix ◴[26 Jun 25 01:24 UTC] No.44383398{3}[source]▶

>>44382788 #

YouTube ran an experiment with up to 4x playback on mobile (???) but it went away in February. I get a lot of the experiments they do being experiments but why just allowing the slider to go farther is such a back and forth hoopla is beyond me. It's one of the oft touted features of 3rd party apps and extensions with nearly 0 UI impact to those who don't want to use it (just don't slide the slider past 2x if you don't want past 2x).

https://www.theverge.com/news/603581/youtube-premium-experim...

replies(2): >>44383784 #>>44385490 #

132. angst ◴[26 Jun 25 01:28 UTC] No.44383420{3}[source]▶

>>44381178 #

at least human-generated transcriptions have entities that we can hold responsible for...

replies(1): >>44383704 #

133. canyp ◴[26 Jun 25 01:29 UTC] No.44383424[source]▶

>>44378132 #

Came here just for this.

134. fuzztester ◴[26 Jun 25 01:30 UTC] No.44383427[source]▶

>>44376989 (OP) #

Stop being slaves of extorters of any kind, and just leave.

there is tons of this happening everywhere, and we need to fight this, and boycott it.

135. userbinator ◴[26 Jun 25 01:46 UTC] No.44383507{3}[source]▶

>>44382864 #

but even when I play back my own videos it feels painfully slow unless it's 2x.

Watching your video at 1x still feels too slow, and it's just right for me at 2x speed (that's approximately how fast I normally talk if others don't tell me to slow down), although my usual YouTube watching speed is closer to 2.5-3x. That is to say, you're still faster than a lot of others.

I think it just takes practice --- I started at around 1.25x for videos, and slowly moved up from there. As you have noticed, once you've consumed enough sped-up content, your own speaking speed will also naturally increase.

136. userbinator ◴[26 Jun 25 01:49 UTC] No.44383518{4}[source]▶

>>44383237 #

You can achieve a similar, less permanent effect by closing your eyes; I often do it when I'm on a call and the person on the other end is extremely difficult to understand.

137. userbinator ◴[26 Jun 25 01:53 UTC] No.44383539{5}[source]▶

>>44379536 #

I wonder if there is negative side effects of this though, do you notice when interacting with people who speak slower require a greater deal of patience?

You are basically training your brain to work faster, and I suspect that causes some changes in the structure of your memory; if someone speaks too slowly, I'll be more likely to forget what they said earlier, compared to if they quickly gave me the entire sentence.

138. Tyr42 ◴[26 Jun 25 02:15 UTC] No.44383644{5}[source]▶

>>44383355 #

Driving is a hard 1.0 for me. But otherwise 2.0 is good.

139. xenator ◴[26 Jun 25 02:22 UTC] No.44383686[source]▶

>>44379409 #

Seems like Thai. Thai translation and recognition is like 10 years ago comparing to other languages I'm dealing with in my everyday life. Good news tho is the same level was for Russian years ago, and now it is near perfect.

replies(1): >>44384176 #

140. _kb ◴[26 Jun 25 02:25 UTC] No.44383704{4}[source]▶

>>44383420 #

That still holds true for gen-AI. Organisations that provide transcription services can’t offload responsibility to a language model any more than they can to steno keyboard manufacturers.

If you are the one feeding content to a model then you are that responsible entity.

141. pottertheotter ◴[26 Jun 25 02:28 UTC] No.44383717[source]▶

>>44376989 (OP) #

You can just ask Gemini to summarize it for you. It's free. I do it all the time with YouTube videos.

Or you can just copy the transcript that YouTube provides below the video.

142. fuzztester ◴[26 Jun 25 02:34 UTC] No.44383753{3}[source]▶

>>44382864 #

James Goodnight of SAS Institute:

https://en.m.wikipedia.org/wiki/James_Goodnight

I have watched one or two videos of his, and he spoke slowly, compared to the average person. I liked that. It sounded good.

143. K2L8M11N2 ◴[26 Jun 25 02:42 UTC] No.44383784{4}[source]▶

>>44383398 #

As a premium subscriber I currently have 4x available on Android and they recently (in the last month) added it to web too

144. TimorousBestie ◴[26 Jun 25 02:57 UTC] No.44383836{5}[source]▶

>>44383232 #

Not sarcasm, just a little joke. I thought the emote at the end would prevent it from being taken seriously. . .

145. makeitdouble ◴[26 Jun 25 03:18 UTC] No.44383906{3}[source]▶

>>44382864 #

Your video sounded a tad fast at 2x and pretty fine at 1.5.

Now I think speed adjustment come less from the natural speaking pace of the person than the subject matter.

I'm thinking of a channel like Accented Cinema (https://youtu.be/hfruMPONaYg), with a slowish talking pace, but as there's all the visual part going on at all times, it actually doesn't feel slow to my ear.

I felt the same for videos explaining concept I have no familiarity with, so I see as how fast the brain can process the info, less than the talking speed per se.

146. georgemandis ◴[26 Jun 25 04:21 UTC] No.44384158[source]▶

>>44382153 #

I kind of want to take a more proper poke at this but focus more one summarization accuracy over word-for-word accuracy, though I see the value in both.

I'm actually curious, if I run transcriptions back-to-back-to-back on the exact same audio, how much variance should I expect?

Maybe I'll try three approaches:

- A straight diff comparison (I know a lot of people are calling for this, but I really think this is less useful than it sounds)

- A "variance within the modal" test running it multiple times against the same audio, tracking how much it varies between runs

- An LLM analysis assessing if the primary points from a talk were captured and summarized at 1x, 2x, 3x, 4x runs (I think this is far more useful and interesting)

147. 55555 ◴[26 Jun 25 04:27 UTC] No.44384176{3}[source]▶

>>44383686 #

Well the weird thing is honestly their speech to text recognizes 97% of words correctly. The subtitle content is pretty perfect. It’s just the formatting that’s awful.

148. ProllyInfamous ◴[26 Jun 25 04:49 UTC] No.44384272[source]▶

>>44378872 #

I am a blue collar electrician. Not a coder (but definitely geeky).

Whisper works quite well on Apple Silicon with simple drag/drop install (i.e. no terminal commands). Program is free; you can get an M4 mini for ~$550; don't see how an online platform can even compete with this, except for one-off customers (i.e. not great repeat customers).

We used it to transcribe ddaayyss of audio microcassettes which my mother had made during her lifetime. Whisper.app even transcribed a few hours that are difficult to comprehend as a human listener. It is VERY fast.

I've used the text to search for timestamps worth listening to, skipping most dead-space (e.g. she made most while driving, in a stream of not-always-focused consciousness).

149. ars ◴[26 Jun 25 05:12 UTC] No.44384354{3}[source]▶

>>44382788 #

Install this: https://mybrowseraddon.com/video-speed-control.html

I listen to a lot of videos on 3 or even 4x.

150. ars ◴[26 Jun 25 05:13 UTC] No.44384356{4}[source]▶

>>44379461 #

I use this extension: https://mybrowseraddon.com/video-speed-control.html

151. likium ◴[26 Jun 25 05:26 UTC] No.44384418{3}[source]▶

>>44380152 #

What tool do you use?

152. BrunoJo ◴[26 Jun 25 05:38 UTC] No.44384477[source]▶

>>44376989 (OP) #

If you look for a cheaper transcription API you could als use https://Lemonfox.ai. We've optimized the API for long audio files and are much faster and cheaper than OpenAI.

153. BrunoJo ◴[26 Jun 25 05:47 UTC] No.44384523[source]▶

>>44379019 #

Let me know if you are interested in a more reliable transcription API. I'm building Lemonfox.ai and we've optimized our transcription API to be highly available and very fast for large files. Happy to give you a discount (email: bruno at lemonfox.ai)

154. raincole ◴[26 Jun 25 05:59 UTC] No.44384572[source]▶

>>44378334 #

A lot of people read newspaper.

Newspaper is essentially just an inaccurate summary of what really happened. So I don't find this realization that uncomfortable.

155. raincole ◴[26 Jun 25 06:10 UTC] No.44384623[source]▶

>>44377908 #

Geez, that repo[0] has 8k stars on Github?

Are people just staring it for meme value or something? Is this a scam?

[0]: https://github.com/Olow304/memvid

156. mt_ ◴[26 Jun 25 06:25 UTC] No.44384694{3}[source]▶

>>44383325 #

What is the speaker showcasing in its slides, what is it's body language and so on.

157. jwrallie ◴[26 Jun 25 06:30 UTC] No.44384720[source]▶

>>44378345 #

From my own experience with whisper.cpp, normalizing the audio and removing silence not only shortens the process time significantly, but also increases a lot the quality of the transcription, as silence can mean hallucinations. You can do that graphically with Audacity too, if you do not want to deal with the command line. You also do not need any special hardware to run whisper.cpp, with the small model literally any computer should be able to do it if you can wait a bit (less than the audio length).

One half interesting / half depressing observation I made is that at my workplace any meeting recording I tried to transcribe in this way had its length reduced to almost 2/3 when cutting off the silence. Makes you think about the efficiency (or lack of it) of holding long(ish) meetings.

replies(2): >>44384975 #>>44385016 #

158. weird-eye-issue ◴[26 Jun 25 06:43 UTC] No.44384794[source]▶

>>44384764 #

Can we ban this "person" for AI replies?

replies(1): >>44385745 #

159. conjecTech ◴[26 Jun 25 06:56 UTC] No.44384893[source]▶

>>44376989 (OP) #

If you are hosting whisper yourself, you can do something slightly more elegant, but with the same effect. You can downsample/pool the context 2:1 (or potentially more) a few layers into the encoder. That allows you to do the equivalent of speeding up audio without worry about potential spectral losses. For whisper large v3, that gets you nearly double throughput in exchange for a relative ~4% WER increase.

replies(1): >>44385135 #

160. ilyakaminsky ◴[26 Jun 25 07:04 UTC] No.44384932{5}[source]▶

>>44382158 #

It's sustainable, but not enough to retire on at this point.

> Just wondering if I cam build a retirement out of APIs :)

I think it's possible, but you need to find a way to add value beyond the commodity itself (e.g., audio classification and speaker diarization in my case).

161. d1sxeyes ◴[26 Jun 25 07:12 UTC] No.44384975{3}[source]▶

>>44384720 #

1/3 of the meeting is silence? That’s a good thing. It’s allowing people time to think over what they’re hearing, there are pauses to allow people to contribute or participate. What do you think a better percentage of silent time would be?

replies(1): >>44386518 #

162. sudhirj ◴[26 Jun 25 07:21 UTC] No.44385016{3}[source]▶

>>44384720 #

If a human meeting had lot of silence (assuming it's between words and not before / after), I would consider it a very efficient meeting where there was just enough information exchanged with adequate absorption, processing and response time.

163. nomercy400 ◴[26 Jun 25 07:42 UTC] No.44385135[source]▶

>>44384893 #

Do you have more details or examples on how to downsample the context in the encoder? I treat the encoder as an opaque block, so I have no idea where to start.

164. PeterStuer ◴[26 Jun 25 07:55 UTC] No.44385188[source]▶

>>44376989 (OP) #

I wonder how much time and battery transcoding/uploading/downloading over coffeeshop wifi would realy save vs just running it locally through optimized Whisper.

165. retsibsi ◴[26 Jun 25 08:14 UTC] No.44385284{3}[source]▶

>>44382864 #

Your speaking speed is noticeably faster than usual, but I think it's good for this kind of video. When the content is really dense and every word is chosen for maximum information value, a slower speed would be good, but for relatively natural speech with a normal amount of redundancy I think it's fine to go at this speed.

166. KPennig86852 ◴[26 Jun 25 08:51 UTC] No.44385489[source]▶

>>44376989 (OP) #

But you know that you can run OpenAI's Whisper audio recognition model locally for free, right? It has very little GPU requirements, and the new "turbo" model works quite fast (there are also several Python libraries which make it significantly faster still).

167. zelphirkalt ◴[26 Jun 25 08:51 UTC] No.44385490{4}[source]▶

>>44383398 #

Probably, because they are "A/B testing" things, that do not really show much effect or depend on more circumstances, than they care to eliminate and then overinterpret the results. Like almost all corporate A/B testing.

168. dajonker ◴[26 Jun 25 09:15 UTC] No.44385605[source]▶

>>44376989 (OP) #

Gemini 2.5 pro is, in my usage, quite superior for high quality transcriptions of phone calls, in Dutch in my case. As long as you upload the audio to GCS there you can easily process conversations of over an hour. It correctly identified and labeled speakers.

The cheaper 2.5 flash made noticeably more mistakes, for example it didn't correctly output numbers while the Pro model did.

As for OpenAI, their gpt-4o-transcribe model did worse than 2.5 flash, completely messing up names of places and/or people. Plus it doesn't label the conversation in turns, it just outputs a single continuous piece of text.

169. Raphell ◴[26 Jun 25 09:45 UTC] No.44385745{3}[source]▶

>>44384794 #

I get it. But I'm just someone who likes to think things through and say them simply.

replies(1): >>44385956 #

170. weird-eye-issue ◴[26 Jun 25 10:24 UTC] No.44385956{4}[source]▶

>>44385745 #

You aren't saying anything simply. You are verbosely saying nothing.

171. yashasolutions ◴[26 Jun 25 10:47 UTC] No.44386079[source]▶

>>44376989 (OP) #

the question would be how to do that but also still get proper time code when using whisper to get the subtitles

172. tucnak ◴[26 Jun 25 10:57 UTC] No.44386128{5}[source]▶

>>44381453 #

> FFT plus some basic analysis

Yeah, totally easier than `len(transcribe(a))/len(a)`

173. quietbritishjim ◴[26 Jun 25 11:08 UTC] No.44386182{3}[source]▶

>>44382864 #

Your actual speed of talking sounds a little faster than average but not notably so.

But it feels (very subjectively) faster to me than usual because you don't really seem to take any pauses. It's like the whole video is a single run-on sentence that I keep buffering, but I never get a chance to process it and flush the buffer.

174. phito ◴[26 Jun 25 11:21 UTC] No.44386272{6}[source]▶

>>44381944 #

> naming things is famously one of the two hard problems in computer science

Isn't ffmpeg made by a French person? As a francophone myself, I can tell you one of the biggest weakness of francophone programmers is naming things, even worse when it's in English. Maybe it's what's at play here.

175. ada1981 ◴[26 Jun 25 11:50 UTC] No.44386483{6}[source]▶

>>44381944 #

Curious if this is helpful.

https://claude.ai/public/artifacts/96ea8227-48c3-484d-b30b-6...

I had Claude rewrite the documentation for silenceremove based on your feedback.

176. jwrallie ◴[26 Jun 25 11:54 UTC] No.44386518{4}[source]▶

>>44384975 #

Good point, somehow if I think of a 30 minutes meeting, 10 minutes of silence sounds great, but seeing a 1 hour block disappear from a 3 hour recording makes me want to use that “free” hour to do something else.

Well, I don’t think silence is not the real problem with a 3 hour meeting!

177. lordspace ◴[26 Jun 25 12:07 UTC] No.44386599[source]▶

>>44379755 #

that's a really good summary :)

178. Graziano_M ◴[26 Jun 25 12:07 UTC] No.44386602[source]▶

>>44378164 #

Well a picture is worth a thousand tokens.

179. david_allison ◴[26 Jun 25 12:35 UTC] No.44386815{3}[source]▶

>>44382788 #

I have up to 4x (in steps of 0.05) with YouTube Premium on Android

↑