I often want information from a long YouTube video or audio recording, whether it’s a recipe video, an audio recording, or something educational, without having to listen to the entire thing.
Recently, LLMs that run locally have improved enough to work well on my MacBook Pro. That got me wondering:
Could I build a script that not only summarizes videos and audio, but lets me ask follow-up questions?
Turns out the answer is “yes”.
If you’re just here because you want to use these scripts for yourself, see:
The latter is just the former wrapped in an interactive chat interface. See the documentation in the script for more context.
Architecture AKA How It Works
The approach breaks down into a few key steps:
- Transcript Extraction: Download subtitles from YouTube if available, or extract audio and transcribe locally
- LLM Inference: Pass the transcript to a local LLM that can handle long contexts
- Interactive Chat (Optional): Use an agentic interface for follow-up questions
This enables you to ask questions about content without watching/listening to the whole thing.
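The first decision in this flow, whether the input is a YouTube URL, a bare video ID, or a local file, can be sketched as below. This is a minimal illustration, not the code from the actual scripts: `classify_input`, the host list, and the 11-character ID pattern are assumptions made for the example.

```python
import re
from pathlib import Path
from urllib.parse import parse_qs, urlparse

# Assumption: a YouTube video ID is 11 characters from [A-Za-z0-9_-].
VIDEO_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")
# Assumption: these hosts cover the URL shapes we care about.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_input(source: str) -> tuple[str, str]:
    """Return ("youtube", video_id) or ("local", path) for a CLI argument."""
    parsed = urlparse(source)
    host = (parsed.hostname or "").lower()
    if parsed.scheme in ("http", "https") and host in YOUTUBE_HOSTS:
        if host == "youtu.be":
            video_id = parsed.path.lstrip("/")
        else:
            video_id = parse_qs(parsed.query).get("v", [""])[0]
        if VIDEO_ID_RE.match(video_id):
            return ("youtube", video_id)
        raise ValueError(f"no video ID found in URL: {source}")

    # An existing file wins over a bare ID, so a file named like an ID still works.
    path = Path(source).expanduser()
    if path.exists():
        return ("local", str(path))
    if VIDEO_ID_RE.match(source):
        return ("youtube", source)
    raise ValueError(f"not a YouTube URL, video ID, or existing file: {source}")
```

Checking for an existing file before treating the argument as a bare ID is a design choice for this sketch; the real scripts may resolve that ambiguity differently.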
Architecture
%%{init: { 'theme': 'default' } }%%
graph TD
Input["Input: YouTube URL or YouTube video ID or Local file path"]
Detect{"YouTube or Local file?"}
YouTube["📹 YouTube"]
Local["🎵 Local Audio/Video"]
YTSubtitles{"Subtitles<br/>available?"}
YTFetch["Download subtitles<br/>(instant)"]
YTNoSub["Download audio +<br/>Parakeet-MLX"]
LocalExtract["Extract/trim audio<br/>via ffmpeg"]
LocalTranscribe["Parakeet-MLX<br/>transcription"]
Transcript["📄 Transcript<br/>(chars/tokens)"]
LLM["🤖 Local LLM<br/>(oMLX)"]
Response["Response"]
Chat{"Interactive?"}
OneShot["Return result"]
ChatMode["Launch chat session<br/>(pi-mono)"]
Input --> Detect
Detect -->|YouTube| YouTube
Detect -->|Local file| Local
YouTube --> YTSubtitles
YTSubtitles -->|Yes| YTFetch
YTSubtitles -->|No| YTNoSub
Local --> LocalExtract
LocalExtract --> LocalTranscribe
YTFetch --> Transcript
YTNoSub --> Transcript
LocalTranscribe --> Transcript
Transcript --> LLM
LLM --> Response
Response --> Chat
Chat -->|One-shot| OneShot
Chat -->|Yes| ChatMode
style Input fill:#e1f5ff
style Transcript fill:#f3e5f5
style LLM fill:#e8f5e9
style Response fill:#fff3e0
Example Executions
YouTube Video With Downloadable Subtitles
Conveniently, TED talks usually have downloadable transcripts. Let’s demonstrate the scripts first by summarizing the talk “Beware the Power of Prediction”.
$ audio-ask.py "https://www.youtube.com/watch?v=OS4wHmKtH-Q" "summarize this"
Transcript length: 23,471 chars
Processing time: 28.2s
This transcript is a speech about the nature of prediction, arguing that predictions are often tools of power rather than mere sources of knowledge.
### Core Argument
The speaker posits that while we often view predictions as objective descriptions of the future, they are actually **"speech acts"**—language that performs an action. In a social context, predictions can act as "veiled commands" or "magnets" that shape reality to fulfill the interests of the predictor, creating self-fulfilling prophecies.
### Key Themes
* **Power vs. Knowledge:** The speaker uses the historical anecdote of King Louis XI and his astrologer to illustrate how predictions can be used to manipulate or secure one's position. In the modern era, this is mirrored by tech executives and AI, where predictions are used to justify value-laden decisions under the guise of "facts."
* **The Illusion of Inevitability:** A major "red flag" identified is when predictors claim the future is inevitable. This is a tactic to stop questioning and encourage obedience.
* **Social Impact and Injustice:** The speaker warns that algorithmic predictions (e.g., regarding insurance or criminal justice) can create a "Kafkaesque" world where decisions are unchallengeable because they are based on unverifiable forecasts rather than historical facts.
* **The Role of Uncertainty:** The speech reframes uncertainty not as a threat to be eliminated, but as "good news" that signifies the future is still unwritten and subject to human agency.
You can then ask follow-up questions: `pi -c "ask your question here"`
SUCCESS | === ASK STATS ===
Agent: pi
Model: gemma-4-26b-a4b-it-4bit
Transcript length: 23,470 chars
Transcription: 3.9s
LLM response: 28.2s
Total: 32.1s
This worked great!
Very Large YouTube Video (~6 Hours Long)
Next, let’s test with this 5:55:12 “documentary” about a science fiction game called EVE Online to highlight the capability and performance of this approach.
$ audio-ask.py --pi-model Qwen3.6-35B-A3B-8bit-long-context \
"https://www.youtube.com/watch?v=BCSeISYcoyI" "summarize this"
- Transcript: 758,153 characters (323k tokens)
- Model: Qwen3.6-35B-A3B-8bit-long-context (512K context)
- Processing Time: 28m 27s (LLM thinking)
- Total Time: 28m 31s (including metadata + subtitle fetch)
The LLM successfully produced a comprehensive, well-structured summary covering the game’s 20-year history across multiple eras. I’m omitting the full summary from this post since it’s not central to demonstrating the approach’s capabilities. This example demonstrates:
- Our approach handles 758k+ character transcripts without crashing
- It produces coherent, well-structured summaries
- It completes in reasonable time (~30 minutes for a 5:55:12 video)
- The 512K context window stayed stable with roughly 323k tokens of input
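To put "reasonable time" into numbers, the figures reported above can be turned into a rough throughput estimate. This is only arithmetic on the rounded values in this post (323k tokens, 28m 27s, 5:55:12); the token rate lumps prompt processing and answer generation together, so treat it as an aggregate figure.

```python
def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


video_s = to_seconds(hours=5, minutes=55, seconds=12)  # video runtime, 5:55:12
llm_s = to_seconds(minutes=28, seconds=27)             # reported LLM time
tokens = 323_000                                       # "323k tokens", rounded

tokens_per_second = tokens / llm_s
speedup_vs_realtime = video_s / llm_s

print(f"{tokens_per_second:.0f} tokens/s, "
      f"{speedup_vs_realtime:.1f}x faster than real time")
# prints "189 tokens/s, 12.5x faster than real time"
```

In other words, the summary arrived about twelve times faster than watching the video at normal speed.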
As a further check, I asked a question whose answer I already knew and that the video covers, and the tool answered correctly:
Me: What was the name of the first Titan class ship produced called?
LLM: Based on the transcript, the name of EVE Online’s first Titan ship was Steve.
One correct answer is not a benchmark, but it shows the model can retrieve a specific detail from a transcript of this size.
Local Audio File
This approach also works with local audio files:
$ audio-ask.py ~/Downloads/meeting_recording.m4a "summarize the key decisions"
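For local files, the audio is extracted (and optionally trimmed) with ffmpeg before transcription. A minimal sketch of that step might look like the following. The function names are hypothetical, and the 16 kHz mono output is an assumption (a common input format for speech models), not something taken from the actual scripts.

```python
import subprocess


def ffmpeg_extract_args(src, dst, start=None, duration=None):
    """Build an ffmpeg command that writes mono 16 kHz audio from src to dst."""
    args = ["ffmpeg", "-y"]                # -y: overwrite dst if it exists
    if start is not None:
        args += ["-ss", str(start)]        # seek in the input before decoding
    args += ["-i", str(src)]
    if duration is not None:
        args += ["-t", str(duration)]      # limit the length of the output
    # -vn drops any video stream; -ac 1 is mono; -ar 16000 is the sample rate.
    args += ["-vn", "-ac", "1", "-ar", "16000", str(dst)]
    return args


def extract_audio(src, dst, start=None, duration=None):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(ffmpeg_extract_args(src, dst, start, duration), check=True)
```

For example, `extract_audio("meeting_recording.m4a", "meeting.wav", start=60, duration=600)` would keep ten minutes of audio starting one minute in.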
Limitations & Tradeoffs
The primary limitation of this approach is the model’s context window size. Not all models can handle very long transcripts in a single pass.
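A quick way to tell in advance whether a transcript will fit is to estimate its token count from its character count. The sketch below uses the ratio observed in the EVE Online run (758,153 characters for about 323k tokens). That ratio depends on the tokenizer and the text, so it is a rough guide rather than a constant; the function names and the `reserve_tokens` default are assumptions for illustration.

```python
# Ratio observed in the EVE Online run: 758,153 chars for ~323k tokens.
OBSERVED_CHARS_PER_TOKEN = 758_153 / 323_000  # about 2.35


def estimate_tokens(n_chars: int,
                    chars_per_token: float = OBSERVED_CHARS_PER_TOKEN) -> int:
    """Rough token estimate from a character count."""
    return round(n_chars / chars_per_token)


def fits_in_context(n_chars: int, context_tokens: int,
                    reserve_tokens: int = 8_192,
                    chars_per_token: float = OBSERVED_CHARS_PER_TOKEN) -> bool:
    """True if the transcript plus a reserve for the prompt and answer fits."""
    return estimate_tokens(n_chars, chars_per_token) + reserve_tokens <= context_tokens


print(fits_in_context(758_153, context_tokens=512 * 1024))  # 512K window
print(fits_in_context(758_153, context_tokens=256 * 1024))  # 256K window
```

With these assumptions, the six-hour transcript fits in a 512K window but not in a 256K one, which matches the need for context extension described next.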
Solution 1: Use a Model with Extended Context
For large transcripts, use a model specifically trained or optimized for long contexts. In the EVE Online example above (758k characters, about 323k tokens), Qwen3.6-35B with YaRN context extension ran with a 512K-token context window.
Here’s how it works: Qwen3.6 has a native 256K context window. YaRN is a RoPE (Rotary Position Embeddings) scaling technique: roughly speaking, it rescales the position encodings so that sequences longer than those seen in training still fall within a range the model can handle. This lets the model stretch to 1M tokens. The extension trades some quality and speed for capacity, but it enables processing very long documents without modifying the model weights.
For more technical details, see:
- RoPE and context extension in LM Studio (approachable overview)
- YaRN: Efficient Context Window Extension of LLMs (technical paper)
I chose this approach because it worked on my beefy system and was simpler to implement.
Solution 2: Chunking Strategy
If a large-context model isn’t available or practical, recursively process the transcript:
- Split the transcript into overlapping chunks that fit within the model’s context window
- Apply your query to each chunk independently
- Summarize the combined results in a final pass
This approach is more computationally expensive but works with smaller models.
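The steps above can be sketched as follows. This is not part of the linked scripts: `split_overlapping`, `ask_in_chunks`, and the `ask(context, question)` callable (a stand-in for whatever sends text to your local LLM) are hypothetical names, and the default chunk sizes are arbitrary.

```python
from typing import Callable


def split_overlapping(text: str, chunk_chars: int, overlap_chars: int) -> list[str]:
    """Split text into chunks of chunk_chars that overlap by overlap_chars."""
    if not 0 <= overlap_chars < chunk_chars:
        raise ValueError("overlap_chars must be >= 0 and smaller than chunk_chars")
    if len(text) <= chunk_chars:
        return [text]
    step = chunk_chars - overlap_chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def ask_in_chunks(transcript: str, question: str,
                  ask: Callable[[str, str], str],
                  chunk_chars: int = 200_000, overlap_chars: int = 2_000,
                  max_rounds: int = 4) -> str:
    """Query each chunk, then repeat on the combined answers until one chunk remains."""
    text = transcript
    for _ in range(max_rounds):
        chunks = split_overlapping(text, chunk_chars, overlap_chars)
        if len(chunks) == 1:
            return ask(chunks[0], question)  # final pass over the whole text
        text = "\n\n".join(ask(chunk, question) for chunk in chunks)
    raise RuntimeError("partial answers did not shrink enough to fit one chunk")
```

The overlap reduces the chance that a sentence split across a chunk boundary is lost, and `max_rounds` guards against the case where the partial answers never get short enough for a single pass.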
Conclusion
Overall, these scripts have been pretty useful in my day-to-day. Hopefully this post inspires someone, and the linked scripts help you set up something like this for yourself more quickly. Cheers!