Ask Questions About Audio and YouTube Locally

I often want information from a long YouTube video or audio recording, whether it’s a recipe or something educational, without having to listen to the entire thing.

Recently, local LLMs have improved enough that they run well on my MacBook Pro. That got me wondering:

Could I build a script that not only summarizes videos and audio, but lets me ask follow-up questions?

Turns out the answer is “yes”.

If you’re just here because you want to use these scripts for yourself, see:

The latter is just the former wrapped in an interactive chat interface. See the documentation in the script for more context.

Architecture AKA How It Works

The approach breaks down into a few key steps:

Transcript Extraction: Download subtitles from YouTube if available, or extract audio and transcribe locally
LLM Inference: Pass the transcript to a local LLM that can handle long contexts
Interactive Chat (Optional): Use an agentic interface for follow-up questions

This enables you to ask questions about content without watching/listening to the whole thing.
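
As an illustration of the transcript extraction step, here is a minimal sketch of turning downloaded WebVTT subtitles into plain text for the LLM. It assumes the subtitles arrive as WebVTT, and the function name `vtt_to_text` is hypothetical; the actual scripts may clean subtitles differently.

```python
import re

# Matches cue timing lines such as "00:00:01.000 --> 00:00:03.500".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}(?::\d{2})?\.\d{3}\s+-->\s+")
# Matches inline markup such as <c> or <00:00:01.500>.
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str) -> str:
    """Convert WebVTT captions to plain text (hypothetical helper)."""
    lines = []
    for raw in vtt.splitlines():
        line = raw.strip()
        # Skip blank lines and header/metadata lines.
        if not line or line.startswith(("WEBVTT", "NOTE", "Kind:", "Language:")):
            continue
        # Skip cue timings and numeric cue identifiers.
        if TIMESTAMP.match(line) or line.isdigit():
            continue
        line = TAG.sub("", line).strip()
        # Auto-generated captions often repeat the previous line; keep one copy.
        if line and (not lines or lines[-1] != line):
            lines.append(line)
    return " ".join(lines)
```

Dropping timestamps and repeated lines keeps the transcript short, which matters because every character counts against the model’s context window.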

Architecture

%%{init: { 'theme': 'default' } }%%
graph TD
    Input["Input: YouTube URL or YouTube video ID or Local file path"]
    Detect{"YouTube or Local file?"}
    
    YouTube["📹 YouTube"]
    Local["🎵 Local Audio/Video"]
    
    YTSubtitles{"Subtitles<br/>available?"}
    YTFetch["Download subtitles<br/>(instant)"]
    YTNoSub["Download audio +<br/>Parakeet-MLX"]
    
    LocalExtract["Extract/trim audio<br/>via ffmpeg"]
    LocalTranscribe["Parakeet-MLX<br/>transcription"]
    
    Transcript["📄 Transcript<br/>(chars/tokens)"]
    LLM["🤖 Local LLM<br/>(oMLX)"]
    Response["Response"]
    Chat{"Interactive?"}
    OneShot["Return result"]
    ChatMode["Launch chat session<br/>(pi-mono)"]
    
    Input --> Detect
    Detect -->|YouTube| YouTube
    Detect -->|Local file| Local
    
    YouTube --> YTSubtitles
    YTSubtitles -->|Yes| YTFetch
    YTSubtitles -->|No| YTNoSub
    
    Local --> LocalExtract
    LocalExtract --> LocalTranscribe
    
    YTFetch --> Transcript
    YTNoSub --> Transcript
    LocalTranscribe --> Transcript
    
    Transcript --> LLM
    LLM --> Response
    Response --> Chat
    Chat -->|No| OneShot
    Chat -->|Yes| ChatMode
    
    style Input fill:#e1f5ff
    style Transcript fill:#f3e5f5
    style LLM fill:#e8f5e9
    style Response fill:#fff3e0
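
The first decision in the diagram (YouTube or local file?) could be sketched roughly as follows. This is an assumption about how the input might be classified, not the scripts’ actual logic; `classify_input` and the host list are hypothetical.

```python
import os
import re
from urllib.parse import parse_qs, urlparse

# YouTube video IDs are 11 characters from this alphabet.
VIDEO_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_input(source: str) -> tuple[str, str]:
    """Return ("youtube", video_id) or ("local", path). Hypothetical helper."""
    parsed = urlparse(source)
    host = (parsed.hostname or "").lower()
    if host in YOUTUBE_HOSTS:
        if host == "youtu.be":
            video_id = parsed.path.lstrip("/")
        else:
            video_id = parse_qs(parsed.query).get("v", [""])[0]
        if VIDEO_ID.match(video_id):
            return ("youtube", video_id)
        raise ValueError(f"no video ID found in {source!r}")
    # A bare 11-character token is treated as a video ID unless a file
    # with that name exists. URLs without a scheme fall through to "local".
    if VIDEO_ID.match(source) and not os.path.exists(source):
        return ("youtube", source)
    return ("local", source)
```

For example, `classify_input("https://www.youtube.com/watch?v=OS4wHmKtH-Q")` returns `("youtube", "OS4wHmKtH-Q")`, while a path such as `~/Downloads/meeting_recording.m4a` is classified as local.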

Example Executions

YouTube Video With Downloadable Subtitles

Conveniently, TED talks usually have downloadable transcripts. Let’s start by summarizing the video “Beware the Power of Prediction”.

$ audio-ask.py "https://www.youtube.com/watch?v=OS4wHmKtH-Q" "summarize this"

Transcript length: 23,471 chars
Processing time: 28.2s

This transcript is a speech about the nature of prediction, arguing that predictions are often tools of power rather than mere sources of knowledge.

### Core Argument
The speaker posits that while we often view predictions as objective descriptions of the future, they are actually **"speech acts"**—language that performs an action. In a social context, predictions can act as "veiled commands" or "magnets" that shape reality to fulfill the interests of the predictor, creating self-fulfilling prophecies.

### Key Themes
*   **Power vs. Knowledge:** The speaker uses the historical anecdote of King Louis XI and his astrologer to illustrate how predictions can be used to manipulate or secure one's position. In the modern era, this is mirrored by tech executives and AI, where predictions are used to justify value-laden decisions under the guise of "facts."
*   **The Illusion of Inevitability:** A major "red flag" identified is when predictors claim the future is inevitable. This is a tactic to stop questioning and encourage obedience.
*   **Social Impact and Injustice:** The speaker warns that algorithmic predictions (e.g., regarding insurance or criminal justice) can create a "Kafkaesque" world where decisions are unchallengeable because they are based on unverifiable forecasts rather than historical facts.
*   **The Role of Uncertainty:** The speech reframes uncertainty not as a threat to be eliminated, but as "good news" that signifies the future is still unwritten and subject to human agency.

You can then ask follow-up questions: `pi -c "ask your question here"`
SUCCESS  | === ASK STATS ===
  Agent:              pi
  Model:              gemma-4-26b-a4b-it-4bit
  Transcript length:  23,470 chars
  Transcription:       3.9s
  LLM response:        28.2s
  Total:               32.1s

This worked great!

Very Large YouTube Video (~6 Hours)

Next, let’s test with this 5:55:12 “documentary” about a science fiction game called EVE Online to highlight the capability and performance of this approach.

$ audio-ask.py --pi-model Qwen3.6-35B-A3B-8bit-long-context \
  "https://www.youtube.com/watch?v=BCSeISYcoyI" "summarize this"

Transcript: 758,153 characters (323k tokens)
Model: Qwen3.6-35B-A3B-8bit-long-context (512K context)
Processing Time: 28m 27s (LLM thinking)
Total Time: 28m 31s (including metadata + subtitle fetch)

The LLM successfully produced a comprehensive, well-structured summary covering the game’s 20-year history across multiple eras. I’m omitting the full summary from this post since it’s not central to demonstrating the approach’s capabilities. This example demonstrates:

Our approach handles 758k+ character transcripts without crashing
It produces coherent, well-structured summaries
It completes in a reasonable time (~30 minutes for a 5:55:12 video)
The 512K context window is stable and effective

As a further check, I asked a question whose answer I already knew and that the video covers. The tool answered correctly:

Me: What was the name of the first Titan class ship produced called?

LLM: Based on the transcript, the name of EVE Online’s first Titan ship was Steve.

A single correct answer is not a benchmark, but it suggests the model can retrieve a specific detail from a very long transcript.

Local Audio File

This approach also works with local audio files:

$ audio-ask.py ~/Downloads/meeting_recording.m4a "summarize the key decisions"
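
For local files, the “extract/trim audio via ffmpeg” step from the diagram might look like the sketch below. It only builds the command line. The 16 kHz mono WAV target is an assumption (a common input format for speech recognition models), and `ffmpeg_extract_cmd` is a hypothetical helper, not a function from the scripts.

```python
def ffmpeg_extract_cmd(src, dst, start=None, duration=None):
    """Build an ffmpeg command that extracts (and optionally trims) audio."""
    cmd = ["ffmpeg", "-y"]  # -y: overwrite the output file if it exists
    if start is not None:
        cmd += ["-ss", str(start)]  # seek to the start offset (seconds)
    cmd += ["-i", src]
    if duration is not None:
        cmd += ["-t", str(duration)]  # keep only this many seconds
    # -vn drops any video stream; -ac 1 / -ar 16000 give mono 16 kHz audio.
    cmd += ["-vn", "-ac", "1", "-ar", "16000", dst]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, after which the WAV file is ready for transcription.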

Limitations & Tradeoffs

The primary limitation of this approach is the model’s context window size. Not all models can handle very long transcripts in a single pass.
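
A rough way to check in advance whether a transcript will fit is to estimate its token count from its length. The sketch below uses the ratio observed in the EVE Online run above (758,153 characters for about 323k tokens, roughly 2.35 characters per token). That ratio depends on the tokenizer and the language, and the reserve for the question and the answer is an arbitrary assumption.

```python
# Reference point from the EVE Online run: 758,153 chars for about 323k tokens.
REF_CHARS = 758_153
REF_TOKENS = 323_000


def estimate_tokens(num_chars: int) -> int:
    """Estimate tokens from characters, rounding up (integer ceil division)."""
    return -(-num_chars * REF_TOKENS // REF_CHARS)


def fits_in_context(num_chars: int, context_tokens: int,
                    reserve_tokens: int = 8_192) -> bool:
    """True if the transcript plus a reserve for prompt and answer fits."""
    return estimate_tokens(num_chars) + reserve_tokens <= context_tokens


print(fits_in_context(758_153, 512 * 1024))  # extended 512K window
print(fits_in_context(758_153, 256 * 1024))  # 256K window
```

By this estimate the six-hour transcript fits in a 512K-token window but not in a 256K one, which is why the extended-context model was needed for that run.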

Solution 1: Use a Model with Extended Context

For large transcripts, use a model specifically trained or optimized for long contexts. As demonstrated with the 5:55:12 EVE Online documentary (758k characters, about 323k tokens), Qwen3.6-35B with YaRN context extension handled the whole transcript in a 512K-token context window.

Here’s how it works: Qwen3.6 has a native 256K context window. YaRN is a RoPE (Rotary Position Embedding) scaling method, an interpolation technique that rescales the position encoding so the model can attend over sequences longer than those it was trained on. With it, the context window can be extended to 1M tokens. This trades some quality and speed for capacity, but it enables processing very long documents without modifying the model weights.
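
To make the scaling concrete: the YaRN factor is the ratio of the target context to the native one, so going from 256K to 512K is a factor of 2 and 1M is a factor of 4. The sketch below computes it and emits a settings dictionary. The key names follow the `rope_scaling` convention used in Hugging Face model configs; whether the model and runtime used in this post read exactly these keys is an assumption.

```python
def yarn_rope_scaling(native_ctx: int, target_ctx: int) -> dict:
    """Build a YaRN rope_scaling entry (key names are an assumption)."""
    if target_ctx <= native_ctx:
        raise ValueError("target context must exceed the native context")
    return {
        "rope_type": "yarn",
        # How far beyond the trained window the positions are stretched.
        "factor": target_ctx / native_ctx,
        "original_max_position_embeddings": native_ctx,
    }


native = 256 * 1024  # 262,144 tokens
print(yarn_rope_scaling(native, 512 * 1024))   # factor 2.0
print(yarn_rope_scaling(native, 1024 * 1024))  # factor 4.0
```

A smaller factor generally stays closer to the model’s trained behavior, so it is reasonable to pick the smallest factor that covers the transcripts you expect.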

For more technical details, see:

RoPE and context extension in LM Studio (approachable overview)
YaRN: Efficient Context Window Extension of LLMs (technical paper)

I chose this approach because it worked on my beefy system and was simpler to implement.

Solution 2: Chunking Strategy

If a large-context model isn’t available or practical, recursively process the transcript:

Split the transcript into overlapping chunks that fit within the model’s context window
Apply your query to each chunk independently
Summarize the combined results in a final pass

This approach is more computationally expensive but works with smaller models.
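
The three steps above can be sketched as follows. I did not implement this path, so treat it as an illustration under stated assumptions: `ask` is a hypothetical callable that sends one prompt to whatever local LLM you use and returns its text, and the chunk sizes and prompt wording are arbitrary.

```python
from typing import Callable


def split_with_overlap(text: str, chunk_chars: int, overlap_chars: int) -> list[str]:
    """Split text into chunks that share overlap_chars with their neighbor."""
    if not 0 <= overlap_chars < chunk_chars:
        raise ValueError("overlap must be >= 0 and smaller than the chunk size")
    step = chunk_chars - overlap_chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def ask_in_chunks(transcript: str, question: str, ask: Callable[[str], str],
                  chunk_chars: int = 60_000, overlap_chars: int = 2_000) -> str:
    """Apply the question to each chunk, then combine the partial answers."""
    chunks = split_with_overlap(transcript, chunk_chars, overlap_chars)
    if not chunks:
        raise ValueError("empty transcript")
    partials = [
        ask(f"{question}\n\nTranscript excerpt {i} of {len(chunks)}:\n{chunk}")
        for i, chunk in enumerate(chunks, start=1)
    ]
    if len(partials) == 1:
        return partials[0]
    combined = "\n\n".join(partials)
    # Final pass: merge the per-chunk answers into a single response.
    return ask(f"Combine these partial answers into one answer to: {question}\n\n{combined}")
```

If the combined partial answers are themselves too long for the model, the same function can be applied to them again, which is where the recursion comes in.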

Conclusion

Overall, these scripts have been pretty useful in my day-to-day. Hopefully this post inspires someone, and the linked scripts help you set up something similar for yourself more quickly. Cheers!
