whisper

What it does

Takes a video or audio file and pulls out everything that was said — a transcript with timestamps. Can also analyze the video visually (what's on screen at each moment).

Built on OpenAI's Whisper speech recognition model. Handles most languages and accents reliably.

When to use it

  • Reverse-engineering a viral video — pull the exact script that worked, study the timing
  • Ingesting reference material — a client sent a 20-min interview and you want to scan it for usable quotes
  • Knowledge updates — converting a long-form video into searchable text for the knowledge base
  • Compliance review — checking what was said in an existing piece of content

Where it fits in the pipeline

```mermaid
flowchart LR
    A[Source video<br/>or audio] --> B[whisper skill]
    B --> C[Transcript<br/>with timestamps]
    C --> D{What for?}
    D -->|video copy| E[Map dialogue to scenes]
    D -->|research| F[Searchable notes]
    D -->|knowledge update| G[Knowledge Updater agent]
```

How to invoke

You: transcribe /path/to/source.mp4

Claude: [triggers whisper, returns transcript with timestamps]

Or with visual analysis:

You: break down /path/to/source.mp4 — what does it say AND what's on screen

Claude: [triggers whisper for audio, vision model for visual breakdown,
         returns combined analysis]

Trigger phrases:

  • "transcribe [file]"
  • "extract dialogue from [file]"
  • "get the transcript of [file]"
  • "what does this video say" (when a media file path is provided)
  • "break down this video"

What you get back

Audio-only transcription

```
[00:00:00 — 00:00:04] So I'm 47 and my doctor said something that completely changed my life.

[00:00:04 — 00:00:09] He told me that the brain fog, the mood swings, and the 3am wake-ups weren't separate problems.

[00:00:09 — 00:00:14] They were all connected to one thing. And I had no idea.
```
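
The bracketed timestamps are simple to produce from segment start/end times in seconds. A small sketch: the segment dicts mimic the shape the open-source openai-whisper package returns from `transcribe()`, but the sample data here is invented:

```python
# Sketch: turning Whisper-style segments (start/end in seconds) into the
# bracketed transcript lines shown above. Sample segments are illustrative.

def hms(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}"

segments = [
    {"start": 0.0, "end": 4.0, "text": "So I'm 47 and my doctor said something that completely changed my life."},
    {"start": 4.0, "end": 9.0, "text": "He told me that the brain fog, the mood swings, and the 3am wake-ups weren't separate problems."},
]

for seg in segments:
    print(f"[{hms(seg['start'])} — {hms(seg['end'])}] {seg['text']}")
# first line → [00:00:00 — 00:00:04] So I'm 47 and my doctor said something that completely changed my life.
```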

With visual analysis

Adds per-timestamp notes on what's visible:

[00:00:00 — 00:00:04] AUDIO: "So I'm 47 and my doctor said something..."
                      VISUAL: Tight selfie, kitchen background, soft
                      natural light from camera-left. Subject in
                      cream sweater, slight forward lean.

[00:00:04 — 00:00:09] AUDIO: "He told me that the brain fog..."
                      VISUAL: Cut to medium shot. Subject holding
                      a coffee mug, gestures with free hand.

Example session

You: transcribe Assets/XYZG3/source.mp4 — this is a viral video I want
     to copy. I need the dialogue AND a sense of what's on screen at
     each scene.

Claude: Running whisper with visual analysis on Assets/XYZG3/source.mp4...

        File length: 1m 12s
        Detected language: English
        12 hard cuts (13 scenes — see contact sheets for the visual
        breakdown).

        Transcript saved to Assets/XYZG3/transcript.txt.

        Quick summary of structure:
        - 0:00-0:04 — Hook (authority-betrayal opener)
        - 0:04-0:09 — Symptom cluster
        - 0:09-0:30 — Story / personal experience
        - 0:30-0:50 — Product introduction
        - 0:50-1:12 — CTA / close

        Ready to delegate to the Script Writer (Mode 1 — adaptation)?

Limitations

  • Speech only. If a video has no dialogue (pure music, ASMR, B-roll only), the transcript will be empty
  • Won't detect emotion, sarcasm, or speaking style nuance — just the words
  • Long videos take longer to process (a 30-min video = several minutes of transcription time)
  • Background music with vocals can confuse the transcription
  • Visual analysis is separate from audio — don't expect frame-perfect alignment of "what they said" with "what was on screen at that exact millisecond"