whisper¶
What it does¶
Takes a video or audio file and pulls out everything that was said — a transcript with timestamps. Can also analyze the video visually (what's on screen at each moment).
Built on OpenAI's Whisper speech recognition model. Handles most languages and accents reliably.
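As a rough sketch of what the underlying call can look like with the open-source `openai-whisper` package (an assumption — the skill's actual backend may differ):

```python
def fmt(seconds: float) -> str:
    """Render seconds as HH:MM:SS, matching the transcript timestamp style."""
    s = int(seconds)
    return f"{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}"

def transcribe(path: str) -> str:
    """Transcribe a media file into '[start — end] text' lines."""
    import whisper  # pip install openai-whisper (also needs ffmpeg on PATH)

    model = whisper.load_model("base")   # "base" is fast; "large" is more accurate
    result = model.transcribe(path)      # whisper extracts the audio track itself
    return "\n".join(
        f"[{fmt(seg['start'])} — {fmt(seg['end'])}] {seg['text'].strip()}"
        for seg in result["segments"]
    )
```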
When to use it¶
- Reverse-engineering a viral video — pull the exact script that worked, study the timing
- Ingesting reference material — a client sent a 20-min interview and you want to scan it for usable quotes
- Knowledge updates — converting a long-form video into searchable text for the knowledge base
- Compliance review — checking what was said in an existing piece of content
Where it fits in the pipeline¶
```mermaid
flowchart LR
    A[Source video<br/>or audio] --> B[whisper skill]
    B --> C[Transcript<br/>with timestamps]
    C --> D{What for?}
    D -->|video copy| E[Map dialogue to scenes]
    D -->|research| F[Searchable notes]
    D -->|knowledge update| G[Knowledge Updater agent]
```
How to invoke¶
Basic transcription:

```
You: transcribe /path/to/source.mp4

Claude: [triggers whisper, returns a timestamped transcript]
```

Or with visual analysis:

```
You: break down /path/to/source.mp4 — what does it say AND what's on screen

Claude: [triggers whisper for audio, a vision model for the visual breakdown,
         returns a combined analysis]
```
Trigger phrases:
- "transcribe [file]"
- "extract dialogue from [file]"
- "get the transcript of [file]"
- "what does this video say" (when a media file path is provided)
- "break down this video"
What you get back¶
Audio-only transcription:
```
[00:00:00 — 00:00:04] So I'm 47 and my doctor said something that completely changed my life.
[00:00:04 — 00:00:09] He told me that the brain fog, the mood swings, and the 3am wake-ups weren't separate problems.
[00:00:09 — 00:00:14] They were all connected to one thing. And I had no idea.
```
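If you need to post-process a transcript in this shape, a small stdlib-only parser is enough (a hypothetical helper, not part of the skill itself):

```python
import re

# One transcript line: "[HH:MM:SS — HH:MM:SS] spoken text"
SEGMENT = re.compile(r"\[(\d{2}:\d{2}:\d{2}) — (\d{2}:\d{2}:\d{2})\]\s*(.+)")

def parse_transcript(text: str) -> list[tuple[str, str, str]]:
    """Return (start, end, line) tuples, skipping anything that isn't a segment."""
    out = []
    for line in text.splitlines():
        m = SEGMENT.match(line.strip())
        if m:
            out.append(m.groups())
    return out
```

Useful when mapping dialogue to scenes or building searchable notes from the raw transcript file.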
With visual analysis:
Adds per-timestamp notes on what's visible:
```
[00:00:00 — 00:00:04] AUDIO: "So I'm 47 and my doctor said something..." VISUAL: Tight selfie, kitchen background, soft natural light from camera-left. Subject in cream sweater, slight forward lean.
[00:00:04 — 00:00:09] AUDIO: "He told me that the brain fog..." VISUAL: Cut to medium shot. Subject holding a coffee mug, gestures with free hand.
```
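To work with the audio and visual tracks separately (say, to map dialogue to scenes), the combined output can be split per segment. A sketch, assuming the `[start — end] AUDIO: ... VISUAL: ...` shape shown above; `split_segments` is a hypothetical helper:

```python
import re

# One combined segment: "[HH:MM:SS — HH:MM:SS] AUDIO: ... VISUAL: ..."
PAIR = re.compile(
    r"\[(\d{2}:\d{2}:\d{2}) — (\d{2}:\d{2}:\d{2})\]\s*"
    r"AUDIO:\s*(.*?)\s*VISUAL:\s*(.*?)\s*(?=\[\d|\Z)",
    re.DOTALL,
)

def split_segments(text: str) -> list[dict]:
    """Split combined whisper + visual output into per-segment fields."""
    return [
        {"start": s, "end": e, "audio": a, "visual": v}
        for s, e, a, v in PAIR.findall(text)
    ]
```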
Example session¶
```
You: transcribe Assets/XYZG3/source.mp4 — this is a viral video I want
     to copy. I need the dialogue AND a sense of what's on screen at
     each scene.

Claude: Running whisper with visual analysis on Assets/XYZG3/source.mp4...

        File length: 1m 12s
        Detected language: English
        12 hard cuts (13 scenes — see contact sheets for the visual
        breakdown).

        Transcript saved to Assets/XYZG3/transcript.txt.

        Quick summary of structure:
        - 0:00-0:04 — Hook (authority-betrayal opener)
        - 0:04-0:09 — Symptom cluster
        - 0:09-0:30 — Story / personal experience
        - 0:30-0:50 — Product introduction
        - 0:50-1:12 — CTA

        Ready to delegate to the Script Writer (Mode 1 — adaptation)?
```
Limitations¶
- Speech only. If a video has no dialogue (pure music, ASMR, B-roll only), the transcript will be empty
- Won't detect emotion, sarcasm, or speaking style nuance — just the words
- Long videos take longer to process (a 30-min video = several minutes of transcription time)
- Background music with vocals can confuse the transcription
- Visual analysis is separate from audio — don't expect frame-perfect alignment of "what they said" with "what was on screen at that exact millisecond"
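Because a no-dialogue video yields an empty transcript, it can be worth checking for actual speech before delegating downstream (a hypothetical guard, assuming the timestamped line format above):

```python
def has_speech(transcript: str) -> bool:
    """True if the transcript contains any non-empty text after a timestamp."""
    return any(
        line.split("]", 1)[-1].strip()  # text after the "[start — end]" prefix
        for line in transcript.splitlines()
        if line.strip()
    )
```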