Transcribing Reference Videos with whisper
When you want the exact dialogue, timing, and structure of a specific video — not just the structural analysis — use the whisper skill. It turns audio into a transcript with timestamps.
Why you'd want a transcript
A creator scan tells you they use "authority-betrayal hooks." A transcript shows you the literal words they're using, the rhythm they speak in, and the exact timing of each beat.
Common use cases:
- Video copy workflows — you're replicating a specific video; you need the dialogue mapped scene-by-scene
- Hook analysis — what exact phrasing is making the first 4 seconds work
- Pacing study — how long is each beat? When do they cut?
- Ingesting source material — a client sent a long-form video; pull a transcript for reference
How to invoke
Plain transcription:
You: transcribe Assets/XYZG3/source.mp4
Claude:
Running whisper on Assets/XYZG3/source.mp4...
File length: 1m 12s
Detected language: English
Transcript with timestamps:
[00:00:00 — 00:00:04] So I'm 47 and my doctor said something that
completely changed my life.
[00:00:04 — 00:00:09] He told me that the brain fog, the mood
swings, and the 3am wake-ups weren't separate
problems.
[00:00:09 — 00:00:14] They were all connected to one thing. And I
had no idea.
[...continues for full video...]
Saved to Assets/XYZG3/transcript.txt.
With visual analysis (audio + per-timestamp visual notes):
You: break down Assets/XYZG3/source.mp4 — I want dialogue AND what's
happening visually at each timestamp.
Claude:
Running whisper for audio + vision for visual frames...
[00:00:00 — 00:00:04]
AUDIO: "So I'm 47 and my doctor said something..."
VISUAL: Tight selfie, chest-up framing, kitchen background. Soft
natural light from camera-left. Subject in cream sweater,
slight forward lean, holding coffee mug.
[00:00:04 — 00:00:09]
AUDIO: "He told me that the brain fog..."
VISUAL: Cut to slightly wider medium shot. Subject's free hand
gestures, mug still in other hand. Background slightly
blurred — kitchen island visible behind.
[00:00:09 — 00:00:14]
AUDIO: "They were all connected to one thing..."
VISUAL: Tightens back in. Subject leans further forward,
establishing direct eye contact with camera.
[...continues...]
The combined output is more useful for video-copy work because each beat shows the dialogue and the visuals side by side.
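If you want to work with the combined breakdown programmatically, a minimal sketch along these lines could turn it into one record per beat. It assumes the text layout shown in the example above (a timestamp line, then `AUDIO:` and `VISUAL:` lines with wrapped continuations); `parse_breakdown` and the record keys are hypothetical names for illustration, not part of the whisper skill.

```python
import re

# Matches a line such as "[00:00:04 — 00:00:09]" (em dash, en dash, or hyphen).
STAMP = re.compile(r"^\[(\d+:\d+:\d+)\s*[—–-]\s*(\d+:\d+:\d+)\]$")


def parse_breakdown(text):
    """Return one dict per beat with start, end, audio, and visual fields."""
    beats = []
    field = None  # which field a wrapped continuation line belongs to
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and elision markers like "[...continues...]".
        if not line or line.startswith("[..."):
            continue
        match = STAMP.match(line)
        if match:
            beats.append({"start": match.group(1), "end": match.group(2),
                          "audio": "", "visual": ""})
            field = None
        elif not beats:
            continue  # preamble before the first timestamp
        elif line.startswith("AUDIO:"):
            field = "audio"
            beats[-1]["audio"] = line[len("AUDIO:"):].strip()
        elif line.startswith("VISUAL:"):
            field = "visual"
            beats[-1]["visual"] = line[len("VISUAL:"):].strip()
        elif field is not None:
            # Wrapped line: append to whichever field came last.
            beats[-1][field] += " " + line
    for beat in beats:
        beat["audio"] = beat["audio"].strip('"')
    return beats
```

Each record then carries both layers for one beat, which is a convenient shape for mapping beats to scenes later.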
When to use audio-only vs. with visuals
| Situation | Audio-only | With visuals |
|---|---|---|
| Studying hook phrasing | Yes | — |
| Understanding pacing | Yes | — |
| Video-copy workflow | — | Yes |
| Building a creative template | — | Yes |
| Knowledge ingestion (long-form) | Yes | — |
| Compliance review of an existing video | Yes | — |
Audio-only is faster and cheaper. The visual analysis is an extra pass that adds context — use it when the visual structure matters.
Reading the timestamps
Timestamps use the [hh:mm:ss — hh:mm:ss] format, as in the examples above. The first timestamp is the start of that beat; the second is the end. The difference is the beat's duration, which maps directly to Veo clip planning:
Scene durations from the transcript:
Beat 1: 0:00 – 0:04 → 4 seconds → Veo clip 1 (within 8s limit, OK)
Beat 2: 0:04 – 0:09 → 5 seconds → Veo clip 2 (OK)
Beat 3: 0:09 – 0:14 → 5 seconds → Veo clip 3 (OK)
Beat 4: 0:14 – 0:22 → 8 seconds → Veo clip 4 (right at the limit)
Beat 5: 0:22 – 0:35 → 13 seconds → DOES NOT FIT in one Veo clip; must split
Any beat over 8 seconds needs to be split into multiple scenes when you build your .nbflow. The transcript is what tells you where the splits should go.
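The duration arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the pipeline: `parse_beats` and `plan_clips` are hypothetical helper names, and splitting an over-long beat into equal parts is only a placeholder assumption.

```python
import math
import re

MAX_CLIP_SECONDS = 8  # per-clip limit stated in the text above

# Matches "[00:00:04 — 00:00:09]"; the dash may be an em dash, en dash, or hyphen.
STAMP = re.compile(r"\[(\d+):(\d+):(\d+)\s*[—–-]\s*(\d+):(\d+):(\d+)\]")


def parse_beats(transcript):
    """Return (start, end) in whole seconds for every timestamp pair found."""
    beats = []
    for h1, m1, s1, h2, m2, s2 in STAMP.findall(transcript):
        start = int(h1) * 3600 + int(m1) * 60 + int(s1)
        end = int(h2) * 3600 + int(m2) * 60 + int(s2)
        beats.append((start, end))
    return beats


def plan_clips(beats):
    """Split each beat into the fewest equal-length clips that fit the limit."""
    plan = []
    for start, end in beats:
        duration = end - start
        parts = max(1, math.ceil(duration / MAX_CLIP_SECONDS))
        step = duration / parts
        plan.append([(start + i * step, start + (i + 1) * step)
                     for i in range(parts)])
    return plan
```

With the five beats listed above, the first four each yield one clip and the 13-second beat yields two. In practice you would move the split point to a sentence or visual boundary read from the transcript rather than cutting at the midpoint.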
Using the transcript for a video-copy workflow
The standard workflow for replicating a viral video uses the transcript as the starting point:
- Run whisper with visuals on the source video
- Map beats to scenes: each beat becomes one Veo clip, or several clips if it runs over 8s
- Note visual signatures per beat for the image gens
- Pass to the Visual Planner for storyboarding
- Pass to the Script Writer with instructions to ADAPT (not copy) the dialogue
The pipeline's video-copy workflow uses transcripts heavily — see Chapter 11 — Video Copy Workflow.
When transcripts have issues
Background music with vocals: Whisper may confuse music vocals with speech. Manually edit the transcript to remove any lyrics that crept in.
Strong accents / dialect: Whisper handles most accents well, but heavy regional dialect can produce errors. Spot-check the transcript against the audio.
Long pauses / silence: Whisper occasionally invents speech during silence. Spot-check any beat that seems suspiciously short or long.
Multiple speakers: Whisper transcribes all speech but doesn't differentiate speakers by default. If the source has two people talking, their lines run together with no speaker labels, so mark who says what by hand.
B-roll segments: B-roll cutaways often have voiceover continuing from the speaker. The transcript captures the audio, which is fine, but the visual analysis will report "B-roll cutaway" rather than "speaker."
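One way to decide which beats to spot-check is to compare each beat's word count with its duration. The sketch below is a heuristic under stated assumptions: `flag_suspicious` is a hypothetical helper, and the words-per-second thresholds are illustrative defaults, not values from whisper or the pipeline.

```python
def flag_suspicious(segments, min_wps=1.0, max_wps=4.5):
    """Flag beats whose speaking rate looks implausible.

    segments: iterable of (start_seconds, end_seconds, text).
    The thresholds are assumed defaults; tune them to the speaker.
    """
    flagged = []
    for start, end, text in segments:
        duration = end - start
        if duration <= 0:
            flagged.append((start, end, "non-positive duration"))
            continue
        rate = len(text.split()) / duration
        if rate < min_wps:
            # Few words over a long span: possible silence or invented speech.
            flagged.append((start, end, f"slow: {rate:.1f} words/s"))
        elif rate > max_wps:
            # Many words in a short span: possible music lyrics or overlap.
            flagged.append((start, end, f"fast: {rate:.1f} words/s"))
    return flagged
```

A flag is only a prompt to listen to that stretch of audio; it does not tell you the transcript is wrong.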
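Saving the transcript
By convention, save the transcript next to the source video in the same Assets/ subfolder, as in the example above, where Assets/XYZG3/source.mp4 produced Assets/XYZG3/transcript.txt.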
This is the same Assets/ folder where your contact sheets, reference frames, and other source material live. The Visual Planner and Script Writer both look here for source material when building a workflow from a reference video.
Transcribing your own old videos
Whisper isn't just for competitor research. It works on:
- Your existing winning videos — pull transcripts of your top-performers to use as templates for new variants
- Client-provided source material — interviews, presentations, raw footage they want adapted into short-form
- Long-form videos — podcasts, lectures, masterclasses. Extract usable clips by skimming the transcript.
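For long-form material, skimming can be narrowed with a simple keyword search over the timestamped segments. This is a minimal sketch: `find_mentions` is a hypothetical helper, and it assumes you already have the transcript as (start_seconds, end_seconds, text) tuples.

```python
def find_mentions(segments, keywords):
    """Return the segments whose text mentions any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [seg for seg in segments
            if any(k in seg[2].lower() for k in wanted)]
```

For example, `find_mentions(segments, ["brain fog"])` returns only the segments that mention that phrase, along with the timestamps to jump to in the source.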
When you're ready
→ Next: Building a Creative Template — synthesize all your research (niche scan + creator scan + transcripts) into a reusable template that the Visual Planner can apply.