
Transcribing Reference Videos with whisper

When you need the exact dialogue and timing of a specific video, not just a structural analysis, use the whisper skill. It turns the video's audio into a timestamped transcript.

Why you'd want a transcript

A creator scan tells you that a creator uses "authority-betrayal hooks." A transcript shows you the literal words they say, the rhythm they speak in, and the exact timing of each beat.

Common use cases:

  • Video copy workflows — you're replicating a specific video; you need the dialogue mapped scene-by-scene
  • Hook analysis — what exact phrasing is making the first 4 seconds work
  • Pacing study — how long is each beat? When do they cut?
  • Ingesting source material — a client sent a long-form video; pull a transcript for reference

How to invoke

Plain transcription:

You: transcribe Assets/XYZG3/source.mp4

Claude:
  Running whisper on Assets/XYZG3/source.mp4...
  File length: 1m 12s
  Detected language: English

  Transcript with timestamps:

  [00:00:00 — 00:00:04] So I'm 47 and my doctor said something that
                        completely changed my life.

  [00:00:04 — 00:00:09] He told me that the brain fog, the mood
                        swings, and the 3am wake-ups weren't separate
                        problems.

  [00:00:09 — 00:00:14] They were all connected to one thing. And I
                        had no idea.

  [...continues for full video...]

  Saved to Assets/XYZG3/transcript.txt.
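
If you want to work with the saved transcript programmatically, a small parser is enough. The sketch below assumes the file on disk uses the same layout as the output shown above (a bracketed hh:mm:ss pair followed by text, with wrapped lines indented underneath); that layout and the function name `parse_transcript` are assumptions for illustration, not a documented format.

```python
import re

# Matches a line such as "[00:00:00 — 00:00:04] So I'm 47 ...".
# The separator may be an em dash, en dash, or hyphen.
BEAT_RE = re.compile(
    r"^\s*\[(\d{2}):(\d{2}):(\d{2})\s*[—–-]\s*(\d{2}):(\d{2}):(\d{2})\]\s*(.*)$"
)

def parse_transcript(text):
    """Return a list of {"start", "end", "text"} dicts, times in seconds."""
    beats = []
    for line in text.splitlines():
        stripped = line.strip()
        match = BEAT_RE.match(line)
        if match:
            h1, m1, s1, h2, m2, s2 = (int(g) for g in match.groups()[:6])
            beats.append({
                "start": h1 * 3600 + m1 * 60 + s1,
                "end": h2 * 3600 + m2 * 60 + s2,
                "text": match.group(7).strip(),
            })
        elif stripped and beats and not stripped.startswith("["):
            # Wrapped continuation line: append it to the current beat.
            beats[-1]["text"] = (beats[-1]["text"] + " " + stripped).strip()
    return beats

sample = """\
[00:00:00 — 00:00:04] So I'm 47 and my doctor said something that
                      completely changed my life.

[00:00:04 — 00:00:09] He told me that the brain fog, the mood
                      swings, and the 3am wake-ups weren't separate
                      problems.
"""
beats = parse_transcript(sample)
print(beats[0])
```

Storing times as whole seconds keeps later arithmetic (durations, clip splits) simple.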

With visual analysis (audio + per-timestamp visual notes):

You: break down Assets/XYZG3/source.mp4 — I want dialogue AND what's
     happening visually at each timestamp.

Claude:
  Running whisper for audio + vision for visual frames...

  [00:00:00 — 00:00:04]
    AUDIO: "So I'm 47 and my doctor said something..."
    VISUAL: Tight selfie, chest-up framing, kitchen background. Soft
            natural light from camera-left. Subject in cream sweater,
            slight forward lean, holding coffee mug.

  [00:00:04 — 00:00:09]
    AUDIO: "He told me that the brain fog..."
    VISUAL: Cut to slightly wider medium shot. Subject's free hand
            gestures, mug still in other hand. Background slightly
            blurred — kitchen island visible behind.

  [00:00:09 — 00:00:14]
    AUDIO: "They were all connected to one thing..."
    VISUAL: Tightens back in. Subject leans further forward,
            establishing direct eye contact with camera.

  [...continues...]

The combined output is much richer for video-copy work because you see both layers at once.
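
One way to picture the combined output is as a single record per beat that carries both layers. The sketch below is illustrative only: the `Beat` class, the `merge_layers` helper, and the tuple shapes are hypothetical, and it assumes the audio and visual passes report the same start and end times for each beat.

```python
from dataclasses import dataclass

@dataclass
class Beat:
    start: int        # seconds from the start of the video
    end: int          # seconds from the start of the video
    audio: str = ""   # spoken dialogue for this window
    visual: str = ""  # visual notes for this window

def merge_layers(audio_segments, visual_notes):
    """Join audio and visual entries that share the same (start, end) window."""
    beats = {}
    for start, end, text in audio_segments:
        beats[(start, end)] = Beat(start, end, audio=text)
    for start, end, note in visual_notes:
        # A visual note with no matching audio still gets its own beat.
        beat = beats.setdefault((start, end), Beat(start, end))
        beat.visual = note
    return [beats[key] for key in sorted(beats)]

audio = [(0, 4, "So I'm 47 and my doctor said something..."),
         (4, 9, "He told me that the brain fog...")]
visual = [(0, 4, "Tight selfie, chest-up framing, kitchen background."),
          (4, 9, "Cut to slightly wider medium shot.")]

merged = merge_layers(audio, visual)
for beat in merged:
    print(f"[{beat.start}s to {beat.end}s]")
    print(f"  AUDIO: {beat.audio}")
    print(f"  VISUAL: {beat.visual}")
```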

When to use audio-only vs. with visuals

Situation Audio-only With visuals
Studying hook phrasing Yes
Understanding pacing Yes
Video-copy workflow Yes
Building a creative template Yes
Knowledge ingestion (long-form) Yes
Compliance review of an existing video Yes

Audio-only is faster and cheaper. The visual analysis is an extra pass that adds context — use it when the visual structure matters.

Reading the timestamps

Timestamps appear in hh:mm:ss form, as a bracketed start and end pair. The first timestamp is the start of that beat; the second is the end. The difference is the beat's duration, which maps directly to Veo clip planning:

Scene durations from the transcript:
  Beat 1: 0:00 – 0:04 → 4 seconds  → Veo clip 1 (within 8s limit, OK)
  Beat 2: 0:04 – 0:09 → 5 seconds  → Veo clip 2 (OK)
  Beat 3: 0:09 – 0:14 → 5 seconds  → Veo clip 3 (OK)
  Beat 4: 0:14 – 0:22 → 8 seconds  → Veo clip 4 (right at the limit)
  Beat 5: 0:22 – 0:35 → 13 seconds → DOES NOT FIT in one Veo clip; must split

Any beat over 8 seconds needs to be split into multiple scenes when you build your .nbflow. The transcript is what tells you where the splits should go.
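
The duration check is simple arithmetic, and a short script can run it over a whole transcript. This is a minimal sketch using the 8-second limit stated above; `plan_clips` is a hypothetical helper, and the clip count it reports is only the minimum number of pieces. Where to actually cut is still an editorial call you make from the dialogue.

```python
import math

VEO_CLIP_LIMIT_S = 8  # per-clip limit described in this guide

def plan_clips(beats, limit=VEO_CLIP_LIMIT_S):
    """For each (start, end) beat in seconds, return its duration and
    the minimum number of clips needed to stay within the limit."""
    plan = []
    for start, end in beats:
        duration = end - start
        clips = max(1, math.ceil(duration / limit))
        plan.append({"start": start, "end": end,
                     "duration": duration, "clips": clips})
    return plan

beats = [(0, 4), (4, 9), (9, 14), (14, 22), (22, 35)]
plan = plan_clips(beats)
for number, row in enumerate(plan, start=1):
    status = "OK" if row["clips"] == 1 else f"split into {row['clips']}"
    print(f"Beat {number}: {row['duration']}s -> {status}")
```

With the five beats listed above, only Beat 5 (13 seconds) needs more than one clip.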

Using the transcript for a video-copy workflow

The standard workflow for replicating a viral video uses the transcript as the starting point:

  1. Run whisper with visuals on the source video
  2. Map beats to scenes: each beat becomes one Veo clip, or several clips if it runs over 8s
  3. Note visual signatures per beat for the image gens
  4. Pass to the Visual Planner for storyboarding
  5. Pass to the Script Writer with instructions to ADAPT (not copy) the dialogue

The pipeline's video-copy workflow uses transcripts heavily — see Chapter 12 — Video Copy Workflow.

When transcripts have issues

Background music with vocals
Whisper may confuse music vocals with speech. Manually edit the transcript to remove music lyrics that crept in.
Strong accents / dialect
Whisper handles most accents well but heavy regional dialect can produce errors. Spot-check the transcript against the audio.
Long pauses / silence
Whisper sometimes invents speech in silence (rare but happens). Spot-check any beat that seems suspiciously short or long.
Multiple speakers
Whisper transcribes all speech but doesn't differentiate speakers by default. If the source has two people talking, their lines run together with no speaker labels, so mark who says what by hand before using the transcript.
B-roll segments
B-roll cutaways often have voiceover continuing from the speaker. The transcript captures the audio, which is fine — but the visual analysis will see "B-roll cutaway" not "speaker."
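
Spot-checking goes faster if you know where to look first. The sketch below flags beats whose speech rate looks unusual, which is one rough signal for invented speech or stray lyrics. The thresholds, the `flag_suspect_beats` helper, and the third sample beat are made up for illustration; tune the limits against your own material.

```python
def flag_suspect_beats(beats, min_wps=1.0, max_wps=5.0):
    """Return beats whose speech rate falls outside [min_wps, max_wps].

    beats: iterable of (start_s, end_s, text). The thresholds are
    illustrative guesses, not values taken from Whisper or this pipeline.
    """
    flagged = []
    for start, end, text in beats:
        duration = end - start
        words = len(text.split())
        if duration <= 0:
            flagged.append((start, end, "zero or negative duration"))
            continue
        rate = words / duration  # words per second
        if rate < min_wps:
            flagged.append((start, end, f"slow: {rate:.1f} words/s"))
        elif rate > max_wps:
            flagged.append((start, end, f"fast: {rate:.1f} words/s"))
    return flagged

beats = [
    (0, 4, "So I'm 47 and my doctor said something that completely changed my life."),
    (4, 9, "He told me that the brain fog, the mood swings, and the 3am wake-ups weren't separate problems."),
    (9, 30, "Thank you."),  # invented example: two words spread over 21 seconds
]
flags = flag_suspect_beats(beats)
for start, end, reason in flags:
    print(f"{start}s to {end}s: {reason}")
```

A flag is a prompt to listen to that stretch of audio, not proof of an error.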

Saving the transcript

By convention, save transcripts to:

projects/{month}/{brand}/Assets/{workflow}/transcript.txt

This is the same Assets/ folder where your contact sheets, reference frames, and other source material live. The Visual Planner and Script Writer both look here for source material when building a workflow from a reference video.
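
If you script any part of this, building the path from its parts avoids typos in the folder layout. A minimal sketch, assuming the convention above; the `transcript_path` helper and the sample month and brand values are hypothetical placeholders, since this guide does not specify how `{month}` or `{brand}` are written.

```python
from pathlib import PurePosixPath

def transcript_path(month, brand, workflow):
    """Build projects/{month}/{brand}/Assets/{workflow}/transcript.txt."""
    return (PurePosixPath("projects") / month / brand
            / "Assets" / workflow / "transcript.txt")

# "2025-01" and "acme" are placeholder values for illustration only.
path = transcript_path("2025-01", "acme", "XYZG3")
print(path)
```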

Transcribing your own old videos

Whisper isn't just for competitor research. It works on:

  • Your existing winning videos — pull transcripts of your top-performers to use as templates for new variants
  • Client-provided source material — interviews, presentations, raw footage they want adapted into short-form
  • Long-form videos — podcasts, lectures, masterclasses. Extract usable clips by skimming the transcript.
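
For long-form material, skimming can start with a keyword search over the beats. The sketch below is a hedged illustration: `find_beats` and `as_timestamp` are hypothetical helpers, and the beats are assumed to be (start_seconds, end_seconds, text) tuples you have already pulled from a transcript.

```python
def find_beats(beats, keyword):
    """Return (start_s, end_s, text) beats whose text mentions keyword."""
    needle = keyword.lower()
    return [beat for beat in beats if needle in beat[2].lower()]

def as_timestamp(seconds):
    """Format whole seconds as hh:mm:ss."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

beats = [
    (0, 4, "So I'm 47 and my doctor said something that completely changed my life."),
    (4, 9, "He told me that the brain fog, the mood swings, and the 3am wake-ups weren't separate problems."),
    (9, 14, "They were all connected to one thing. And I had no idea."),
]
hits = find_beats(beats, "brain fog")
for start, end, text in hits:
    print(f"[{as_timestamp(start)} to {as_timestamp(end)}] {text}")
```

The matching timestamps tell you which stretch of the source to pull as a candidate clip.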

When you're ready

Next: Building a Creative Template — synthesize all your research (niche scan + creator scan + transcripts) into a reusable template that the Visual Planner can apply.