Video Copy Workflow

The method for building a workflow by replicating an existing viral video frame-by-frame, instead of designing from scratch. This is the dominant approach when you have a proven template to clone — which, in production, is most of the time.

flowchart TB
    A[1. Save the original video] --> B[2. Extract contact sheets]
    B --> C[3. Map frames to scenes]
    C --> D[4. Extract reference frames per scene]
    D --> E[5. Hand prompts to Image / Veo Prompter]
    E --> F[6. Build .nbflow]
    F --> G[7. Generate]
    G --> H[8. Compare against source - regenerate as needed]

The hard rule

Only generate what exists in the source video. Everything else flows from this.

Match the source's:

  • Scene count
  • Hard cuts and framing shifts (1:1)
  • Duration per scene
  • Visual structure (speaking shots vs B-roll cutaways vs end-frame transitions)

Do NOT add:

  • Cutaways the source doesn't have ("companion close-ups", "B-roll for visual reinforcement", "alternative angles for editing flexibility")
  • Additional scenes ("a stronger hook", "a CTA card", "an ingredient close-up to make it more visual")
  • Prehook clips the source doesn't have
  • End-frame transitions the source doesn't have

If your storyboard produces more shots than the source has, you're inventing content

Stop and re-read the contact sheets. A video-copy workflow that adds scenes isn't a copy — it's a derivative work, and it loses the proven structure that made the source viral in the first place.

Example session — catching the urge to invent

You: copy this remedy reel — link below — but I think we should add
     a hero ingredient close-up before the demo, would punch harder.

Claude:
  Pulling contact sheets from the source... done.
  Counting hard cuts: 4 cuts → 5 scenes, all speaking, no B-roll cutaways.

  An ingredient close-up would be a B-roll cutaway. The source has
  zero B-roll. If I add one, this stops being a copy — it's a derivative
  with our own structural change. Two options:

  1. True copy: 5 speaking scenes, no cutaways. Ships as a Lvl 0
     variant of the source.
  2. Lvl 4 structural change on top of the copy: add the ingredient
     close-up as scene 03b. Goes into testing because it's a structural
     change. Will need its own validation pass.

  Which do you want?

You: 1. Run a clean copy first, we can layer the cutaway later if
     the copy underperforms.

That's the rule in action. The default answer to "should we improve the source" is "copy first, improve in a subsequent V0-N or as a Lvl 3/4 variant."

Step 1: Save the original video

Always download and save the original before any analysis. Links go dead, accounts get deleted. Save first, analyze second.

yt-dlp -f "bv*[height<=480]+ba/b[height<=480]" -o "projects/{month}/{brand}/Assets/{workflow}/source.mp4" "<URL>"

The format selector caps the download at 480p standard definition: enough for contact sheets and reference frames, not so much that the download takes forever. Raise the cap (e.g. height<=1080) if you need higher-resolution reference frames for fine detail (rare).

If yt-dlp can't reach the source (private account, geo-blocked, etc.), use a browser screen recorder to capture it locally. Don't skip this step — without the source file, you can't extract reference frames later.

Step 2: Extract contact sheets

Use ffmpeg to extract a grid of frames sampled at 2fps, tiled 3x3 per image:

mkdir -p projects/{month}/{brand}/Assets/{workflow}/contact_sheets
ffmpeg -i source.mp4 \
  -vf "fps=2,tile=3x3" \
  -q:v 2 \
  projects/{month}/{brand}/Assets/{workflow}/contact_sheets/sheet_%02d.jpg

Output: a series of 3x3 grids (9 frames per image). At 2fps each row covers 1.5 seconds of footage, so each sheet covers 4.5 seconds.

Why 2fps and 3x3

A 60-second video at 2fps = 120 frames = ~13 contact sheets at ~10K tokens each. Total: ~130K tokens for full coverage of a 60s video — manageable for a single agent's context.

For a 30-second video, bump to 4fps: same ~120 frames and token budget, double the temporal resolution. For a 90-second video, drop to 1fps (90 frames, 10 sheets, ~100K tokens) to keep the budget in the same range.
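
To make the sampling decision mechanical, you can derive the fps from the clip duration before extracting. A minimal sketch, assuming the thresholds above and a source.mp4 in the working directory:

# Pick a sampling rate that keeps total frames near ~120 (~13-14 sheets).
# Sketch only: thresholds mirror the guidance above; adjust to taste.
dur=$(ffprobe -v error -show_entries format=duration \
  -of default=noprint_wrappers=1:nokey=1 source.mp4)
fps=2
if (( $(echo "$dur <= 35" | bc -l) )); then fps=4; fi
if (( $(echo "$dur >= 80" | bc -l) )); then fps=1; fi
ffmpeg -i source.mp4 -vf "fps=${fps},tile=3x3" -q:v 2 contact_sheets/sheet_%02d.jpg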

Read every contact sheet

Before any prompt is written, read all contact sheets and build a precise frame-by-frame action breakdown. Note:

  • Where each hard cut happens (timestamp)
  • What changes between cuts (location, framing, subject pose)
  • Which frames are speaking shots vs B-roll cutaways
  • Which shots are static vs which have visible motion
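
The breakdown format is up to you; a plain-text log with one line per scene and one per cut is enough. A sketch (the timestamps, locations, and framing notes below are hypothetical):

cat > breakdown.txt <<'EOF'
scene 01 | 00:00.0-00:04.5 | speaking | static | kitchen, chest-up
cut @ 00:04.5 | location change, framing widens
scene 02 | 00:04.5-00:09.0 | speaking | slow push-in | same kitchen
EOF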

Step 3: Map frames to scenes

A "scene" is the unit between two hard cuts. Each scene becomes one Veo clip.

Count hard cuts in the source

This defines the scene count. A 30-second video with 4 hard cuts produces 5 scenes. Don't add or remove scenes — the source's structure is the structure.
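
ffmpeg's scene-change score can produce a candidate cut list to sanity-check your manual count. A first-pass sketch only: the 0.3 threshold is a guess to tune, and every candidate must be verified against the contact sheets.

# Print timestamps of frames whose scene-change score exceeds 0.3.
ffmpeg -i source.mp4 \
  -vf "select='gt(scene,0.3)',metadata=print" \
  -an -f null - 2>&1 | grep pts_time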

Classify each scene

| Scene type | What it is | How to build |
| --- | --- | --- |
| Speaking scene | Avatar in frame, delivering dialogue | One image gen (start frame, often used as end frame too) + one Veo3 clip |
| B-roll cutaway | Speaker NOT in frame, audio continues over different visual | One image gen + one Veo3 clip. Run manually, not wired into the .nbflow. |
| Transformation clip | Single clip showing a change (before → after, diseased → healthy, empty → full) | Two image gens (start + end frames showing the transformation) + one Veo3 clip with both frames wired in |
| Per-step sequence | Recipe with multiple ingredients, demonstration with steps | One image gen per step + one Veo3 clip per step. Trim each to 2-3s in post. |
| Low-movement scene | Talking head, static framing, minimal motion | Same image as start AND end frame (prevents drift) |

B-roll detection — critical

For video copies specifically: rewatch the source and ask:

Are there hard cuts to clips where the speaker is NOT in frame, with their audio continuing?

  • Yes → those are B-roll. Note them as B-roll scenes.
  • No → the workflow has no B-roll. Set B-roll density to None.

Don't invent B-roll

Adding B-roll because the workflow "would benefit from cutaways" is a video-copy violation. If the source has no B-roll, the copy has no B-roll. Period.

What's not B-roll:

  • A framing shift within the same continuous speaking sequence (camera moves closer, avatar repositions) — that's the same scene
  • A scene cut to a different location where the speaker is still on camera — that's a new speaking scene, not B-roll

Step 4: Extract reference frames per scene

For each scene, pull the single best frame from the original video that captures the target composition. Save them to projects/{month}/{brand}/Assets/{workflow}/reference_frames/scene_NN.jpg.

# Pull a frame at a specific timestamp
ffmpeg -i source.mp4 -ss 00:00:08.5 -vframes 1 -q:v 2 \
  projects/{month}/{brand}/Assets/{workflow}/reference_frames/scene_01.jpg
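
With the cut timestamps from step 3, a small loop covers every scene in one pass. A sketch; the scene numbers and timestamps below are placeholders:

# Batch-extract one reference frame per scene from a scene/timestamp list.
while read -r scene ts; do
  # -nostdin stops ffmpeg from consuming the loop's input stream
  ffmpeg -nostdin -i source.mp4 -ss "$ts" -vframes 1 -q:v 2 \
    "projects/{month}/{brand}/Assets/{workflow}/reference_frames/scene_${scene}.jpg"
done <<'EOF'
01 00:00:01.0
02 00:00:05.2
03 00:00:08.5
EOF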

Use this image as the visual reference when Claude builds prompts. "Make it look like THIS" beats a paragraph of description, especially for:

  • Complex B-roll (organs, recipe close-ups, 3D renders)
  • Specific camera angles you can't describe precisely
  • Lighting setups that defy categorization

Step 5: Prompts get built

For each scene, Claude hands the reference frame + a brief scene description to the right agent:

Image prompts
The Image Prompter reads the reference frame for visual details and emits a JSON prompt. It needs: subject (avatar archetype), wardrobe (per account), environment, framing, key composition notes.
Speaking video prompts
The Veo Prompter uses the universal talking-head template — only the dialogue line varies. It needs: the dialogue line + any pose/motion notes per scene.
B-roll video prompts
Also Veo Prompter. Natural language, no dialogue, ambient audio only. It needs: the reference frame + brief motion description (e.g. "slow pour from kettle into mug").

You don't drive this — once you've approved the scene mapping, prompts get built automatically.

What's happening behind the scenes

The Manager hands each scene's brief to the right sub-agent. The Image Prompter and Veo Prompter both READ the reference frames you extracted — that's why frame extraction matters. They're not working from a description; they're working from the actual visual.

Each agent applies its own expertise (Image Prompter handles the in-frame rule, POV rule, framing language; Veo Prompter handles macro-vs-micro motion, the universal talking-head template). The Manager doesn't pre-write prompts and ask agents to format them — that breaks delegation and produces worse output.

Step 6: Build the .nbflow

The PatchWork Importer assembles all the prompt files into a working .nbflow:

  • All the prompt files from step 5
  • Scene structure (which scenes share reference images, which have start+end frame pairs)
  • Reference image groups (avatar character ref per account, product photo, any other shared refs)

For a multi-account video copy, the PatchWork Importer builds one tab per account in the same .nbflow file. See the Fan-out Protocol for the multi-tab structure.

The output lands at projects/{month}/{brand}/{growth|sales}/testing/{workflow}-V0-1.nbflow.

Step 7: Generate

Run the Generation Runner. Before invoking, run the pre-generation sanity check to catch schema issues that would otherwise surface mid-generation.

Step 8: Compare against the source

This is the step that separates a good copy from a passable one. Open the -generated.nbflow in PatchWork and the source video side-by-side. For each scene, compare:

  • Framing matches source
  • Lighting matches source
  • Subject pose / energy matches source
  • Pacing of motion matches source
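
If flipping between two players makes the comparison slippery, render the clips side by side. A convenience sketch; the generated clip's file name is a placeholder:

# Source and generated clip side by side; heights normalized so hstack can join them.
ffmpeg -i source.mp4 -i generated_scene_01.mp4 -filter_complex \
  "[0:v]scale=-2:720[left];[1:v]scale=-2:720[right];[left][right]hstack=inputs=2" \
  -an side_by_side.mp4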

Where they diverge, identify the cause and regenerate:

| Divergence | Likely cause | Fix |
| --- | --- | --- |
| Framing wrong (different camera angle) | Image prompt didn't pin the framing tightly | Add explicit framing language to image prompt |
| Lighting wrong | Image prompt didn't describe lighting | Add lighting description matching source |
| Subject pose wrong | Pose prompt was too vague | Add explicit pose direction + reference frame |
| Motion too fast / too slow | Veo prompt motion description wrong | Tighten motion description (or add macro motion qualifiers) |
| Avatar's face drifts mid-clip | Start/end frames too different | Use same image as start AND end |

Bump V0-N for each iteration. Aim for visual parity with the source before moving on.

Transformation clips — special handling

When a scene shows a transformation in a single clip (a wound healing, a glass filling, a face aging), you need two image gens (the "before" frame and the "after" frame) wired into one Veo3 node as start frame and end frame.

flowchart LR
    P1[Plain Prompt: before composition] --> NB1[NanobananaAPI<br/>before]
    P2[Plain Prompt: after composition] --> NB2[NanobananaAPI<br/>after]
    NB1 --> A1[Approve]
    NB2 --> A2[Approve]
    PT[Plain Prompt: transformation motion] --> V[Veo3]
    A1 --> V
    A2 --> V
    V --> A3[Approve]

Veo interpolates the visual change between the two frames. The Veo prompt describes the motion (slow fade, gradual transition, dissolve, etc.) but the visual change comes from the frame difference, not the prompt.
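
The before and after compositions come straight from the source: pull one frame near the start of the transformation scene and one near its end. A sketch with placeholder timestamps and scene number:

# Before/after reference frames for a transformation scene.
ffmpeg -i source.mp4 -ss 00:00:12.0 -vframes 1 -q:v 2 reference_frames/scene_04_before.jpg
ffmpeg -i source.mp4 -ss 00:00:18.5 -vframes 1 -q:v 2 reference_frames/scene_04_after.jpg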

Per-step sequences — special handling

For recipes or demonstrations with multiple discrete steps, build one image gen per step and one Veo3 clip per step. Each clip is 8 seconds; trim to 2-3 seconds per step in post-production.

The dynamic prompt mechanism handles this cleanly: one Dynamic Prompt with N rows (one per step) feeds one template → one image gen produces N images → one Veo3 produces N clips. Each clip is trimmed in post.
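
The post trim is a one-liner per clip. A sketch, assuming raw step clips named step_NN_raw.mp4 and a 2.5-second keep duration:

# Keep the first 2.5 seconds of each 8-second step clip.
# Re-encode instead of stream-copying so the cut lands exactly on 2.5s.
for f in step_*_raw.mp4; do
  ffmpeg -i "$f" -t 2.5 -c:v libx264 -c:a aac "${f%_raw.mp4}.mp4"
done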

Compliance with the framing rule

Reference frames extracted from the source dictate the framing of the generated outputs. When you pass a chest-up reference frame to the Image Prompter, the prompt should declare chest-up framing — and then describe only chest-up content. See Image Prompt Rules.

If the source has a tighter crop than what your account's avatar reference sheet shows, that's fine — the avatar ref locks the face, not the body or wardrobe outside the frame.

Checklist

  • Source video downloaded and saved to Assets/{workflow}/source.mp4
  • Contact sheets extracted at 2fps
  • All contact sheets read; frame-by-frame action breakdown built
  • Hard cuts counted; scene count locked to source
  • Each scene classified (speaking / B-roll / transformation / per-step / low-movement)
  • B-roll presence confirmed (or confirmed absent — density set accordingly)
  • Reference frame extracted per scene
  • Image and Veo prompts built, with reference frames attached
  • .nbflow built and saved as V0-1.nbflow
  • Pre-generation sanity check passes
  • Generation Runner completes
  • Outputs reviewed against source — divergences flagged and regenerated
  • Iteration complete; V0-N clean on test account
  • Fan-out to other accounts via Fan-out Protocol

What NOT to do

  • Generate before reading the contact sheets in full
  • Add scenes the source doesn't have
  • Add B-roll where the source has none
  • Use start-only frames for transformation clips
  • Skip reference frame extraction (description alone is rarely tight enough)
  • Trim post-generation when you could have trimmed at the storyboard stage (saves Veo budget)
  • Treat a video copy as a starting point for "improvements" — copy first, improve in subsequent V0-N iterations or as a Lvl 3/4 variant

When you're ready

Next: Compliance Audit. Whatever you build, run a compliance check before fan-out. Banned-word slip-ups can tank a video on TikTok Shop.