Video Copy Workflow

The method for building a workflow by replicating an existing viral video frame-by-frame, instead of designing from scratch. This is the dominant approach when you have a proven template to clone.

flowchart TB
    A[1. Save the original video] --> B[2. Extract contact sheets]
    B --> C[3. Map frames to scenes]
    C --> D[4. Extract reference frames per scene]
    D --> E[5. Delegate prompts to Image / Veo Prompter]
    E --> F[6. Build .nbflow]
    F --> G[7. Generate]
    G --> H[8. Compare against source - regenerate as needed]

The hard rule

Only generate what exists in the source video. This is the foundational constraint that everything else follows from.

Match the source's:

  • Scene count
  • Hard cuts and framing shifts (1:1)
  • Duration per scene
  • Visual structure (speaking shots vs B-roll cutaways vs end-frame transitions)

Do NOT add:

  • Cutaways that aren't in the source ("companion close-ups", "B-roll for visual reinforcement", "alternative angles for editing flexibility")
  • Additional scenes ("a stronger hook", "a CTA card", "an ingredient close-up to make it more visual")
  • Prehook clips the source doesn't have
  • End frame transitions the source doesn't have

If your storyboard produces more shots than the source has, you're inventing content.

Stop and re-read the contact sheets. A video-copy workflow that adds scenes is not a copy — it's a derivative work, and it loses the proven structure that made the source viral in the first place.

Step 1: Save the original video

Always download and save the original before any analysis. Links go dead, accounts get deleted. Save first, analyze second.

yt-dlp -f "sd" -o "projects/{month}/{brand}/Assets/{workflow}/source.mp4" "<URL>"

The -f "sd" selector picks a reasonable standard-definition quality: enough for contact sheets and reference frames, not so much that the download takes forever. Note that "sd" and "hd" are format IDs that only some extractors expose; if yt-dlp rejects them, fall back to a filter like -f "best[height<=480]". Bump to -f "hd" (or "best[height<=720]") if you need higher-resolution reference frames for fine detail (rare).

If yt-dlp can't reach the source (private account, geo-blocked, etc.), use a browser screen recorder to capture it locally. Don't skip this step — without the source file, you can't extract reference frames later.

Step 2: Extract contact sheets

Use ffmpeg to extract a grid of frames sampled at 2fps, tiled 3x3 per image:

mkdir -p projects/{month}/{brand}/Assets/{workflow}/contact_sheets
ffmpeg -i source.mp4 \
  -vf "fps=2,tile=3x3" \
  -q:v 2 \
  projects/{month}/{brand}/Assets/{workflow}/contact_sheets/sheet_%02d.jpg

Output: a series of 3x3 grids (9 frames per image). At 2fps, each row covers 1.5 seconds and each full sheet covers 4.5 seconds of footage.

Why 2fps and 3x3

A 60-second video at 2fps = 120 frames = ~13 contact sheets at ~10K tokens each. Total: ~130K tokens for full coverage of a 60s video — manageable for a single agent's context.

If you're working with a 30-second video, you can bump to 4fps (120 frames, ~13 sheets, ~130K tokens; the same budget as a 60s video at 2fps, with double the temporal resolution). For a 90-second video, drop to 1.5fps (135 frames, 15 sheets, ~150K tokens) instead of the ~200K that 2fps would cost.
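
The sampling arithmetic generalizes; a minimal sketch, where the ~10K tokens-per-sheet figure is this doc's own estimate, not a measured value:

```python
import math

def contact_sheet_budget(duration_s, fps, frames_per_sheet=9, tokens_per_sheet=10_000):
    """Estimate sheet count and rough token cost for a given sampling rate.

    frames_per_sheet=9 matches the 3x3 tile; tokens_per_sheet is an assumption.
    """
    frames = math.ceil(duration_s * fps)
    sheets = math.ceil(frames / frames_per_sheet)  # ceil: a partial sheet is still a sheet
    return sheets, sheets * tokens_per_sheet

# 60s at 2fps: 120 frames, which ceil rounds up to 14 sheets
sheets, tokens = contact_sheet_budget(60, 2)
```

Use it to pick an fps that keeps total tokens inside a single agent's context before running ffmpeg.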

Read every contact sheet

Before writing any prompts, read all contact sheets and build a precise frame-by-frame action breakdown. Note:

  • Where each hard cut happens (timestamp)
  • What changes between cuts (location, framing, subject pose)
  • Which frames are speaking shots vs B-roll cutaways
  • Which shots are static vs which have visible motion

Step 3: Map frames to scenes

A "scene" is the unit between two hard cuts. Each scene becomes one Veo clip.

Count hard cuts in the source

This defines the scene count. A 30-second video with 4 hard cuts produces 5 scenes. Don't add or remove scenes — the source's structure is the structure.
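
The cut-to-scene rule is mechanical enough to sketch; the cut timestamps below are hypothetical placeholders, not from any real source:

```python
def scenes_from_cuts(duration_s, cut_timestamps):
    """N hard cuts -> N+1 scenes, each a (start, end) span in seconds."""
    bounds = [0.0] + sorted(cut_timestamps) + [float(duration_s)]
    return list(zip(bounds, bounds[1:]))

# A 30-second video with 4 hard cuts produces 5 scenes.
scenes = scenes_from_cuts(30, [5.0, 12.0, 20.0, 26.0])
```

Each (start, end) span also gives you the per-scene duration to match, and a timestamp range to pull the reference frame from in step 4.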

Classify each scene

| Scene type | What it is | How to build |
| --- | --- | --- |
| Speaking scene | Avatar in frame, delivering dialogue | One image gen (start frame, often reused as the end frame) + one Veo3 clip |
| B-roll cutaway | Speaker NOT in frame, audio continues over a different visual | One image gen + one Veo3 clip. Run manually, not wired into the .nbflow. |
| Transformation clip | Single clip showing a change (before → after, diseased → healthy, empty → full) | Two image gens (start + end frames showing the transformation) + one Veo3 clip with both frames wired in |
| Per-step sequence | Recipe with multiple ingredients, demonstration with steps | One image gen per step + one Veo3 clip per step. Trim each to 2-3s in post. |
| Low-movement scene | Talking head, static framing, minimal motion | Same image as start AND end frame (prevents drift) |

B-roll detection — critical

For video copies specifically: rewatch the source and ask:

Are there hard cuts to clips where the speaker is NOT in frame, with their audio continuing?

  • Yes → those are B-roll. Note them as B-roll scenes.
  • No → the workflow has no B-roll. Set B-roll density to None.

Don't invent B-roll

Inventing B-roll because the workflow "would benefit from cutaways" is a video-copy violation. If the source has no B-roll, the copy has no B-roll. Period.

What's NOT B-roll:

  • A framing shift within the same continuous speaking sequence (camera moves closer, avatar repositions) — that's the same scene
  • A scene cut to a different location where the speaker is still on camera — that's a new speaking scene, not B-roll
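
The three rules above collapse into one decision function. A sketch only: the inputs come from your own read of the contact sheets, not from any automated detector:

```python
def classify_transition(speaker_in_frame_after, same_location, audio_continues):
    """Classify what follows a cut or framing change, per the B-roll rules."""
    if not speaker_in_frame_after and audio_continues:
        return "b-roll cutaway"          # speaker gone, their audio carries over
    if speaker_in_frame_after and not same_location:
        return "new speaking scene"      # speaker on camera in a new location
    return "framing shift, same scene"   # camera moved within the same sequence
```

If no transition in the source ever classifies as "b-roll cutaway", the workflow has no B-roll and the density stays at None.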

Step 4: Extract reference frames per scene

For each scene, pull the single best frame from the original video that captures the target composition. Save them to projects/{month}/{brand}/Assets/{workflow}/reference_frames/scene_NN.jpg.

# Pull a frame at a specific timestamp
ffmpeg -i source.mp4 -ss 00:00:08.5 -vframes 1 -q:v 2 \
  projects/{month}/{brand}/Assets/{workflow}/reference_frames/scene_01.jpg

Use this image as the visual reference when delegating prompts. "Make it look like THIS" beats a paragraph of description, especially for:

  • Complex B-roll (organs, recipe close-ups, 3D renders)
  • Specific camera angles you can't describe precisely
  • Lighting setups that defy categorization
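
The one-off command above scales to one grab per scene. A sketch that only builds each ffmpeg argv (so the mapping can be checked without ffmpeg installed); the scene/timestamp pairs are hypothetical:

```python
def frame_grab_argv(src, timestamp, out_path):
    """Mirror the single-frame ffmpeg command above as an argv list."""
    return ["ffmpeg", "-i", src, "-ss", timestamp,
            "-vframes", "1", "-q:v", "2", out_path]

# Hypothetical timestamps; the real ones come from your contact-sheet read.
scene_timestamps = {"scene_01": "00:00:08.5", "scene_02": "00:00:14.0"}
cmds = [frame_grab_argv("source.mp4", ts, f"reference_frames/{s}.jpg")
        for s, ts in scene_timestamps.items()]
# Run each with subprocess.run(cmd, check=True) once ffmpeg is available.
```

Keeping the timestamp map in one place also documents which source moment each generated scene is copying.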

Step 5: Delegate prompts

For each scene, delegate to the right agent with the reference frame + a brief scene description:

Image prompts → Image Prompter
Pass the reference frame plus a brief description: subject (avatar archetype), wardrobe (per account), environment, framing, key composition notes. The agent READS the reference frame for visual details.

Speaking video prompts → Veo Prompter
Speaking scenes use the universal talking head template; only the dialogue line varies. Pass: the dialogue line + any pose/motion notes per scene.

B-roll video prompts → Veo Prompter
Natural language, no dialogue, ambient audio only. Pass: the reference frame + a brief motion description (e.g. "slow pour from kettle into mug").

Step 6: Build the .nbflow

Hand off to the PatchWork Importer with:

  • All the prompt files from step 5
  • Scene structure (which scenes share reference images, which have start+end frame pairs)
  • Reference image groups (avatar character ref per account, product photo, any other shared refs)

For a multi-account video copy, the PatchWork Importer builds one tab per account in the same .nbflow file. See Fan-out Protocol.

Save the .nbflow to projects/{month}/{brand}/{growth|sales}/testing/{workflow}-V0-1.nbflow.

Step 7: Generate

Run the Generation Runner. Before invoking, run the pre-generation sanity check to catch schema issues that would otherwise surface mid-generation.

Step 8: Compare against the source

This is the step that separates a good copy from a passable one. Open the -generated.nbflow in PatchWork and the source video side-by-side. For each scene, compare:

  • Framing matches source
  • Lighting matches source
  • Subject pose / energy matches source
  • Pacing of motion matches source

Where they diverge, identify the cause and regenerate:

| Divergence | Likely cause | Fix |
| --- | --- | --- |
| Framing wrong (different camera angle) | Image prompt didn't pin the framing tightly | Add explicit framing language to the image prompt |
| Lighting wrong | Image prompt didn't describe lighting | Add a lighting description matching the source |
| Subject pose wrong | Pose prompt was too vague | Add explicit pose direction + the reference frame |
| Motion too fast / too slow | Veo prompt motion description wrong | Tighten the motion description (or add macro motion qualifiers) |
| Avatar's face drifts mid-clip | Start/end frames too different | Use the same image as start AND end |

Bump V0-N for each iteration. Aim for visual parity with the source before moving on.

Transformation clips — special handling

When a scene shows a transformation in a single clip (a wound healing, a glass filling, a face aging), you need two image gens (the "before" frame and the "after" frame) wired into one Veo3 node as start frame and end frame.

flowchart LR
    P1[Plain Prompt: before composition] --> NB1[NanobananaAPI<br/>before]
    P2[Plain Prompt: after composition] --> NB2[NanobananaAPI<br/>after]
    NB1 --> A1[Approve]
    NB2 --> A2[Approve]
    PT[Plain Prompt: transformation motion] --> V[Veo3]
    A1 --> V
    A2 --> V
    V --> A3[Approve]

Veo interpolates the visual change between the two frames. The Veo prompt describes the motion (slow fade, gradual transition, dissolve, etc.) but the visual change comes from the frame difference, not the prompt.
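
The frame-pair rules (a transformation needs two distinct frames; a low-movement scene reuses one) make a cheap pre-generation check. A sketch; the scene type names match the classification table above:

```python
def frame_pair_ok(scene_type, start_frame, end_frame):
    """Sanity-check start/end frame wiring before generating."""
    if scene_type == "transformation":
        return start_frame != end_frame   # needs two distinct image gens
    if scene_type == "low-movement":
        return start_frame == end_frame   # same image prevents face drift
    return True                           # other types: no pairing constraint
```

Running this over every scene before step 7 catches the start-only transformation mistake called out in the "What NOT to do" list.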

Per-step sequences — special handling

For recipes or demonstrations with multiple discrete steps, build one image gen per step and one Veo3 clip per step. Each clip is 8 seconds; trim to 2-3 seconds per step in post-production.

The dynamic prompt mechanism handles this cleanly: one Dynamic Prompt with N rows (one per step) feeds one template → one image gen produces N images → one Veo3 produces N clips. Each clip is trimmed in post.
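
The N-rows-through-one-template mechanism in plain terms; both the template text and the steps below are hypothetical placeholders:

```python
# One Dynamic Prompt table: N rows, one per step.
steps = ["cracking two eggs into a bowl", "whisking until smooth", "pouring into a hot pan"]

# One template; {step} is the only variable slot.
template = "Overhead close-up of {step}, soft natural daylight, 9:16"

# N rows -> N image prompts -> N images -> N Veo3 clips (trimmed to 2-3s in post).
image_prompts = [template.format(step=s) for s in steps]
```

One template guarantees the per-step shots stay visually consistent; only the row content varies.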

Compliance with the framing rule

Reference frames extracted from the source dictate the framing of the generated outputs. When you pass a chest-up reference frame to the Image Prompter, the prompt should declare chest-up framing — and then describe only chest-up content. See Image Prompt Rules.

If the source has a tighter crop than what your account's avatar reference sheet shows, that's fine — the avatar ref locks the face, not the body or wardrobe outside the frame.

Checklist

  • Source video downloaded and saved to Assets/{workflow}/source.mp4
  • Contact sheets extracted at 2fps
  • All contact sheets read; frame-by-frame action breakdown built
  • Hard cuts counted; scene count locked to source
  • Each scene classified (speaking / B-roll / transformation / per-step / low-movement)
  • B-roll presence confirmed (or confirmed absent — density set accordingly)
  • Reference frame extracted per scene
  • Image and Veo prompts delegated, with reference frames attached
  • .nbflow built and saved as V0-1.nbflow
  • Pre-generation sanity check passes
  • Generation Runner completes
  • Outputs reviewed against source — divergences flagged and regenerated
  • Iteration complete; V0-N clean on test account
  • Fan-out to other accounts via Fan-out Protocol

What NOT to do

  • Generate before reading the contact sheets in full
  • Add scenes the source doesn't have
  • Add B-roll where the source has none
  • Use start-only frames for transformation clips
  • Skip reference frame extraction (description alone is rarely tight enough)
  • Trim post-generation when you could have trimmed at the storyboard stage (saves Veo budget)
  • Treat a video copy as a starting point for "improvements" — copy first, improve in subsequent V0-N iterations if needed