Video Copy Workflow

The method for building a workflow by replicating an existing viral video frame-by-frame, instead of designing from scratch. This is the dominant approach when you have a proven template to clone.

flowchart TB
    A[1. Save the original video] --> B[2. Extract contact sheets]
    B --> C[3. Map frames to scenes]
    C --> D[4. Extract reference frames per scene]
    D --> E[5. Delegate prompts to Image / Veo Prompter]
    E --> F[6. Build .nbflow]
    F --> G[7. Generate]
    G --> H[8. Compare against source - regenerate as needed]

The hard rule

Only generate what exists in the source video. This is the foundational constraint that everything else follows from.

Match the source's:

  • Scene count
  • Hard cuts and framing shifts (1:1)
  • Duration per scene
  • Visual structure (speaking shots vs B-roll cutaways vs end-frame transitions)

Do NOT add:

  • Cutaways that aren't in the source ("companion close-ups", "B-roll for visual reinforcement", "alternative angles for editing flexibility")
  • Additional scenes ("a stronger hook", "a CTA card", "an ingredient close-up to make it more visual")
  • Prehook clips the source doesn't have
  • End frame transitions the source doesn't have

If your storyboard produces more shots than the source has, you're inventing content.

Stop and re-read the contact sheets. A video-copy workflow that adds scenes is not a copy — it's a derivative work, and it loses the proven structure that made the source viral in the first place.

Step 1: Save the original video

Always download and save the original before any analysis. Links go dead, accounts get deleted. Save first, analyze second.

yt-dlp -f "sd" -o "projects/{month}/{brand}/Assets/{workflow}/source.mp4" "<URL>"

The -f "sd" selector picks a reasonable standard-definition quality: enough for contact sheets and reference frames, not so much that the download takes forever. Note that "sd" and "hd" are format IDs that only some extractors expose; if yt-dlp rejects them, fall back to a filter like -f "best[height<=480]". Bump to -f "hd" (or "best[height<=720]") if you need higher-resolution reference frames for fine detail (rare).

If yt-dlp can't reach the source (private account, geo-blocked, etc.), use a browser screen recorder to capture it locally. Don't skip this step — without the source file, you can't extract reference frames later.

Step 2: Extract contact sheets

Use ffmpeg to extract a grid of frames sampled at 2fps, tiled 3x3 per image:

mkdir -p projects/{month}/{brand}/Assets/{workflow}/contact_sheets
ffmpeg -i source.mp4 \
  -vf "fps=2,tile=3x3" \
  -q:v 2 \
  projects/{month}/{brand}/Assets/{workflow}/contact_sheets/sheet_%02d.jpg

Output: a series of 3x3 grids (9 frames per image). At 2fps, each row covers 1.5 seconds and each full sheet covers 4.5 seconds of footage.

Why 2fps and 3x3

A 60-second video at 2fps = 120 frames = ~13 contact sheets at ~10K tokens each. Total: ~130K tokens for full coverage of a 60s video — manageable for a single agent's context.

If you're working with a 30-second video, you can bump to 4fps (120 frames, ~13 sheets, ~130K tokens; the same budget as a 60s video at 2fps, with double the temporal resolution). For a 90-second video, drop to 1.5fps (135 frames, 15 sheets, ~150K tokens) instead of the ~200K that 2fps would cost.
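
The sampling arithmetic generalizes; a minimal sketch, where the ~10K tokens-per-sheet figure is this doc's own estimate, not a measured value:

```python
import math

def contact_sheet_budget(duration_s, fps, frames_per_sheet=9, tokens_per_sheet=10_000):
    """Estimate sheet count and rough token cost for a given sampling rate.

    frames_per_sheet=9 matches the 3x3 tile; tokens_per_sheet is an assumption.
    """
    frames = math.ceil(duration_s * fps)
    sheets = math.ceil(frames / frames_per_sheet)  # ceil: a partial sheet is still a sheet
    return sheets, sheets * tokens_per_sheet

# 60s at 2fps: 120 frames, which ceil rounds up to 14 sheets
sheets, tokens = contact_sheet_budget(60, 2)
```

Use it to pick an fps that keeps total tokens inside a single agent's context before running ffmpeg.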

Read every contact sheet

Before writing any prompts, read all contact sheets and build a precise frame-by-frame action breakdown. Note:

  • Where each hard cut happens (timestamp)
  • What changes between cuts (location, framing, subject pose)
  • Which frames are speaking shots vs B-roll cutaways
  • Which shots are static vs which have visible motion

Step 3: Map frames to scenes

A "scene" is the unit between two hard cuts. Each scene becomes one Veo clip.

Count hard cuts in the source

This defines the scene count. A 30-second video with 4 hard cuts produces 5 scenes. Don't add or remove scenes — the source's structure is the structure.
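
The cut-to-scene rule is mechanical enough to sketch; the cut timestamps below are hypothetical placeholders, not from any real source:

```python
def scenes_from_cuts(duration_s, cut_timestamps):
    """N hard cuts -> N+1 scenes, each a (start, end) span in seconds."""
    bounds = [0.0] + sorted(cut_timestamps) + [float(duration_s)]
    return list(zip(bounds, bounds[1:]))

# A 30-second video with 4 hard cuts produces 5 scenes.
scenes = scenes_from_cuts(30, [5.0, 12.0, 20.0, 26.0])
```

Each (start, end) span also gives you the per-scene duration to match, and a timestamp range to pull the reference frame from in step 4.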

Classify each scene

| Scene type | What it is | How to build |
| --- | --- | --- |
| Speaking scene | Avatar in frame, delivering dialogue | One image gen (start frame, often reused as the end frame) + one Veo3 clip |
| B-roll cutaway | Speaker NOT in frame, audio continues over a different visual | One image gen + one Veo3 clip. Run manually, not wired into the .nbflow. |
| Transformation clip | Single clip showing a change (before → after, diseased → healthy, empty → full) | Two image gens (start + end frames showing the transformation) + one Veo3 clip with both frames wired in |
| Per-step sequence | Recipe with multiple ingredients, demonstration with steps | One image gen per step + one Veo3 clip per step. Trim each to 2-3s in post. |
| Low-movement scene | Talking head, static framing, minimal motion | Same image as start AND end frame (prevents drift) |

B-roll detection — critical

For video copies specifically: rewatch the source and ask:

Are there hard cuts to clips where the speaker is NOT in frame, with their audio continuing?

  • Yes → those are B-roll. Note them as B-roll scenes.
  • No → the workflow has no B-roll. Set B-roll density to None.

Don't invent B-roll

Inventing B-roll because the workflow "would benefit from cutaways" is a video-copy violation. If the source has no B-roll, the copy has no B-roll. Period.

What's NOT B-roll:

  • A framing shift within the same continuous speaking sequence (camera moves closer, avatar repositions) — that's the same scene
  • A scene cut to a different location where the speaker is still on camera — that's a new speaking scene, not B-roll
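
The three rules above collapse into one decision function. A sketch only: the inputs come from your own read of the contact sheets, not from any automated detector:

```python
def classify_transition(speaker_in_frame_after, same_location, audio_continues):
    """Classify what follows a cut or framing change, per the B-roll rules."""
    if not speaker_in_frame_after and audio_continues:
        return "b-roll cutaway"          # speaker gone, their audio carries over
    if speaker_in_frame_after and not same_location:
        return "new speaking scene"      # speaker on camera in a new location
    return "framing shift, same scene"   # camera moved within the same sequence
```

If no transition in the source ever classifies as "b-roll cutaway", the workflow has no B-roll and the density stays at None.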

Step 4: Extract reference frames per scene

For each scene, pull the single best frame from the original video that captures the target composition. Save them to projects/{month}/{brand}/Assets/{workflow}/reference_frames/scene_NN.jpg.

# Pull a frame at a specific timestamp
ffmpeg -i source.mp4 -ss 00:00:08.5 -vframes 1 -q:v 2 \
  projects/{month}/{brand}/Assets/{workflow}/reference_frames/scene_01.jpg

Use this image as the visual reference when delegating prompts. "Make it look like THIS" beats a paragraph of description, especially for:

  • Complex B-roll (organs, recipe close-ups, 3D renders)
  • Specific camera angles you can't describe precisely
  • Lighting setups that defy categorization
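
The one-off command above scales to one grab per scene. A sketch that only builds each ffmpeg argv (so the mapping can be checked without ffmpeg installed); the scene/timestamp pairs are hypothetical:

```python
def frame_grab_argv(src, timestamp, out_path):
    """Mirror the single-frame ffmpeg command above as an argv list."""
    return ["ffmpeg", "-i", src, "-ss", timestamp,
            "-vframes", "1", "-q:v", "2", out_path]

# Hypothetical timestamps; the real ones come from your contact-sheet read.
scene_timestamps = {"scene_01": "00:00:08.5", "scene_02": "00:00:14.0"}
cmds = [frame_grab_argv("source.mp4", ts, f"reference_frames/{s}.jpg")
        for s, ts in scene_timestamps.items()]
# Run each with subprocess.run(cmd, check=True) once ffmpeg is available.
```

Keeping the timestamp map in one place also documents which source moment each generated scene is copying.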

Step 5: Delegate prompts

For each scene, delegate to the right agent with the reference frame + a brief scene description:

Image prompts → Image Prompter
Pass the reference frame plus a brief description: subject (avatar archetype), wardrobe (per account), environment, framing, key composition notes. The agent READS the reference frame for visual details.

Speaking video prompts → Veo Prompter
Speaking scenes use the universal talking head template; only the dialogue line varies. Pass: the dialogue line + any pose/motion notes per scene.

B-roll video prompts → Veo Prompter
Natural language, no dialogue, ambient audio only. Pass: the reference frame + a brief motion description (e.g. "slow pour from kettle into mug").

Step 6: Build the .nbflow

Hand off to the PatchWork Importer with:

  • All the prompt files from step 5
  • Scene structure (which scenes share reference images, which have start+end frame pairs)
  • Reference image groups (avatar character ref per account, product photo, any other shared refs)

For a multi-account video copy, the PatchWork Importer builds one tab per account in the same .nbflow file. See Fan-out Protocol.

Save the .nbflow to projects/{month}/{brand}/{growth|sales}/testing/{workflow}-V0-1.nbflow.

Step 7: Generate

Run the Generation Runner. Before invoking, run the pre-generation sanity check to catch schema issues that would otherwise surface mid-generation.

Step 8: Compare against the source

This is the step that separates a good copy from a passable one. Open the -generated.nbflow in PatchWork and the source video side-by-side. For each scene, compare:

  • Framing matches source
  • Lighting matches source
  • Subject pose / energy matches source
  • Pacing of motion matches source

Where they diverge, identify the cause and regenerate:

| Divergence | Likely cause | Fix |
| --- | --- | --- |
| Framing wrong (different camera angle) | Image prompt didn't pin the framing tightly | Add explicit framing language to the image prompt |
| Lighting wrong | Image prompt didn't describe lighting | Add a lighting description matching the source |
| Subject pose wrong | Pose prompt was too vague | Add explicit pose direction + the reference frame |
| Motion too fast / too slow | Veo prompt motion description wrong | Tighten the motion description (or add macro motion qualifiers) |
| Avatar's face drifts mid-clip | Start/end frames too different | Use the same image as start AND end |

Bump V0-N for each iteration. Aim for visual parity with the source before moving on.

Transformation clips — special handling

When a scene shows a transformation in a single clip (a wound healing, a glass filling, a face aging), you need two image gens (the "before" frame and the "after" frame) wired into one Veo3 node as start frame and end frame.

flowchart LR
    P1[Plain Prompt: before composition] --> NB1[NanobananaAPI<br/>before]
    P2[Plain Prompt: after composition] --> NB2[NanobananaAPI<br/>after]
    NB1 --> A1[Approve]
    NB2 --> A2[Approve]
    PT[Plain Prompt: transformation motion] --> V[Veo3]
    A1 --> V
    A2 --> V
    V --> A3[Approve]

Veo interpolates the visual change between the two frames. The Veo prompt describes the motion (slow fade, gradual transition, dissolve, etc.) but the visual change comes from the frame difference, not the prompt.
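
The frame-pair rules (a transformation needs two distinct frames; a low-movement scene reuses one) make a cheap pre-generation check. A sketch; the scene type names match the classification table above:

```python
def frame_pair_ok(scene_type, start_frame, end_frame):
    """Sanity-check start/end frame wiring before generating."""
    if scene_type == "transformation":
        return start_frame != end_frame   # needs two distinct image gens
    if scene_type == "low-movement":
        return start_frame == end_frame   # same image prevents face drift
    return True                           # other types: no pairing constraint
```

Running this over every scene before step 7 catches the start-only transformation mistake called out in the "What NOT to do" list.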

Per-step sequences — special handling

For recipes or demonstrations with multiple discrete steps, build one image gen per step and one Veo3 clip per step. Each clip is 8 seconds; trim to 2-3 seconds per step in post-production.

The dynamic prompt mechanism handles this cleanly: one Dynamic Prompt with N rows (one per step) feeds one template → one image gen produces N images → one Veo3 produces N clips. Each clip is trimmed in post.
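
The N-rows-through-one-template mechanism in plain terms; both the template text and the steps below are hypothetical placeholders:

```python
# One Dynamic Prompt table: N rows, one per step.
steps = ["cracking two eggs into a bowl", "whisking until smooth", "pouring into a hot pan"]

# One template; {step} is the only variable slot.
template = "Overhead close-up of {step}, soft natural daylight, 9:16"

# N rows -> N image prompts -> N images -> N Veo3 clips (trimmed to 2-3s in post).
image_prompts = [template.format(step=s) for s in steps]
```

One template guarantees the per-step shots stay visually consistent; only the row content varies.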

Compliance with the framing rule

Reference frames extracted from the source dictate the framing of the generated outputs. When you pass a chest-up reference frame to the Image Prompter, the prompt should declare chest-up framing — and then describe only chest-up content. See Image Prompt Rules.

If the source has a tighter crop than what your account's avatar reference sheet shows, that's fine — the avatar ref locks the face, not the body or wardrobe outside the frame.

Checklist

  • Source video downloaded and saved to Assets/{workflow}/source.mp4
  • Contact sheets extracted at 2fps
  • All contact sheets read; frame-by-frame action breakdown built
  • Hard cuts counted; scene count locked to source
  • Each scene classified (speaking / B-roll / transformation / per-step / low-movement)
  • B-roll presence confirmed (or confirmed absent — density set accordingly)
  • Reference frame extracted per scene
  • Image and Veo prompts delegated, with reference frames attached
  • .nbflow built and saved as V0-1.nbflow
  • Pre-generation sanity check passes
  • Generation Runner completes
  • Outputs reviewed against source — divergences flagged and regenerated
  • Iteration complete; V0-N clean on test account
  • Fan-out to other accounts via Fan-out Protocol

What NOT to do

  • Generate before reading the contact sheets in full
  • Add scenes the source doesn't have
  • Add B-roll where the source has none
  • Use start-only frames for transformation clips
  • Skip reference frame extraction (description alone is rarely tight enough)
  • Trim post-generation when you could have trimmed at the storyboard stage (saves Veo budget)
  • Treat a video copy as a starting point for "improvements" — copy first, improve in subsequent V0-N iterations if needed