Video Copy Workflow

The method for building a workflow by replicating an existing viral video frame-by-frame, instead of designing from scratch. This is the dominant approach when you have a proven template to clone — which, in production, is most of the time.

flowchart TB
    A[1. Save the original video] --> B[2. Extract contact sheets]
    B --> C[3. Map frames to scenes]
    C --> D[4. Extract reference frames per scene]
    D --> E[5. Hand prompts to Image / Veo Prompter]
    E --> F[6. Build .nbflow]
    F --> G[7. Generate]
    G --> H[8. Compare against source - regenerate as needed]

The hard rule

Only generate what exists in the source video. Everything else flows from this.

Match the source's:

  • Scene count
  • Hard cuts and framing shifts (1:1)
  • Duration per scene
  • Visual structure (speaking shots vs B-roll cutaways vs end-frame transitions)

Do NOT add:

  • Cutaways the source doesn't have ("companion close-ups", "B-roll for visual reinforcement", "alternative angles for editing flexibility")
  • Additional scenes ("a stronger hook", "a CTA card", "an ingredient close-up to make it more visual")
  • Prehook clips the source doesn't have
  • End-frame transitions the source doesn't have

If your storyboard produces more shots than the source has, you're inventing content

Stop and re-read the contact sheets. A video-copy workflow that adds scenes isn't a copy — it's a derivative work, and it loses the proven structure that made the source viral in the first place.

Example session — catching the urge to invent

You: copy this remedy reel — link below — but I think we should add
     a hero ingredient close-up before the demo, would punch harder.

Claude:
  Pulling contact sheets from the source... done.
  Counting hard cuts: 4 cuts → 5 scenes, all speaking, no B-roll cutaways.

  An ingredient close-up would be a B-roll cutaway. The source has
  zero B-roll. If I add one, this stops being a copy — it's a derivative
  with our own structural change. Two options:

  1. True copy: 5 speaking scenes, no cutaways. Ships as a Lvl 0
     variant of the source.
  2. Lvl 4 structural change on top of the copy: add the ingredient
     close-up as scene 03b. Goes into testing because it's a structural
     change. Will need its own validation pass.

  Which do you want?

You: 1. Run a clean copy first, we can layer the cutaway later if
     the copy underperforms.

That's the rule in action. The default answer to "should we improve the source" is "copy first, improve in a subsequent V0-N or as a Lvl 3/4 variant."

Step 1: Save the original video

Always download and save the original before any analysis. Links go dead, accounts get deleted. Save first, analyze second.

yt-dlp -f "bv*[height<=480]+ba/b[height<=480]" -o "projects/{month}/{brand}/Assets/{workflow}/source.mp4" "<URL>"

The format selector caps the download at 480p standard definition: enough for contact sheets and reference frames, not so much that the download takes forever. Raise the cap (e.g. height<=1080) if you need higher-resolution reference frames for fine detail (rare).

If yt-dlp can't reach the source (private account, geo-blocked, etc.), use a browser screen recorder to capture it locally. Don't skip this step — without the source file, you can't extract reference frames later.

Step 2: Extract contact sheets

Use ffmpeg to extract a grid of frames sampled at 2fps, tiled 3x3 per image:

mkdir -p projects/{month}/{brand}/Assets/{workflow}/contact_sheets
ffmpeg -i source.mp4 \
  -vf "fps=2,tile=3x3" \
  -q:v 2 \
  projects/{month}/{brand}/Assets/{workflow}/contact_sheets/sheet_%02d.jpg

Output: a series of 3x3 grids (9 frames per image). At 2fps each row covers 1.5 seconds of footage, so each sheet covers 4.5 seconds.

Why 2fps and 3x3

A 60-second video at 2fps = 120 frames = ~13 contact sheets at ~10K tokens each. Total: ~130K tokens for full coverage of a 60s video — manageable for a single agent's context.

For a 30-second video, bump to 4fps: same ~120 frames and token budget, double the temporal resolution. For a 90-second video, drop to 1fps (90 frames, 10 sheets, ~100K tokens) to keep the budget in the same range.
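
To make the sampling decision mechanical, you can derive the fps from the clip duration before extracting. A minimal sketch, assuming the thresholds above and a source.mp4 in the working directory:

# Pick a sampling rate that keeps total frames near ~120 (~13-14 sheets).
# Sketch only: thresholds mirror the guidance above; adjust to taste.
dur=$(ffprobe -v error -show_entries format=duration \
  -of default=noprint_wrappers=1:nokey=1 source.mp4)
fps=2
if (( $(echo "$dur <= 35" | bc -l) )); then fps=4; fi
if (( $(echo "$dur >= 80" | bc -l) )); then fps=1; fi
ffmpeg -i source.mp4 -vf "fps=${fps},tile=3x3" -q:v 2 contact_sheets/sheet_%02d.jpg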

Read every contact sheet

Before any prompt is written, read all contact sheets and build a precise frame-by-frame action breakdown. Note:

  • Where each hard cut happens (timestamp)
  • What changes between cuts (location, framing, subject pose)
  • Which frames are speaking shots vs B-roll cutaways
  • Which shots are static vs which have visible motion
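
The breakdown format is up to you; a plain-text log with one line per scene and one per cut is enough. A sketch (the timestamps, locations, and framing notes below are hypothetical):

cat > breakdown.txt <<'EOF'
scene 01 | 00:00.0-00:04.5 | speaking | static | kitchen, chest-up
cut @ 00:04.5 | location change, framing widens
scene 02 | 00:04.5-00:09.0 | speaking | slow push-in | same kitchen
EOF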

Step 3: Map frames to scenes

A "scene" is the unit between two hard cuts. Each scene becomes one Veo clip.

Count hard cuts in the source

This defines the scene count. A 30-second video with 4 hard cuts produces 5 scenes. Don't add or remove scenes — the source's structure is the structure.
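
ffmpeg's scene-change score can produce a candidate cut list to sanity-check your manual count. A first-pass sketch only: the 0.3 threshold is a guess to tune, and every candidate must be verified against the contact sheets.

# Print timestamps of frames whose scene-change score exceeds 0.3.
ffmpeg -i source.mp4 \
  -vf "select='gt(scene,0.3)',metadata=print" \
  -an -f null - 2>&1 | grep pts_time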

Classify each scene

| Scene type | What it is | How to build |
| --- | --- | --- |
| Speaking scene | Avatar in frame, delivering dialogue | One image gen (start frame, often used as end frame too) + one Veo3 clip |
| B-roll cutaway | Speaker NOT in frame, audio continues over different visual | One image gen + one Veo3 clip. Run manually, not wired into the .nbflow. |
| Transformation clip | Single clip showing a change (before → after, diseased → healthy, empty → full) | Two image gens (start + end frames showing the transformation) + one Veo3 clip with both frames wired in |
| Per-step sequence | Recipe with multiple ingredients, demonstration with steps | One image gen per step + one Veo3 clip per step. Trim each to 2-3s in post. |
| Low-movement scene | Talking head, static framing, minimal motion | Same image as start AND end frame (prevents drift) |

B-roll detection — critical

For video copies specifically: rewatch the source and ask:

Are there hard cuts to clips where the speaker is NOT in frame, with their audio continuing?

  • Yes → those are B-roll. Note them as B-roll scenes.
  • No → the workflow has no B-roll. Set B-roll density to None.

Don't invent B-roll

Adding B-roll because the workflow "would benefit from cutaways" is a video-copy violation. If the source has no B-roll, the copy has no B-roll. Period.

What's not B-roll:

  • A framing shift within the same continuous speaking sequence (camera moves closer, avatar repositions) — that's the same scene
  • A scene cut to a different location where the speaker is still on camera — that's a new speaking scene, not B-roll

Step 4: Extract reference frames per scene

For each scene, pull the single best frame from the original video that captures the target composition. Save them to projects/{month}/{brand}/Assets/{workflow}/reference_frames/scene_NN.jpg.

# Pull a frame at a specific timestamp
ffmpeg -i source.mp4 -ss 00:00:08.5 -vframes 1 -q:v 2 \
  projects/{month}/{brand}/Assets/{workflow}/reference_frames/scene_01.jpg
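
With the cut timestamps from step 3, a small loop covers every scene in one pass. A sketch; the scene numbers and timestamps below are placeholders:

# Batch-extract one reference frame per scene from a scene/timestamp list.
while read -r scene ts; do
  # -nostdin stops ffmpeg from consuming the loop's input stream
  ffmpeg -nostdin -i source.mp4 -ss "$ts" -vframes 1 -q:v 2 \
    "projects/{month}/{brand}/Assets/{workflow}/reference_frames/scene_${scene}.jpg"
done <<'EOF'
01 00:00:01.0
02 00:00:05.2
03 00:00:08.5
EOF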

Use this image as the visual reference when Claude builds prompts. "Make it look like THIS" beats a paragraph of description, especially for:

  • Complex B-roll (organs, recipe close-ups, 3D renders)
  • Specific camera angles you can't describe precisely
  • Lighting setups that defy categorization

Step 5: Prompts get built

For each scene, Claude hands the reference frame + a brief scene description to the right agent:

Image prompts
The Image Prompter reads the reference frame for visual details and emits a JSON prompt. It needs: subject (avatar archetype), wardrobe (per account), environment, framing, key composition notes.
Speaking video prompts
The Veo Prompter uses the universal talking-head template — only the dialogue line varies. It needs: the dialogue line + any pose/motion notes per scene.
B-roll video prompts
Also Veo Prompter. Natural language, no dialogue, ambient audio only. It needs: the reference frame + brief motion description (e.g. "slow pour from kettle into mug").

You don't drive this — once you've approved the scene mapping, prompts get built automatically.

What's happening behind the scenes

The Manager hands each scene's brief to the right sub-agent. The Image Prompter and Veo Prompter both READ the reference frames you extracted — that's why frame extraction matters. They're not working from a description; they're working from the actual visual.

Each agent applies its own expertise (Image Prompter handles the in-frame rule, POV rule, framing language; Veo Prompter handles macro-vs-micro motion, the universal talking-head template). The Manager doesn't pre-write prompts and ask agents to format them — that breaks delegation and produces worse output.

Step 6: Build the .nbflow

The PatchWork Importer assembles all the prompt files into a working .nbflow:

  • All the prompt files from step 5
  • Scene structure (which scenes share reference images, which have start+end frame pairs)
  • Reference image groups (avatar character ref per account, product photo, any other shared refs)

For a multi-account video copy, the PatchWork Importer builds one tab per account in the same .nbflow file. See the Fan-out Protocol for the multi-tab structure.

The output lands at projects/{month}/{brand}/{growth|sales}/testing/{workflow}-V0-1.nbflow.

Step 7: Generate

Run the Generation Runner. Before invoking, run the pre-generation sanity check to catch schema issues that would otherwise surface mid-generation.

Step 8: Compare against the source

This is the step that separates a good copy from a passable one. Open the -generated.nbflow in PatchWork and the source video side-by-side. For each scene, compare:

  • Framing matches source
  • Lighting matches source
  • Subject pose / energy matches source
  • Pacing of motion matches source
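
If flipping between two players makes the comparison slippery, render the clips side by side. A convenience sketch; the generated clip's file name is a placeholder:

# Source and generated clip side by side; heights normalized so hstack can join them.
ffmpeg -i source.mp4 -i generated_scene_01.mp4 -filter_complex \
  "[0:v]scale=-2:720[left];[1:v]scale=-2:720[right];[left][right]hstack=inputs=2" \
  -an side_by_side.mp4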

Where they diverge, identify the cause and regenerate:

| Divergence | Likely cause | Fix |
| --- | --- | --- |
| Framing wrong (different camera angle) | Image prompt didn't pin the framing tightly | Add explicit framing language to image prompt |
| Lighting wrong | Image prompt didn't describe lighting | Add lighting description matching source |
| Subject pose wrong | Pose prompt was too vague | Add explicit pose direction + reference frame |
| Motion too fast / too slow | Veo prompt motion description wrong | Tighten motion description (or add macro motion qualifiers) |
| Avatar's face drifts mid-clip | Start/end frames too different | Use same image as start AND end |

Bump V0-N for each iteration. Aim for visual parity with the source before moving on.

Transformation clips — special handling

When a scene shows a transformation in a single clip (a wound healing, a glass filling, a face aging), you need two image gens (the "before" frame and the "after" frame) wired into one Veo3 node as start frame and end frame.

flowchart LR
    P1[Plain Prompt: before composition] --> NB1[NanobananaAPI<br/>before]
    P2[Plain Prompt: after composition] --> NB2[NanobananaAPI<br/>after]
    NB1 --> A1[Approve]
    NB2 --> A2[Approve]
    PT[Plain Prompt: transformation motion] --> V[Veo3]
    A1 --> V
    A2 --> V
    V --> A3[Approve]

Veo interpolates the visual change between the two frames. The Veo prompt describes the motion (slow fade, gradual transition, dissolve, etc.) but the visual change comes from the frame difference, not the prompt.
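
The before and after compositions come straight from the source: pull one frame near the start of the transformation scene and one near its end. A sketch with placeholder timestamps and scene number:

# Before/after reference frames for a transformation scene.
ffmpeg -i source.mp4 -ss 00:00:12.0 -vframes 1 -q:v 2 reference_frames/scene_04_before.jpg
ffmpeg -i source.mp4 -ss 00:00:18.5 -vframes 1 -q:v 2 reference_frames/scene_04_after.jpg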

Per-step sequences — special handling

For recipes or demonstrations with multiple discrete steps, build one image gen per step and one Veo3 clip per step. Each clip is 8 seconds; trim to 2-3 seconds per step in post-production.

The dynamic prompt mechanism handles this cleanly: one Dynamic Prompt with N rows (one per step) feeds one template → one image gen produces N images → one Veo3 produces N clips. Each clip is trimmed in post.
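
The post trim is a one-liner per clip. A sketch, assuming raw step clips named step_NN_raw.mp4 and a 2.5-second keep duration:

# Keep the first 2.5 seconds of each 8-second step clip.
# Re-encode instead of stream-copying so the cut lands exactly on 2.5s.
for f in step_*_raw.mp4; do
  ffmpeg -i "$f" -t 2.5 -c:v libx264 -c:a aac "${f%_raw.mp4}.mp4"
done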

Compliance with the framing rule

Reference frames extracted from the source dictate the framing of the generated outputs. When you pass a chest-up reference frame to the Image Prompter, the prompt should declare chest-up framing — and then describe only chest-up content. See Image Prompt Rules.

If the source has a tighter crop than what your account's avatar reference sheet shows, that's fine — the avatar ref locks the face, not the body or wardrobe outside the frame.

Checklist

  • Source video downloaded and saved to Assets/{workflow}/source.mp4
  • Contact sheets extracted at 2fps
  • All contact sheets read; frame-by-frame action breakdown built
  • Hard cuts counted; scene count locked to source
  • Each scene classified (speaking / B-roll / transformation / per-step / low-movement)
  • B-roll presence confirmed (or confirmed absent — density set accordingly)
  • Reference frame extracted per scene
  • Image and Veo prompts built, with reference frames attached
  • .nbflow built and saved as V0-1.nbflow
  • Pre-generation sanity check passes
  • Generation Runner completes
  • Outputs reviewed against source — divergences flagged and regenerated
  • Iteration complete; V0-N clean on test account
  • Fan-out to other accounts via Fan-out Protocol

What NOT to do

  • Generate before reading the contact sheets in full
  • Add scenes the source doesn't have
  • Add B-roll where the source has none
  • Use start-only frames for transformation clips
  • Skip reference frame extraction (description alone is rarely tight enough)
  • Trim post-generation when you could have trimmed at the storyboard stage (saves Veo budget)
  • Treat a video copy as a starting point for "improvements" — copy first, improve in subsequent V0-N iterations or as a Lvl 3/4 variant

When you're ready

Next: Compliance Audit. Whatever you build, run a compliance check before fan-out. Banned-word slip-ups can tank a video on TikTok Shop.