PiP (Picture-in-Picture) Format

A mixed-media composition pattern: the avatar appears as a small overlay on top of a rendered background scene instead of as a full-frame talking head. It is used for recipe demos, anatomy explainers, and other content where the visual context matters more than the speaker.

What PiP looks like

┌──────────────────────────────────┐
│                                  │
│     [background scene fills      │
│        most of the frame]        │
│                                  │
│                                  │
│                  ┌──────────┐    │
│                  │  avatar  │    │
│                  │ overlay  │    │
│                  └──────────┘    │
└──────────────────────────────────┘

The avatar is typically a small circle or rounded rectangle in a corner — usually bottom-right or top-right. The background fills the rest of the frame.
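The corner placement described above can be sketched as a small geometry helper. This is a minimal illustration, not the editor's actual tooling: the function name `overlay_box`, the 30% scale, the 40-pixel margin, and the 1080x1920 frame size are all assumptions chosen for the example.

```python
from dataclasses import dataclass

CORNERS = {"top-left", "top-right", "bottom-left", "bottom-right"}


@dataclass(frozen=True)
class Box:
    x: int
    y: int
    width: int
    height: int


def overlay_box(frame_w, frame_h, corner="bottom-right", scale=0.3, margin=40):
    """Return a square overlay rectangle tucked into one corner of the frame.

    scale is the overlay side as a fraction of the shorter frame edge
    (an assumed convention, not a documented one).
    """
    if corner not in CORNERS:
        raise ValueError(f"unknown corner: {corner!r}")
    side = round(min(frame_w, frame_h) * scale)
    x = margin if corner.endswith("left") else frame_w - side - margin
    y = margin if corner.startswith("top") else frame_h - side - margin
    return Box(x, y, side, side)


if __name__ == "__main__":
    print(overlay_box(1080, 1920))               # bottom-right by default
    print(overlay_box(1080, 1920, "top-right"))
```

Keeping the box as data rather than baking it into a render call makes it easy to check that the background's main subject does not sit underneath the overlay.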

When PiP works well

Content type	PiP fit
Recipe demo (showing ingredients getting mixed)	Strong fit — background shows the cooking, avatar adds commentary
Anatomy explainer (showing organs / mechanism)	Strong fit — background shows the science, avatar narrates
Product unboxing / demo	Strong fit — background shows the product, avatar reacts
Educational explainer with diagrams	Strong fit — background shows the diagrams, avatar teaches
Pure talking head	NOT a fit — no useful background context
Story / confessional	NOT a fit — full-frame avatar is more intimate
Quick CTA / direct sales	NOT a fit — too much going on visually

How PiP is built

PiP isn't a single image gen followed by a single Veo gen. It's two streams, each generated separately and then composited in post-production:

flowchart LR
    A[Background image gen<br/>NanoBanana 2] --> A1[Approve]
    A1 --> A2[Veo clip of background<br/>animated]
    A2 --> Post[Post-production<br/>compositing]
    B[Avatar image gen<br/>selfie on white bg] --> B1[Approve]
    B1 --> B2[Veo clip of avatar<br/>talking head]
    B2 --> Post
    Post --> Final[Final PiP video]

The background gets its own image + Veo gen. The avatar gets its own image + Veo gen (specifically, a selfie-style image on a white/neutral background that can be keyed out in post). The editor composites them together — avatar overlaid on the animated background.
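One way to picture the two-stream flow is as a pair of records that each pass through the same stages (image gen, approval, Veo clip) before compositing is allowed. The sketch below is hypothetical: the `Stream` class, its field names, and the file paths are illustrative and do not describe any real pipeline API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stream:
    """One of the two PiP streams: 'background' or 'avatar'."""
    name: str
    image_prompt: str
    image_approved: bool = False
    clip_path: Optional[str] = None  # filled in once the Veo clip exists

    def ready(self) -> bool:
        # A stream is ready for post only after approval AND clip generation.
        return self.image_approved and self.clip_path is not None


def can_composite(background: Stream, avatar: Stream) -> bool:
    """Compositing needs both streams finished, mirroring the flowchart."""
    return background.ready() and avatar.ready()


if __name__ == "__main__":
    bg = Stream("background", "ingredients on a marble countertop")
    av = Stream("avatar", "selfie on a white background")
    print(can_composite(bg, av))  # nothing approved or rendered yet

    bg.image_approved, bg.clip_path = True, "bg.mp4"
    av.image_approved, av.clip_path = True, "avatar.mp4"
    print(can_composite(bg, av))
```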

Why the avatar uses a white background

The avatar's image gen for PiP is a selfie shot on a clean white or neutral background, not in the setting that the background scene depicts.

Why: the editor needs to key out the background of the avatar shot to overlay it on the rendered background scene. A consistent neutral background makes the keying clean. A real-setting background would have textures that confuse the keyer.

This is one of the rare cases where the avatar's environment in the image prompt differs from the workflow's setting.
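The source does not say which tool the editor uses for keying. As a rough illustration only, assuming ffmpeg as a stand-in, the key-and-overlay step could be expressed with its `colorkey`, `scale`, and `overlay` filters. The similarity and blend thresholds, the overlay size, the margin, and the file names below are placeholder values that would need tuning on real footage.

```python
def pip_filter(size=324, margin=40, similarity=0.1, blend=0.05):
    """Build an ffmpeg filter graph: key out white, shrink, overlay bottom-right.

    Input 0 is the background clip, input 1 is the avatar clip.
    All numeric defaults are illustrative assumptions.
    """
    return (
        f"[1:v]colorkey=white:{similarity}:{blend},scale={size}:-1[avatar];"
        f"[0:v][avatar]overlay=W-w-{margin}:H-h-{margin}:shortest=1[out]"
    )


def pip_command(background, avatar, output):
    """Return the argument list for ffmpeg; this function does not run it."""
    return [
        "ffmpeg",
        "-i", background,
        "-i", avatar,
        "-filter_complex", pip_filter(),
        "-map", "[out]",
        "-map", "1:a?",  # keep the avatar's audio track if it has one
        output,
    ]


if __name__ == "__main__":
    print(" ".join(pip_command("bg.mp4", "avatar.mp4", "pip.mp4")))
```

A uniform backdrop matters here because a colour key removes every pixel close to one colour; textures or shadows in a real setting would fall outside the threshold and survive as fringe.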

Background generation

The background is its own image gen with its own prompt. It needs to:

Show the content context (ingredients, anatomy, product, diagram)
Be animatable (Veo will turn it into a moving clip)
Leave space for the avatar overlay (the avatar typically lands in a corner, so the background's visual interest should sit in the rest of the frame)

Common background scenes:

Recipe ingredients on a marble countertop being mixed
A cross-section of organ tissue being illustrated
A product bottle being held / poured / opened
A diagram or chart with elements appearing one by one

PiP and B-roll density

PiP is itself a format, not a B-roll density level. A PiP workflow has:

The PiP main video (background + avatar PiP composite)
Optionally, some B-roll cutaways (rarer, since the PiP already shows the visual context)

PiP workflows typically have Low or None B-roll density — the PiP format itself is doing the visual-context job that B-roll usually does in talking-head formats.

When the brief should specify PiP

Specify the PiP format explicitly when the brief is for:

A recipe explainer
An ingredient walkthrough
A demo / unboxing
An educational anatomy / mechanism video

Otherwise the Visual Planner defaults to talking-head, and you end up with a workflow that lacks the visual richness the content needs.
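The defaulting behaviour above suggests a simple pre-flight check on briefs. The sketch below is hypothetical: the brief is modelled as a plain dict, and the keys `content_type` and `format`, along with the content-type labels, are invented for illustration rather than taken from the Visual Planner.

```python
from typing import Dict, List

# Content types from the list above, written as assumed labels.
PIP_CONTENT_TYPES = {
    "recipe explainer",
    "ingredient walkthrough",
    "demo",
    "unboxing",
    "anatomy explainer",
    "mechanism explainer",
}

DEFAULT_FORMAT = "talking-head"


def resolve_format(brief: Dict[str, str]) -> str:
    """Mimic the described default: no explicit format means talking-head."""
    return brief.get("format") or DEFAULT_FORMAT


def lint_brief(brief: Dict[str, str]) -> List[str]:
    """Warn when a PiP-suited brief would silently fall back to the default."""
    warnings = []
    if brief.get("content_type") in PIP_CONTENT_TYPES and not brief.get("format"):
        warnings.append(
            "content type suits PiP but no format is set; "
            "the planner would default to talking-head"
        )
    return warnings


if __name__ == "__main__":
    brief = {"content_type": "recipe explainer"}
    print(resolve_format(brief))
    print(lint_brief(brief))
```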

How PiP affects fan-out

PiP is generally less customized per account than talking-head:

The background is usually STANDARD across all accounts (same ingredients, same diagrams, same product shots)
The avatar overlay is CUSTOMIZED per account (each account's avatar in the overlay)

This makes PiP fan-out simpler than talking-head fan-out — most of the visual content is shared.
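The shared-background, per-account-avatar split can be sketched as a fan-out function that reuses one background clip across every composite job. The function name, the dict keys, and the account names and file paths in the usage example are assumptions made for illustration.

```python
from typing import Dict, List


def fan_out(background_clip: str, avatar_clips: Dict[str, str]) -> List[Dict[str, str]]:
    """Build one composite job per account.

    background_clip is generated once and shared; avatar_clips maps each
    account name to that account's own avatar clip.
    """
    return [
        {"account": account, "background": background_clip, "avatar": clip}
        for account, clip in sorted(avatar_clips.items())
    ]


if __name__ == "__main__":
    jobs = fan_out(
        "recipe_bg.mp4",
        {"account_b": "avatar_b.mp4", "account_a": "avatar_a.mp4"},
    )
    for job in jobs:
        print(job)
```

Under this model, the number of background generations stays at one no matter how many accounts are added; only the avatar stream scales with the account count.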

PiP variant considerations

For variants on a PiP workflow:

Lvl 1 dialogue swap: touches the avatar's voiceover; the background stays the same
Lvl 2 wardrobe swap: touches the avatar overlay only; the background stays the same
Lvl 3 background change: the larger visual change happens in the background, not the avatar
Lvl 4 structural: adds or removes steps in the demo

Most variants on PiP are background-focused, not avatar-focused (the opposite of talking-head).
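The variant levels can be read as a lookup from level to the stream that needs regenerating. The mapping below is a sketch under stated assumptions: treating a Lvl 1 dialogue swap as an avatar-stream regeneration, and a Lvl 4 structural change as touching both streams, are guesses that the source does not confirm.

```python
from typing import Iterable, Set

STREAMS_BY_LEVEL = {
    1: {"avatar"},                # dialogue swap (assumed to re-render the avatar clip)
    2: {"avatar"},                # wardrobe swap
    3: {"background"},            # background change
    4: {"avatar", "background"},  # structural (assumption: both streams affected)
}


def streams_to_regenerate(levels: Iterable[int]) -> Set[str]:
    """Return the union of streams touched by the requested variant levels."""
    touched: Set[str] = set()
    for level in levels:
        if level not in STREAMS_BY_LEVEL:
            raise ValueError(f"unknown variant level: {level}")
        touched |= STREAMS_BY_LEVEL[level]
    return touched


if __name__ == "__main__":
    print(sorted(streams_to_regenerate([1, 2])))
    print(sorted(streams_to_regenerate([3])))
```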

Recipe-specific PiP pattern

A common pattern: recipe demonstrations.

The background shows the recipe steps in sequence — ingredients laid out, then being combined, then the final dish. The avatar overlay narrates each step.

Each step is its own scene:

Scene	Background	Avatar overlay
Scene 01	Empty counter, ingredients laid out	Avatar introducing the recipe
Scene 02	First ingredient being added	Avatar explaining why
Scene 03	Mixing	Avatar tip / variation
Scene 04	Adding next ingredient	Avatar's commentary
Scene 05	Final dish	Avatar wrap-up + soft CTA

This is technically a per-step sequence (each step is its own clip), combined with PiP composition. The two patterns layer together.
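The layering of the two patterns can be sketched as a scene list that expands into two clip jobs per step, one for each stream. The `Scene` class, the `clip_jobs` helper, and the `scene_NN` labels are hypothetical names; the scene descriptions are copied from the table above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Scene:
    number: int
    background: str
    avatar: str


RECIPE_SCENES = [
    Scene(1, "Empty counter, ingredients laid out", "Avatar introducing the recipe"),
    Scene(2, "First ingredient being added", "Avatar explaining why"),
    Scene(3, "Mixing", "Avatar tip / variation"),
    Scene(4, "Adding next ingredient", "Avatar's commentary"),
    Scene(5, "Final dish", "Avatar wrap-up + soft CTA"),
]


def clip_jobs(scenes: List[Scene]) -> List[Tuple[str, str, str]]:
    """Expand each scene into a background job and an avatar job."""
    jobs = []
    for scene in scenes:
        label = f"scene_{scene.number:02d}"
        jobs.append((label, "background", scene.background))
        jobs.append((label, "avatar", scene.avatar))
    return jobs


if __name__ == "__main__":
    for job in clip_jobs(RECIPE_SCENES):
        print(job)
```

Each pair of jobs sharing a label is then composited into one PiP clip, and the clips are joined in scene order.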

When you're ready

→ Next: Storyboarding Logic — the 8-second Veo clip constraint, word counts per scene, where B-roll insertion points land.