
PiP (Picture-in-Picture) Format

A mixed-media composition pattern: the avatar appears as a small overlay on top of a rendered background scene, instead of as a full-frame talking head. Used for recipe demos, anatomy explainers, and other content where the visual context matters more than the speaker.

What PiP looks like

┌──────────────────────────────────┐
│                                  │
│     [background scene fills      │
│        most of the frame]        │
│                                  │
│                                  │
│                  ┌──────────┐    │
│                  │  avatar  │    │
│                  │ overlay  │    │
│                  └──────────┘    │
└──────────────────────────────────┘

The avatar is typically a small circle or rounded rectangle in a corner — usually bottom-right or top-right. The background fills the rest of the frame.
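
The corner placement can be expressed as simple geometry. The following is a minimal sketch, not part of any existing tooling: the function name, the `scale` and `margin` defaults, and the square overlay shape are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    width: int
    height: int


def overlay_rect(frame_w, frame_h, corner="bottom-right", scale=0.3, margin=40):
    """Return the square box for the avatar overlay inside the frame.

    scale is the overlay side as a fraction of the frame's shorter edge.
    scale and margin are assumed defaults, not values from the workflow.
    """
    vertical, horizontal = corner.split("-")
    if vertical not in ("top", "bottom") or horizontal not in ("left", "right"):
        raise ValueError(f"unknown corner: {corner}")
    side = round(min(frame_w, frame_h) * scale)
    x = margin if horizontal == "left" else frame_w - side - margin
    y = margin if vertical == "top" else frame_h - side - margin
    return Rect(x, y, side, side)
```

For a 1080x1920 vertical frame, `overlay_rect(1080, 1920)` places a 324-pixel square 40 pixels in from the bottom-right corner.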

When PiP works well

| Format | PiP fit |
|---|---|
| Recipe demo (showing ingredients getting mixed) | Strong fit: background shows the cooking, avatar adds commentary |
| Anatomy explainer (showing organs / mechanism) | Strong fit: background shows the science, avatar narrates |
| Product unboxing / demo | Strong fit: background shows the product, avatar reacts |
| Educational explainer with diagrams | Strong fit: background shows the diagrams, avatar teaches |
| Pure talking head | NOT a fit: no useful background context |
| Story / confessional | NOT a fit: full-frame avatar is more intimate |
| Quick CTA / direct sales | NOT a fit: too much going on visually |

How PiP is built

PiP isn't a single image gen + Veo gen. It's two streams composited in post:

flowchart LR
    A[Background image gen<br/>NanoBanana 2] --> A1[Approve]
    A1 --> A2[Veo clip of background<br/>animated]
    A2 --> Post[Post-production<br/>compositing]
    B[Avatar image gen<br/>selfie on white bg] --> B1[Approve]
    B1 --> B2[Veo clip of avatar<br/>talking head]
    B2 --> Post
    Post --> Final[Final PiP video]

The background gets its own image + Veo gen. The avatar gets its own image + Veo gen (specifically, a selfie-style image on a white/neutral background that can be keyed out in post). The editor composites them together — avatar overlaid on the animated background.
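
The compositing itself is described here as an editor's job in post. If that step were scripted instead, one possible shape is an ffmpeg filter graph that keys the avatar clip, scales it down, and overlays it on the background clip. The sketch below only builds the command; it does not run it. The function name, file names, and the `colorkey` similarity and blend values are assumptions that would need tuning against real footage.

```python
def build_pip_command(background, avatar, output, scale=0.3, margin=40):
    """Build (but do not run) an ffmpeg command for a PiP composite.

    Input 0 is the animated background clip, input 1 is the avatar clip
    shot on a white backdrop. All numeric values are illustrative guesses.
    """
    filter_graph = (
        # Key out near-white pixels in the avatar clip, then shrink it.
        f"[1:v]colorkey=white:0.10:0.05,scale=iw*{scale}:-1[avatar];"
        # Pin the keyed avatar to the bottom-right corner of the background.
        f"[0:v][avatar]overlay=W-w-{margin}:H-h-{margin}[out]"
    )
    return [
        "ffmpeg",
        "-i", background,
        "-i", avatar,
        "-filter_complex", filter_graph,
        "-map", "[out]",
        "-map", "1:a?",  # keep the avatar's audio track if it has one
        output,
    ]


if __name__ == "__main__":
    print(" ".join(build_pip_command("background.mp4", "avatar.mp4", "pip.mp4")))
```

Returning the argument list rather than executing it keeps the sketch easy to inspect and lets the caller decide how to run it.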

Why the avatar uses a white background

The avatar's image gen for PiP is a selfie shot on a clean white or neutral background, not in the setting that the background scene depicts.

Why: the editor needs to key out the background of the avatar shot to overlay it on the rendered background scene. A consistent neutral background makes the keying clean. A real-setting background would have textures that confuse the keyer.

This is one of the rare cases where the avatar's environment in the image prompt differs from the workflow's setting.
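
To see why a uniform backdrop keys cleanly, consider a toy keyer that turns each pixel's colour distance from the key colour into an alpha value. This is a simplified illustration, not how any particular editor implements keying; the function name and the `tolerance` value are assumptions.

```python
def key_alpha(pixel, key=(255, 255, 255), tolerance=30.0):
    """Return alpha for one RGB pixel: 0.0 is transparent, 1.0 is opaque.

    Pixels close to the key colour become transparent. tolerance is an
    assumed threshold on Euclidean RGB distance.
    """
    distance = sum((p - k) ** 2 for p, k in zip(pixel, key)) ** 0.5
    return min(distance / tolerance, 1.0)


if __name__ == "__main__":
    print(key_alpha((255, 255, 255)))  # pure backdrop pixel
    print(key_alpha((250, 250, 250)))  # slightly shaded backdrop pixel
    print(key_alpha((92, 64, 51)))     # dark hair pixel
```

On a clean white backdrop, every background pixel sits near the key colour and falls below the tolerance. A textured real-world setting spreads background pixels across many colours, so no single key and tolerance removes them without also eating into the subject.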

Background generation

The background is its own image gen with its own prompt. It needs to:

  • Show the content context (ingredients, anatomy, product, diagram)
  • Be animatable (Veo will turn it into a moving clip)
  • Have space for the avatar overlay (the avatar typically lands in a corner, so the background's visual interest should sit in the rest of the frame)

Common background scenes:

  • Recipe ingredients on a marble countertop being mixed
  • A cross-section of organ tissue being illustrated
  • A product bottle being held / poured / opened
  • A diagram or chart with elements appearing one by one

PiP and B-roll density

PiP is itself a format, not a B-roll density level. A PiP workflow has:

  • The PiP main video (background + avatar PiP composite)
  • Optionally some B-roll cutaways (rarer, since the PiP already shows the visual context)

PiP workflows typically have Low or None B-roll density — the PiP format itself is doing the visual-context job that B-roll usually does in talking-head formats.

When the brief should specify PiP

If the brief is for any of the following, it should specify PiP format explicitly:

  • A recipe explainer
  • An ingredient walkthrough
  • A demo / unboxing
  • An educational anatomy / mechanism video

Otherwise the Visual Planner defaults to talking-head, and the resulting workflow lacks the visual richness the content needs.
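
A brief check along these lines could catch the omission before planning starts. This is a hypothetical sketch: the brief is modelled as a plain dict, and the `format` and `content_type` keys, the content-type labels, and both function names are invented for illustration. Only the talking-head default comes from the behaviour described above.

```python
# Content types that this page says should use PiP (labels are assumed).
PIP_CONTENT_TYPES = {
    "recipe explainer",
    "ingredient walkthrough",
    "demo",
    "unboxing",
    "anatomy explainer",
    "mechanism explainer",
}


def resolve_format(brief):
    """Return the format a brief resolves to; an explicit value wins."""
    return brief.get("format") or "talking-head"


def missing_pip_warning(brief):
    """Return a warning string if the brief should name PiP but does not."""
    content_type = brief.get("content_type")
    if content_type in PIP_CONTENT_TYPES and resolve_format(brief) != "pip":
        return f"'{content_type}' briefs should set format to 'pip' explicitly"
    return None
```

A brief such as `{"content_type": "unboxing"}` resolves to talking-head and triggers the warning, while `{"content_type": "unboxing", "format": "pip"}` passes.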

How PiP affects fan-out

PiP is generally less customized per account than talking-head:

  • The background is usually STANDARD across all accounts (same ingredients, same diagrams, same product shots)
  • The avatar overlay is CUSTOMIZED per account (each account's avatar in the overlay)

This makes PiP fan-out simpler than talking-head fan-out — most of the visual content is shared.
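
The split between shared and per-account work can be sketched as a job list. The function name and job fields below are hypothetical. The code also treats the background as always shared, whereas the text above says only that it usually is.

```python
def pip_fanout_jobs(accounts, scenes):
    """List generation jobs for a PiP workflow fanned out across accounts."""
    # Backgrounds are generated once and reused by every account.
    jobs = [{"stream": "background", "scene": s, "account": None} for s in scenes]
    # Avatar overlays are generated once per account per scene.
    jobs += [
        {"stream": "avatar", "scene": s, "account": a}
        for a in accounts
        for s in scenes
    ]
    return jobs


if __name__ == "__main__":
    for job in pip_fanout_jobs(["account_a", "account_b"], ["scene_01", "scene_02"]):
        print(job)
```

Jobs with `account` set to `None` are the shared ones; everything else is customized per account.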

PiP variant considerations

For variants on a PiP workflow:

  • Lvl 1 dialogue swap — touches the avatar's voiceover; background stays the same
  • Lvl 2 wardrobe swap — touches the avatar overlay only; background stays
  • Lvl 3 background change — the bigger visual change happens in the background, not the avatar
  • Lvl 4 structural — adding / removing steps in the demo

Most variants on PiP are background-focused, not avatar-focused (the opposite of talking-head).
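
One way to make the level-to-stream mapping explicit is a small lookup table. The structure and names here are illustrative. Levels 1 to 3 follow the list above; treating Level 4 as touching both streams is an assumption, since the text only says that steps are added or removed.

```python
VARIANT_IMPACT = {
    1: {"name": "dialogue swap", "streams": {"avatar"}},
    2: {"name": "wardrobe swap", "streams": {"avatar"}},
    3: {"name": "background change", "streams": {"background"}},
    # Assumption: a structural change adds or removes whole scenes,
    # so both streams are affected.
    4: {"name": "structural", "streams": {"avatar", "background"}},
}


def streams_touched(level):
    """Return the set of PiP streams a variant level touches."""
    if level not in VARIANT_IMPACT:
        raise ValueError(f"unknown variant level: {level}")
    return set(VARIANT_IMPACT[level]["streams"])


def background_is_reusable(level):
    """True when the existing background clips can be kept for this variant."""
    return "background" not in streams_touched(level)
```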

Recipe-specific PiP pattern

A common pattern: recipe demonstrations.

The background shows the recipe steps in sequence — ingredients laid out, then being combined, then the final dish. The avatar overlay narrates each step.

Each step is its own scene:

| Scene | Background | Avatar overlay |
|---|---|---|
| Scene 01 | Empty counter, ingredients laid out | Avatar introducing the recipe |
| Scene 02 | First ingredient being added | Avatar explaining why |
| Scene 03 | Mixing | Avatar tip / variation |
| Scene 04 | Adding next ingredient | Avatar's commentary |
| Scene 05 | Final dish | Avatar wrap-up + soft CTA |

This is technically a per-step sequence (each step is its own clip), combined with PiP composition. The two patterns layer together.
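
The layering of the two patterns can be modelled as data: a list of scenes, each carrying a background description and an avatar description. The class and function names below are hypothetical, and the sketch assumes each scene needs exactly one background clip and one avatar clip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipScene:
    scene_id: str
    background: str
    avatar_overlay: str


RECIPE_SCENES = [
    PipScene("Scene 01", "Empty counter, ingredients laid out", "Avatar introducing the recipe"),
    PipScene("Scene 02", "First ingredient being added", "Avatar explaining why"),
    PipScene("Scene 03", "Mixing", "Avatar tip / variation"),
    PipScene("Scene 04", "Adding next ingredient", "Avatar's commentary"),
    PipScene("Scene 05", "Final dish", "Avatar wrap-up + soft CTA"),
]


def clips_needed(scenes):
    """Return (scene_id, stream) pairs: one background and one avatar clip per scene."""
    return [(s.scene_id, stream) for s in scenes for stream in ("background", "avatar")]


if __name__ == "__main__":
    for scene_id, stream in clips_needed(RECIPE_SCENES):
        print(scene_id, stream)
```

Under that assumption, the five-scene recipe above needs ten clips in total before compositing.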

When you're ready

Next: Storyboarding Logic — the 8-second Veo clip constraint, word counts per scene, where B-roll insertion points land.