PiP (Picture-in-Picture) Format
A mixed-media composition pattern: the avatar appears as a small overlay on top of a rendered background scene, instead of as a full-frame talking head. Used for recipe demos, anatomy explainers, and other content where the visual context matters more than the speaker.
What PiP looks like
┌──────────────────────────────────┐
│                                  │
│     [background scene fills      │
│       most of the frame]         │
│                                  │
│                                  │
│                    ┌──────────┐  │
│                    │  avatar  │  │
│                    │  overlay │  │
│                    └──────────┘  │
└──────────────────────────────────┘
The avatar is typically a small circle or rounded rectangle in a corner — usually bottom-right or top-right. The background fills the rest of the frame.
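As a rough illustration of the corner placement described above, the overlay position can be computed from the frame size. This is a minimal sketch: the function name, the 1080x1920 frame, the 30% scale, and the 40-pixel margin are all assumptions chosen for the example, not values from this guide.

```python
def overlay_rect(frame_w, frame_h, corner="bottom-right", scale_percent=30, margin=40):
    """Return (x, y, size) for a square avatar overlay anchored in one corner.

    scale_percent and margin are illustrative defaults (assumptions).
    """
    # Size the overlay relative to the shorter frame edge, using integer math.
    size = min(frame_w, frame_h) * scale_percent // 100
    x = frame_w - size - margin if corner.endswith("right") else margin
    y = frame_h - size - margin if corner.startswith("bottom") else margin
    return x, y, size


print(overlay_rect(1080, 1920))               # (716, 1556, 324)
print(overlay_rect(1080, 1920, "top-right"))  # (716, 40, 324)
```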
When PiP works well
| Format | PiP fit |
|---|---|
| Recipe demo (showing ingredients getting mixed) | Strong fit — background shows the cooking, avatar adds commentary |
| Anatomy explainer (showing organs / mechanism) | Strong fit — background shows the science, avatar narrates |
| Product unboxing / demo | Strong fit — background shows the product, avatar reacts |
| Educational explainer with diagrams | Strong fit — background shows the diagrams, avatar teaches |
| Pure talking head | NOT a fit — no useful background context |
| Story / confessional | NOT a fit — full-frame avatar is more intimate |
| Quick CTA / direct sales | NOT a fit — too much going on visually |
How PiP is built
PiP isn't a single image gen + Veo gen. It's two streams composited in post:
flowchart LR
A[Background image gen<br/>NanoBanana 2] --> A1[Approve]
A1 --> A2[Veo clip of background<br/>animated]
A2 --> Post[Post-production<br/>compositing]
B[Avatar image gen<br/>selfie on white bg] --> B1[Approve]
B1 --> B2[Veo clip of avatar<br/>talking head]
B2 --> Post
Post --> Final[Final PiP video]
The background gets its own image + Veo gen. The avatar gets its own image + Veo gen (specifically, a selfie-style image on a white/neutral background that can be keyed out in post). The editor composites them together — avatar overlaid on the animated background.
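If the compositing step were scripted with ffmpeg rather than done by hand in an editing tool, it might look roughly like the sketch below. The function name, file names, key tolerance values, and overlay size are assumptions for illustration; the `colorkey`, `scale`, and `overlay` filters are standard ffmpeg filters. The function only builds the command and does not run it.

```python
def build_pip_command(background, avatar, output, key_color="white", size=324, margin=40):
    """Build (but do not run) an ffmpeg command that keys and overlays the avatar.

    All defaults are illustrative assumptions, not settings from this guide.
    """
    filter_graph = (
        # Key out the neutral backdrop of the avatar clip, then shrink it.
        f"[1:v]colorkey={key_color}:0.1:0.05,scale={size}:-1[pip];"
        # Place the keyed avatar in the bottom-right corner of the background.
        f"[0:v][pip]overlay=W-w-{margin}:H-h-{margin}[out]"
    )
    return [
        "ffmpeg",
        "-i", background,
        "-i", avatar,
        "-filter_complex", filter_graph,
        "-map", "[out]",
        "-map", "1:a?",  # keep the avatar clip's audio if it has any
        "-shortest",
        output,
    ]


print(" ".join(build_pip_command("background.mp4", "avatar.mp4", "pip.mp4")))
```

Both inputs stay separate files until this step, which is what lets either stream be regenerated without redoing the other.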
Why the avatar uses a white background
The avatar's image gen for PiP is a selfie shot on a clean white or neutral background, not in the setting that the background scene will show.
Why: the editor needs to key out the background of the avatar shot to overlay it on the rendered background scene. A consistent neutral background makes the keying clean. A real-setting background would have textures that confuse the keyer.
This is one of the rare cases where the avatar's environment in the image prompt differs from the workflow's setting.
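A toy example can make the keying argument concrete. The sketch below is a deliberately simplified keyer (real keyers are more sophisticated), and the pixel values and tolerance are invented for illustration. It shows that near-white backdrop pixels all fall within one tolerance of the key color, while textured pixels do not.

```python
def key_alpha(pixel, key=(255, 255, 255), tolerance=30):
    """Return 0 (transparent) if the pixel is close to the key color, else 255."""
    distance = max(abs(c - k) for c, k in zip(pixel, key))
    return 0 if distance <= tolerance else 255


# Invented sample pixels (RGB): a clean white backdrop vs. a textured real setting.
clean_backdrop = [(255, 255, 255), (250, 252, 251), (248, 248, 250)]
textured_backdrop = [(255, 255, 255), (190, 170, 140), (90, 110, 80)]

print([key_alpha(p) for p in clean_backdrop])     # [0, 0, 0]
print([key_alpha(p) for p in textured_backdrop])  # [0, 255, 255]
```

With the textured backdrop, two of the three background pixels survive the key and would show up as fringe around the avatar in the composite.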
Background generation
The background is its own image gen with its own prompt. It needs to:
- Show the content context (ingredients, anatomy, product, diagram)
- Be animatable (Veo will turn it into a moving clip)
- Have space for the avatar overlay (typically the avatar will land in a corner; the background should have visual interest in the rest of the frame)
Common background scenes:
- Recipe ingredients on a marble countertop being mixed
- A cross-section of organ tissue being illustrated
- A product bottle being held / poured / opened
- A diagram or chart with elements appearing one by one
PiP and B-roll density
PiP is itself a format, not a B-roll density level. A PiP workflow has:
- The PiP main video (background + avatar PiP composite)
- Optionally some B-roll cutaways (rarer, since the PiP already shows the visual context)
PiP workflows typically have Low or None B-roll density — the PiP format itself is doing the visual-context job that B-roll usually does in talking-head formats.
When the brief should specify PiP
Specify PiP format explicitly when the brief is for:
- A recipe explainer
- An ingredient walkthrough
- A demo / unboxing
- An educational anatomy / mechanism video
Otherwise the Visual Planner defaults to talking-head, and you end up with a workflow that lacks the visual richness the content needs.
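The default behavior described above could be caught early with a small check on the brief. This is a hypothetical sketch: the brief is modeled as a plain dict, and the `content_type` and `format` keys and both function names are assumptions, not the Visual Planner's actual interface.

```python
# Hypothetical content types, taken from the list above; a real brief schema may differ.
PIP_CONTENT_TYPES = {
    "recipe explainer",
    "ingredient walkthrough",
    "demo / unboxing",
    "anatomy / mechanism video",
}


def resolve_format(brief):
    """Use the format the brief names; otherwise fall back to talking-head."""
    return brief.get("format") or "talking-head"


def check_brief(brief):
    """Return a warning when PiP-suited content has no explicit format."""
    if brief.get("content_type") in PIP_CONTENT_TYPES and not brief.get("format"):
        return "content suits PiP but no format is set; default is talking-head"
    return None


brief = {"content_type": "recipe explainer"}
print(resolve_format(brief))  # talking-head
print(check_brief(brief))
```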
How PiP affects fan-out
PiP is generally less customized per account than talking-head:
- The background is usually STANDARD across all accounts (same ingredients, same diagrams, same product shots)
- The avatar overlay is CUSTOMIZED per account (each account's avatar in the overlay)
This makes PiP fan-out simpler than talking-head fan-out — most of the visual content is shared.
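The saving can be sketched as a job count. The function and account names below are placeholders, and the comparison assumes one background stream and one avatar stream per video.

```python
def plan_fanout(accounts):
    """List generation jobs: one shared background, one avatar per account."""
    jobs = [{"stream": "background", "account": "shared"}]
    jobs += [{"stream": "avatar", "account": name} for name in accounts]
    return jobs


accounts = ["account_a", "account_b", "account_c"]  # placeholder names
jobs = plan_fanout(accounts)
print(len(jobs))          # 4 jobs with a shared background
print(2 * len(accounts))  # 6 jobs if every account had its own background
```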
PiP variant considerations
For variants on a PiP workflow:
- Lvl 1 dialogue swap — touches the avatar's voiceover; background stays the same
- Lvl 2 wardrobe swap — touches the avatar overlay only; background stays
- Lvl 3 background change — the bigger visual change happens in the background, not the avatar
- Lvl 4 structural — adding / removing steps in the demo
Most variants on PiP are background-focused, not avatar-focused (the opposite of talking-head).
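One way to read the list above is as a lookup from variant level to the stream that has to be regenerated. The names below are hypothetical, and treating Lvl 4 as touching both streams is an assumption, since adding or removing steps would plausibly change both the background clips and the narration.

```python
VARIANT_STREAMS = {
    1: {"avatar"},                # Lvl 1 dialogue swap
    2: {"avatar"},                # Lvl 2 wardrobe swap
    3: {"background"},            # Lvl 3 background change
    4: {"avatar", "background"},  # Lvl 4 structural (assumed to touch both)
}


def streams_to_regenerate(level):
    """Return the sorted list of streams a variant level touches."""
    return sorted(VARIANT_STREAMS[level])


print(streams_to_regenerate(2))  # ['avatar']
print(streams_to_regenerate(3))  # ['background']
```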
Recipe-specific PiP pattern
A common pattern: recipe demonstrations.
The background shows the recipe steps in sequence — ingredients laid out, then being combined, then the final dish. The avatar overlay narrates each step.
Each step is its own scene:
| Scene | Background | Avatar overlay |
|---|---|---|
| Scene 01 | Empty counter, ingredients laid out | Avatar introducing the recipe |
| Scene 02 | First ingredient being added | Avatar explaining why |
| Scene 03 | Mixing | Avatar tip / variation |
| Scene 04 | Adding next ingredient | Avatar's commentary |
| Scene 05 | Final dish | Avatar wrap-up + soft CTA |
This is technically a per-step sequence (each step is its own clip), combined with PiP composition. The two patterns layer together.
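The layering of the two patterns can be sketched as data: each scene carries one background description and one avatar description, and each produces a pair of clips to composite. The class, field names, and clip naming scheme are hypothetical; the scene contents come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipScene:
    number: int
    background: str
    avatar: str


RECIPE_SCENES = [
    PipScene(1, "Empty counter, ingredients laid out", "Introducing the recipe"),
    PipScene(2, "First ingredient being added", "Explaining why"),
    PipScene(3, "Mixing", "Tip / variation"),
    PipScene(4, "Adding next ingredient", "Commentary"),
    PipScene(5, "Final dish", "Wrap-up + soft CTA"),
]


def clip_names(scenes):
    """Return (background_clip, avatar_clip) name pairs, one pair per scene."""
    return [
        (f"scene_{s.number:02d}_background", f"scene_{s.number:02d}_avatar")
        for s in scenes
    ]


for pair in clip_names(RECIPE_SCENES):
    print(pair)
```

Five scenes therefore mean ten clips in total, composited pairwise and then cut together in order.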
When you're ready
→ Next: Storyboarding Logic — the 8-second Veo clip constraint, word counts per scene, where B-roll insertion points land.