Storyboarding Logic
How a script becomes a sequence of scenes — and the constraints that shape every storyboarding decision.
The fundamental constraint
A single Veo clip runs about 8 seconds. That's the hard ceiling. Every speaking scene in your storyboard is one Veo clip, so every speaking scene must fit in ~8 seconds of dialogue + pause + delivery.
A 60-second script becomes roughly 8 scenes of 7-8 seconds each. A 90-second script becomes roughly 12 scenes.
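The scene counts above follow from simple division. Here is a minimal sketch of that arithmetic, assuming an average scene length of 7.5 seconds (the midpoint of the 7-8 second range). The function name and default are illustrative, not part of the Visual Planner.

```python
import math

MAX_CLIP_SECONDS = 8.0  # ceiling for one Veo clip, as stated above


def estimate_scene_count(script_seconds: float, avg_scene_seconds: float = 7.5) -> int:
    """Rough number of speaking scenes needed for a script of the given length.

    avg_scene_seconds is an assumed average; it must not exceed the clip ceiling.
    """
    if not 0 < avg_scene_seconds <= MAX_CLIP_SECONDS:
        raise ValueError("average scene length must be in (0, 8] seconds")
    return math.ceil(script_seconds / avg_scene_seconds)


print(estimate_scene_count(60))  # 8
print(estimate_scene_count(90))  # 12
```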
Word budget per scene
In English, the typical conversational delivery rate is 150-170 words per minute (2.5-2.8 words per second).
For an 8-second scene:
| Delivery style | Word count target |
|---|---|
| Calm / measured (e.g., warm naturopath) | 18-22 words |
| Standard conversational | 20-24 words |
| Quick / energetic | 24-28 words |
| Maximum reasonable in 8s | 30 words |
If a body segment of the script is much longer than ~28 words, it needs to be split into two scenes.
The Visual Planner does this math automatically when building the storyboard. You see the result; you don't need to count words yourself.
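If you want to sanity-check a segment by hand, the math is roughly as follows. This is a sketch only: the targets are copied from the table above, the ~28-word split threshold comes from the rule above, and the function names are hypothetical rather than anything the Visual Planner exposes.

```python
# Word-count targets per delivery style for an 8-second scene (from the table above).
WORD_TARGETS = {
    "calm": (18, 22),
    "standard": (20, 24),
    "quick": (24, 28),
}
SPLIT_THRESHOLD = 28  # segments much longer than this need two scenes


def words_at_rate(seconds: float, words_per_minute: float) -> float:
    """Words that fit in `seconds` at a steady delivery rate."""
    return words_per_minute / 60 * seconds


def word_count(segment: str) -> int:
    """Whitespace-separated word count, a simple proxy for delivery time."""
    return len(segment.split())


def over_target(segment: str, style: str = "standard") -> bool:
    """True if the segment exceeds the upper target for the delivery style."""
    _, upper = WORD_TARGETS[style]
    return word_count(segment) > upper


def needs_split(segment: str, threshold: int = SPLIT_THRESHOLD) -> bool:
    """True if the segment is too long for a single 8-second scene."""
    return word_count(segment) > threshold


print(words_at_rate(8, 150))  # 20.0
```

At 150 words per minute an 8-second scene holds about 20 words, which is where the "standard conversational" row starts.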
Scene boundaries
The Visual Planner places scene breaks at natural beat boundaries in the script:
- At a paragraph break: The script's HOOK → BODY → CTA structure already gives natural breaks. Scene 01 = HOOK, Scene 02+ = BODY beats, last scene = CTA.
- At a meaning shift within a paragraph: "Brain fog. Mood swings. 3am wake-ups. They feel separate, but they're not." That's two distinct beats: the symptom cluster, then the "but they're connected" pivot. Two scenes.
- Where length forces a split: If a beat is too long for one Veo clip, it splits at the nearest grammatical pause (clause break, comma, sentence end).
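The length-based split can be sketched as a greedy pass over the words. This is an illustration under assumptions, not the Visual Planner's actual algorithm: it treats "nearest grammatical pause" as the last punctuation mark that still keeps the scene within budget, and it falls back to a hard cut when a stretch has no punctuation at all.

```python
import re

# A word that ends in a comma, semicolon, colon, or sentence-ending mark
# is treated as a grammatical pause.
PAUSE_END = re.compile(r"[,;:.!?]$")


def split_at_pauses(text: str, max_words: int = 28) -> list[str]:
    """Split text into scenes of at most max_words, cutting at pauses where possible."""
    words = text.split()
    scenes = []
    while len(words) > max_words:
        # Candidate cut points: just after any pause word within the budget.
        cuts = [i + 1 for i in range(max_words) if PAUSE_END.search(words[i])]
        cut = cuts[-1] if cuts else max_words  # no pause found: hard cut at the budget
        scenes.append(" ".join(words[:cut]))
        words = words[cut:]
    if words:
        scenes.append(" ".join(words))
    return scenes


print(split_at_pauses("one, two three four. five six seven", max_words=4))
# ['one, two three four.', 'five six seven']
```

Meaning shifts, like the "symptom cluster" versus "pivot" example above, are editorial judgments and are not captured by a punctuation rule like this one.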
Word count vs. delivery time: caveats
Word count is a proxy for delivery time, but it's not exact. Adjustments:
- Spanish is typically 10-20% longer than English, so translations may push a scene over budget.
- Pause-heavy delivery fits fewer words into the same amount of time.
- Emotional beats (especially confessional / vulnerable moments) need slower pacing, so fewer words.
- High-energy beats can fit more words.
If you notice a Veo clip's delivery sounds rushed, the scene's word count is probably too high for the energy level. Split it.
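To see how the Spanish adjustment plays out, here is a small sketch that applies the 10-20% figure to an English word count. It assumes the expansion applies directly to word count, which is a simplification, and the function name is hypothetical.

```python
def spanish_word_range(english_words: int) -> tuple[int, int]:
    """Estimated (low, high) word count after translation, assuming 10-20% growth.

    Integer arithmetic with ceiling division avoids floating-point rounding surprises.
    """
    low = -(-english_words * 110 // 100)
    high = -(-english_words * 120 // 100)
    return low, high


print(spanish_word_range(20))  # (22, 24)
print(spanish_word_range(26))  # (29, 32)
```

Under that assumption, a 26-word English scene can land above the 30-word maximum once translated, so it is a candidate for splitting or trimming.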
B-roll placement logic
B-roll cutaways interrupt speaking scenes. They're chosen at moments where:
| B-roll placement | When |
|---|---|
| Visual reinforcement | Speaker mentions a specific visual concept that's hard to imagine ("the wing-shaped molecule that does this...") |
| Emotional pause | A beat where the speaker pauses for effect — viewer's eye benefits from somewhere else to look |
| Mechanism explanation | Speaker is explaining HOW something works — show it instead of describe it |
| Product close-up | Speaker is introducing or referencing the product — show the product |
The Visual Planner chooses placements based on the B-roll density mode (High / Medium / Low / None).
Camera per scene
Each scene gets its own camera framing decision:
- Selfie / chest-up (most common) — talking head, intimate, casual
- Medium / waist-up — slightly more formal, allows hand gestures in frame
- Wide / full body — used sparingly, usually for transitional / contextual scenes
- Close-up / face-only — used for emotional peaks
- POV shot — first-person, hand reaching into frame holding the product (see Image Prompt Rules - POV)
Workflows usually settle on one primary framing with 1-2 deviations for emphasis. Constantly varying the framing reads as jumpy.
Pose per scene
Subject pose per scene typically varies subtly:
| Scene type | Pose |
|---|---|
| Hook | Slight forward lean, direct gaze, engaged |
| Body — explainer beat | Settled, balanced, gesturing as appropriate |
| Body — emphasis beat | Forward lean intensified |
| Body — confessional / vulnerable | Slightly more relaxed, weight shifted |
| CTA | Engaged again, energy bumped |
The Visual Planner specifies the pose per scene. The Image Prompter writes it into the image prompt.
Lighting per scene
For most workflows, lighting stays consistent across all scenes — same direction, same quality. Changing lighting mid-workflow reads as a new location, which usually isn't what you want.
Exceptions:
- A specific scene is meant to feel different (a darker / more intimate beat)
- The workflow spans multiple settings (e.g., kitchen for some scenes, bathroom for others)
- A transformation clip uses lighting change as the transformation
The default: same lighting across all speaking scenes.
Pacing decisions
The Visual Planner also decides the rhythm of the workflow:
- Average scene length: If most scenes are 7-8 seconds, you have steady pacing. If you mix 4-second beats with 8-second beats, you have rhythmic pacing (faster cuts feel more energetic).
- Hook duration: Usually 4-6 seconds. Long enough to deliver the hook line, short enough that the viewer commits to staying.
- CTA duration: Usually 4-6 seconds. Brief, because the viewer doesn't need a lot of time to decide.
- Body proportion: Body should be the bulk of the workflow. If hook + CTA exceed ⅓ of total runtime, the body is being squeezed.
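The body-proportion rule is easy to check by hand. A minimal sketch, assuming the first scene is the hook and the last is the CTA (as described under scene boundaries); the function and key names are illustrative only.

```python
def pacing_report(durations: list[float]) -> dict:
    """Summarize pacing for a list of scene durations in seconds.

    Assumes durations[0] is the hook and durations[-1] is the CTA.
    """
    if len(durations) < 2:
        raise ValueError("need at least a hook scene and a CTA scene")
    total = sum(durations)
    hook_cta_share = (durations[0] + durations[-1]) / total
    return {
        "total_seconds": total,
        "average_scene_seconds": total / len(durations),
        "hook_cta_share": hook_cta_share,
        # Body is being squeezed when hook + CTA exceed one third of the runtime.
        "body_squeezed": hook_cta_share > 1 / 3,
    }


report = pacing_report([5, 8, 8, 8, 8, 8, 8, 5])
print(report["body_squeezed"])  # False
```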
When the script doesn't fit the constraint
A 90-second script with a 30-word body sentence somewhere doesn't fit. The Visual Planner has to either:
- Split the sentence into two scenes (mechanical — split at a clause break)
- Trim the script (asks the Script Writer to tighten that sentence)
- Use B-roll over part of the line (the audio continues, but visually we cut to B-roll while the speaker reads the long sentence)
The Visual Planner picks one approach per situation. Splitting is the most common.
Storyboarding for video copy
When the workflow is a video copy, the storyboarding isn't decided fresh — it matches the source video's structure.
The contact-sheet analysis tells you:
- Where the source's hard cuts are (each cut starts a new scene)
- How long each scene runs (matches your scene durations)
- What framing the source uses per scene (matches your framing)
- Whether B-roll is present (matches your B-roll density)
Video copy storyboarding is mechanical mapping. From-scratch storyboarding is creative.
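As an illustration of how mechanical that mapping is, the sketch below turns a list of hard-cut timestamps into scene durations. The input format (cut times in seconds plus the total video length) is an assumption for the example; the actual contact-sheet analysis output may look different.

```python
def scenes_from_cuts(cut_times: list[float], video_length: float) -> list[float]:
    """Convert hard-cut timestamps (seconds) into per-scene durations.

    N cuts produce N + 1 scenes: one before the first cut and one after each cut.
    """
    boundaries = [0.0, *sorted(cut_times), video_length]
    return [round(end - start, 3) for start, end in zip(boundaries, boundaries[1:])]


print(scenes_from_cuts([5.0, 12.5, 20.0], 26.0))  # [5.0, 7.5, 7.5, 6.0]
```

Any resulting duration above 8 seconds would still run into the single-clip ceiling described at the top of this page.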
When you're ready
→ Next: The .nbflow Schema (light) — what's actually inside a workflow file, at a level useful for understanding what the agents produce.