Storyboarding Logic

How a script becomes a sequence of scenes — and the constraints that shape every storyboarding decision.

The fundamental constraint

1 Veo 3.1 clip = ~8 seconds

That's the hard ceiling. Every speaking scene in your storyboard is one Veo clip, so each scene's dialogue, pauses included, must fit in ~8 seconds.

A 60-second script becomes roughly 8 scenes of 7-8 seconds each. A 90-second script becomes roughly 12 scenes.
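The scene counts above follow from dividing the script's runtime by the clip ceiling and rounding up. A minimal sketch of that arithmetic, assuming a flat 8-second ceiling (the function name is illustrative, not part of any tool):

```python
import math

CLIP_SECONDS = 8  # approximate Veo 3.1 clip ceiling stated above


def estimate_scene_count(script_seconds: float, clip_seconds: float = CLIP_SECONDS) -> int:
    """Round up: a partial clip's worth of dialogue still needs its own scene."""
    return math.ceil(script_seconds / clip_seconds)


print(estimate_scene_count(60))  # 8
print(estimate_scene_count(90))  # 12
```

Real storyboards may land on a different count, since hooks and CTAs often run shorter than a full clip.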

Word budget per scene

In English, the typical conversational delivery rate is 150-170 words per minute (2.5-2.8 words per second).

For an 8-second scene:

Delivery style	Word count target
Calm / measured (e.g., warm naturopath)	18-22 words
Standard conversational	20-24 words
Quick / energetic	24-28 words
Maximum reasonable in 8s	30 words

If a body segment of the script runs well past ~28 words, it needs to be split into two scenes.

The Visual Planner does this math automatically when building the storyboard. You see the result; you don't need to count words yourself.
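For readers who want to see the math, here is a rough sketch of a word-budget check. It is not the Visual Planner's actual implementation: the targets are copied from the table above, while the function names and the "ok" / "tight" / "split" labels are assumptions made for illustration.

```python
# Word-count targets for an 8-second scene, copied from the table above.
WORD_TARGETS = {
    "calm": (18, 22),
    "standard": (20, 24),
    "quick": (24, 28),
}
MAX_WORDS = 30  # "maximum reasonable in 8s" from the table


def words_at_rate(words_per_minute: float, seconds: float = 8) -> float:
    """Words spoken in `seconds` at a given delivery rate."""
    return words_per_minute / 60 * seconds


def check_scene(text: str, style: str = "standard") -> str:
    """Classify a scene's dialogue as 'ok', 'tight', or 'split' (illustrative labels)."""
    n = len(text.split())  # whitespace word count, a rough proxy
    _, high = WORD_TARGETS[style]
    if n > MAX_WORDS:
        return "split"  # over the 8-second maximum
    if n > high:
        return "tight"  # over the style's target, but under the maximum
    return "ok"


print(words_at_rate(150))  # 20.0
```

At 150 words per minute an 8-second clip holds 20 words, which is where the "standard conversational" range begins.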

Scene boundaries

The Visual Planner places scene breaks at natural beat boundaries in the script:

At a paragraph break: The script's HOOK → BODY → CTA structure already gives natural breaks. Scene 01 = HOOK, Scene 02+ = BODY beats, last scene = CTA.
At a meaning shift within a paragraph: "Brain fog. Mood swings. 3am wake-ups. They feel separate — but they're not." That's two distinct beats: the symptom cluster, then the "but they're connected" pivot. Two scenes.
Where length forces a split: If a beat is too long for one Veo clip, it splits at the nearest grammatical pause (clause break, comma, sentence end).
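The length-based split can be sketched as follows. This is a simplified illustration, not the Visual Planner's logic: it treats punctuation as the only signal of a grammatical pause, and it picks the last pause that fits the budget rather than weighing meaning. The function name and the 28-word default are assumptions.

```python
import re

# A word that ends in a pause mark, optionally followed by a closing quote or bracket.
PAUSE = re.compile(r"[.,;:!?][\"')\]]*$")


def split_at_pause(text: str, max_words: int = 28) -> list[str]:
    """Split text into chunks of at most max_words, breaking after punctuation where possible."""
    words = text.split()
    chunks = []
    while len(words) > max_words:
        cut = None
        # Walk backwards from the budget limit to the last pause that still fits.
        for i in range(max_words, 0, -1):
            if PAUSE.search(words[i - 1]):
                cut = i
                break
        if cut is None:
            cut = max_words  # no pause found: fall back to a hard cut
        chunks.append(" ".join(words[:cut]))
        words = words[cut:]
    if words:
        chunks.append(" ".join(words))
    return chunks


print(split_at_pause("one two three, four five six seven. eight nine ten", max_words=5))
# ['one two three,', 'four five six seven.', 'eight nine ten']
```

A greedy split like this can leave a very short final chunk, so the result still benefits from a human read-through.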

Word count vs. delivery time: caveats

Word count is a proxy for delivery time, but it's not exact. Adjustments:

Spanish typically runs 10-20% longer than the equivalent English, so a translation may push a scene over budget.
Pause-heavy delivery fits fewer words into the same running time.
Emotional beats (especially confessional / vulnerable moments) need slower pacing, which means fewer words.
High-energy beats can fit more words.

If you notice a Veo clip's delivery sounds rushed, the scene's word count is probably too high for the energy level. Split it.
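The translation caveat can be checked before rendering. The sketch below assumes the 10-20% expansion range quoted above and the 30-word maximum from the word-budget table. The function names are hypothetical, and integer arithmetic is used to avoid floating-point rounding surprises.

```python
def translated_word_estimate(english_words: int, expansion_pct: int = 20) -> int:
    """Estimated word count after translation, rounded up (20% is the worst case quoted above)."""
    return (english_words * (100 + expansion_pct) + 99) // 100


def fits_after_translation(english_words: int, max_words: int = 30, expansion_pct: int = 20) -> bool:
    """True if the translated scene is still expected to fit the 8-second maximum."""
    return translated_word_estimate(english_words, expansion_pct) <= max_words


print(translated_word_estimate(24))  # 29
print(fits_after_translation(24))    # True
print(fits_after_translation(26))    # False
```

A 24-word English scene is likely to survive translation, while a 26-word scene is a candidate for splitting before it is rendered.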

B-roll placement logic

B-roll cutaways interrupt speaking scenes. They're chosen at moments where:

B-roll placement	When
Visual reinforcement	Speaker mentions a specific visual concept that's hard to imagine ("the wing-shaped molecule that does this...")
Emotional pause	A beat where the speaker pauses for effect — viewer's eye benefits from somewhere else to look
Mechanism explanation	Speaker is explaining HOW something works, so show it instead of describing it
Product close-up	Speaker is introducing or referencing the product — show the product

The Visual Planner chooses placements based on the B-roll density mode (High / Medium / Low / None).

Camera per scene

Each scene gets its own camera framing decision:

Selfie / chest-up (most common) — talking head, intimate, casual
Medium / waist-up — slightly more formal, allows hand gestures in frame
Wide / full body — used sparingly, usually for transitional / contextual scenes
Close-up / face-only — used for emotional peaks
POV shot — first-person, hand reaching into frame holding the product (see Image Prompt Rules - POV)

Workflows usually settle on one primary framing with 1-2 deviations for emphasis. Constantly varying the framing reads as jumpy.

Pose per scene

Subject pose per scene typically varies subtly:

Scene type	Pose
Hook	Slight forward lean, direct gaze, engaged
Body — explainer beat	Settled, balanced, gesturing as appropriate
Body — emphasis beat	Forward lean intensified
Body — confessional / vulnerable	Slightly more relaxed, weight shifted
CTA	Engaged again, energy bumped

The Visual Planner specifies the pose per scene. The Image Prompter writes it into the image prompt.

Lighting per scene

For most workflows, lighting stays consistent across all scenes — same direction, same quality. Changing lighting mid-workflow reads as a new location, which usually isn't what you want.

Exceptions:

A specific scene is meant to feel different (a darker / more intimate beat)
The workflow spans multiple settings (e.g., kitchen for some scenes, bathroom for others)
A transformation clip uses lighting change as the transformation

The default: same lighting across all speaking scenes.

Pacing decisions

The Visual Planner also decides the rhythm of the workflow:

Average scene length: If most scenes are 7-8 seconds, you have steady pacing. If you mix 4-second beats with 8-second beats, you have rhythmic pacing (faster cuts feel more energetic).
Hook duration: Usually 4-6 seconds. Long enough to deliver the hook line, short enough that the viewer commits to staying.
CTA duration: Usually 4-6 seconds. Brief — the viewer doesn't need a lot of time to decide.
Body proportion: Body should be the bulk of the workflow. If hook + CTA exceed ⅓ of total runtime, the body is being squeezed.
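The body-proportion rule is easy to verify by hand or in code. A minimal sketch, assuming you already have per-section durations in seconds (the function name is illustrative):

```python
def body_is_squeezed(hook_s: float, body_s: float, cta_s: float) -> bool:
    """True if hook + CTA take up more than one third of the total runtime."""
    total = hook_s + body_s + cta_s
    # Compare (hook + cta) > total / 3 without dividing.
    return (hook_s + cta_s) * 3 > total


print(body_is_squeezed(5, 50, 5))  # False: 10s of 60s is well under a third
print(body_is_squeezed(6, 20, 6))  # True: 12s of 32s is over a third
```

With 4-6 second hooks and CTAs, the rule only starts to bite on short workflows of roughly 30 seconds or less.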

When the script doesn't fit the constraint

A 90-second script with a body sentence that runs past 30 words doesn't fit, because that sentence is more than one ~8-second clip can carry. The Visual Planner has to either:

Split the sentence into two scenes (mechanical — split at a clause break)
Trim the script (asks the Script Writer to tighten that sentence)
Use B-roll over part of the line (the audio continues, but visually we cut to B-roll while the speaker reads the long sentence)

The Visual Planner picks one approach per situation. Splitting is the most common.

Storyboarding for video copy

When the workflow is a video copy, the storyboarding isn't decided fresh — it matches the source video's structure.

The contact-sheet analysis tells you:

Where the source's hard cuts are (each cut = one scene)
How long each scene runs (your scene durations match these)
What framing the source uses per scene (your framing matches it)
Whether B-roll is present (your B-roll density matches it)

Video copy storyboarding is mechanical mapping. From-scratch storyboarding is creative.
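The mechanical mapping from cuts to scenes can be sketched as below. The format of the contact-sheet analysis is not specified here, so the input (a list of hard-cut timestamps in seconds plus the source's total runtime) is an assumption, as is the function name.

```python
def scenes_from_cuts(cut_times: list[float], total_seconds: float) -> list[float]:
    """Turn hard-cut timestamps into per-scene durations (each cut starts a new scene)."""
    boundaries = [0.0, *sorted(cut_times), total_seconds]
    return [round(end - start, 2) for start, end in zip(boundaries, boundaries[1:])]


durations = scenes_from_cuts([5.0, 13.0, 20.5], total_seconds=26.0)
print(durations)  # [5.0, 8.0, 7.5, 5.5]

# Flag any source scene longer than the ~8-second clip ceiling.
too_long = [d for d in durations if d > 8]
print(too_long)  # []
```

Any duration that lands in `too_long` would need the same split treatment as an over-budget beat in a from-scratch storyboard.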

When you're ready

→ Next: The .nbflow Schema (light) — what's actually inside a workflow file, at a level useful for understanding what the agents produce.