
Storyboarding Logic

How a script becomes a sequence of scenes — and the constraints that shape every storyboarding decision.

The fundamental constraint

1 Veo 3.1 clip = ~8 seconds

That's the hard ceiling. Every speaking scene in your storyboard is one Veo clip — so every speaking scene must fit in ~8 seconds of dialogue + pause + delivery.

A 60-second script becomes roughly 8 scenes of 7-8 seconds each. A 90-second script becomes roughly 12 scenes.
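The scene counts above come from dividing the script runtime by the clip ceiling and rounding up. A minimal Python sketch of that arithmetic follows; `CLIP_SECONDS` and `estimate_scene_count` are illustrative names, not part of any tool described here.

```python
import math

CLIP_SECONDS = 8  # approximate Veo 3.1 clip ceiling stated above


def estimate_scene_count(script_seconds: float, clip_seconds: float = CLIP_SECONDS) -> int:
    """Return the minimum number of clips needed to cover the script runtime."""
    return math.ceil(script_seconds / clip_seconds)


print(estimate_scene_count(60))  # 8
print(estimate_scene_count(90))  # 12
```

Treat the result as a lower bound: any scene that runs shorter than the ceiling pushes the total count up.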

Word budget per scene

In English, the typical conversational delivery rate is 150-170 words per minute (2.5-2.8 words per second).

For an 8-second scene:

Delivery style | Word count target
Calm / measured (e.g., warm naturopath) | 18-22 words
Standard conversational | 20-24 words
Quick / energetic | 24-28 words
Maximum reasonable in 8s | 30 words

If a body segment of the script is much longer than ~28 words, it needs to be split into two scenes.

The Visual Planner does this math automatically when building the storyboard. You see the result; you don't need to count words yourself.
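For readers who want to check the numbers by hand, the budget is simply delivery rate times clip length. The sketch below is an assumption about how such a check could look, not the Visual Planner's actual implementation; `word_budget` and `needs_split` are hypothetical helpers, and the 28-word threshold is taken from the split rule above.

```python
def word_budget(seconds: float, words_per_minute: float) -> int:
    """Words that fit in `seconds` at the given delivery rate, rounded down."""
    return int(seconds * words_per_minute / 60)


def needs_split(segment: str, max_words: int = 28) -> bool:
    """True if the segment has more words than one scene can reasonably hold."""
    return len(segment.split()) > max_words


print(word_budget(8, 150))  # 20
print(word_budget(8, 170))  # 22
```

The two printed values bracket the conversational range quoted above (150-170 words per minute) for an 8-second scene.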

Scene boundaries

The Visual Planner places scene breaks at natural beat boundaries in the script:

At a paragraph break
The script's HOOK → BODY → CTA structure already gives natural breaks. Scene 01 = HOOK, Scene 02+ = BODY beats, last scene = CTA.
At a meaning shift within a paragraph
"Brain fog. Mood swings. 3am wake-ups. They feel separate — but they're not." That's two distinct beats: the symptom cluster, then the "but they're connected" pivot. Two scenes.
At a split forced by length
If a beat is too long for one Veo clip, it splits at the nearest grammatical pause (clause break, comma, sentence end).
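The length-driven split can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Visual Planner's algorithm: `split_beat` is a hypothetical helper, "nearest grammatical pause" is interpreted as the last punctuation mark that still fits inside the word budget, and a beat with no punctuation is cut at the budget itself.

```python
PAUSE_CHARS = (",", ";", ":", ".", "!", "?")


def split_beat(text: str, max_words: int = 28) -> list[str]:
    """Split a beat into scenes of at most max_words, cutting at punctuation when possible."""
    words = text.split()
    scenes = []
    while len(words) > max_words:
        cut = None
        # Scan backwards through the budget window for the last word ending in a pause.
        for i in range(max_words - 1, -1, -1):
            if words[i].rstrip("\"')").endswith(PAUSE_CHARS):
                cut = i + 1
                break
        if cut is None:
            cut = max_words  # assumption: no pause available, so cut at the budget
        scenes.append(" ".join(words[:cut]))
        words = words[cut:]
    if words:
        scenes.append(" ".join(words))
    return scenes
```

A 40-word beat with a comma after word 20 comes back as two 20-word scenes rather than a 28-word scene followed by a 12-word one, because the cut lands on the pause.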

Word count vs. delivery time — caveats

Word count is a proxy for delivery time, but it's not exact. Adjustments:

  • Spanish typically runs 10-20% longer than English, so translations may push a scene over budget.
  • Pause-heavy delivery fits fewer words into the same clip time.
  • Emotional beats (especially confessional or vulnerable moments) need slower pacing, so fewer words.
  • High-energy beats can fit more words.

If you notice a Veo clip's delivery sounds rushed, the scene's word count is probably too high for the energy level. Split it.
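The translation caveat is easy to quantify. The sketch below assumes the 10-20% figure applies to word count, which is an interpretation rather than something stated above; `translated_word_range` is a hypothetical helper.

```python
def translated_word_range(english_words: int, low_pct: int = 10, high_pct: int = 20) -> tuple[int, int]:
    """Estimate the word-count range of a translation that runs low_pct-high_pct% longer."""
    low = round(english_words * (100 + low_pct) / 100)
    high = round(english_words * (100 + high_pct) / 100)
    return low, high


low, high = translated_word_range(24)
print(low, high)   # 26 29
print(high > 28)   # True: the upper estimate exceeds a 28-word scene budget
```

In this example a 24-word English scene that fits comfortably may land at the edge of the budget once translated, which is the situation the first caveat warns about.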

B-roll placement logic

B-roll cutaways interrupt speaking scenes. They're chosen at moments where:

B-roll placement | When
Visual reinforcement | Speaker mentions a specific visual concept that's hard to imagine ("the wing-shaped molecule that does this...")
Emotional pause | A beat where the speaker pauses for effect, and the viewer's eye benefits from somewhere else to look
Mechanism explanation | Speaker is explaining HOW something works, so show it instead of describing it
Product close-up | Speaker is introducing or referencing the product, so show the product

The Visual Planner chooses placements based on the B-roll density mode (High / Medium / Low / None).

Camera per scene

Each scene gets its own camera framing decision:

  • Selfie / chest-up (most common) — talking head, intimate, casual
  • Medium / waist-up — slightly more formal, allows hand gestures in frame
  • Wide / full body — used sparingly, usually for transitional / contextual scenes
  • Close-up / face-only — used for emotional peaks
  • POV shot — first-person, hand reaching into frame holding the product (see Image Prompt Rules - POV)

Workflows usually settle on one primary framing with 1-2 deviations for emphasis. Constantly varying the framing reads as jumpy.

Pose per scene

Subject pose per scene typically varies subtly:

Scene type | Pose
Hook | Slight forward lean, direct gaze, engaged
Body: explainer beat | Settled, balanced, gesturing as appropriate
Body: emphasis beat | Forward lean intensified
Body: confessional / vulnerable | Slightly more relaxed, weight shifted
CTA | Engaged again, energy bumped

The Visual Planner specifies the pose per scene. The Image Prompter writes it into the image prompt.

Lighting per scene

For most workflows, lighting stays consistent across all scenes — same direction, same quality. Changing lighting mid-workflow reads as a new location, which usually isn't what you want.

Exceptions:

  • A specific scene is meant to feel different (a darker / more intimate beat)
  • The workflow spans multiple settings (e.g., kitchen for some scenes, bathroom for others)
  • A transformation clip uses lighting change as the transformation

The default: same lighting across all speaking scenes.

Pacing decisions

The Visual Planner also decides the rhythm of the workflow:

Average scene length
If most scenes are 7-8 seconds, you have steady pacing. If you mix 4-second beats with 8-second beats, you have rhythmic pacing (faster cuts feel more energetic).
Hook duration
Usually 4-6 seconds. Long enough to deliver the hook line, short enough that the viewer commits to staying.
CTA duration
Usually 4-6 seconds. Brief — the viewer doesn't need a lot of time to decide.
Body proportion
Body should be the bulk of the workflow. If hook + CTA exceed ⅓ of total runtime, the body is being squeezed.
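The body-proportion rule is a one-line check. A minimal sketch, with `body_is_squeezed` as a hypothetical helper name:

```python
def body_is_squeezed(hook_seconds: float, cta_seconds: float, total_seconds: float) -> bool:
    """True if hook + CTA take up more than one third of the total runtime."""
    return hook_seconds + cta_seconds > total_seconds / 3


print(body_is_squeezed(5, 5, 60))  # False: 10s of 60s
print(body_is_squeezed(6, 6, 24))  # True: 12s of 24s
```

Short workflows are where this bites: a 6-second hook and a 6-second CTA are fine in a 60-second piece but take half of a 24-second one.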

When the script doesn't fit the constraint

A 90-second script with a 30-word body sentence somewhere doesn't fit. The Visual Planner has to do one of the following:

  • Split the sentence into two scenes (mechanical — split at a clause break)
  • Trim the script (asks the Script Writer to tighten that sentence)
  • Use B-roll over part of the line (the audio continues, but visually we cut to B-roll while the speaker reads the long sentence)

The Visual Planner picks one approach per situation. Splitting is the most common.

Storyboarding for video copy

When the workflow is a video copy, the storyboarding isn't decided fresh — it matches the source video's structure.

The contact-sheet analysis tells you:

  • Where the source's hard cuts are (each cut starts a new scene)
  • How long each scene runs (sets your scene durations)
  • What framing the source uses per scene (sets your framing)
  • Whether B-roll is present (sets your B-roll density)

Video copy storyboarding is mechanical mapping. From-scratch storyboarding is creative.
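To illustrate how mechanical the mapping is, the sketch below turns a list of hard-cut timestamps into scene durations. The input format is an assumption (the contact-sheet analysis output is not specified here), and `scene_durations` is a hypothetical helper.

```python
def scene_durations(cut_times: list[float], total_seconds: float) -> list[float]:
    """Convert hard-cut timestamps (in seconds, excluding 0) into per-scene durations."""
    boundaries = [0.0] + sorted(cut_times) + [total_seconds]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


print(scene_durations([4.5, 12.0, 19.5], 24.0))  # [4.5, 7.5, 7.5, 4.5]
```

Three cuts in a 24-second source yield four scenes, and each duration carries over directly to the copied storyboard.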

When you're ready

Next: The .nbflow Schema (light) — what's actually inside a workflow file, at a level useful for understanding what the agents produce.