
Pipeline Variants — AI Self & Voice Cloning

Specialized pipeline configurations for when the standard "AI avatar + Veo TTS" setup doesn't fit. Two main variants worth knowing:

  • AI Self / AI Avatar — content featuring a recurring AI-generated character (not a real person)
  • Voice cloning — using a cloned voice instead of Veo's default text-to-speech

These variants layer on top of the standard pipeline rather than replacing it.

AI Self / Avatar content

The standard pipeline produces content where the avatar is one of your accounts' established avatars (Account A, Account B, etc.). Each account has its own face, archetype, and persona.

AI Self content is different — it features a recurring AI-generated character that's deliberately stylized as AI-generated. The character has a name, signature accessories (colored glasses, specific jacket, hairstyle), and consistent visual identity across many videos.

When to use AI Self content

  • Building a "creator" identity that's intentionally AI (transparent about it)
  • Creating content where the AI nature is the hook
  • Brand mascot-style content (the character represents the brand)
  • Content for AI-curious audiences who enjoy obvious AI styling

What changes in the pipeline

Stage 1 (Brief)
You provide character details — name, visual style, signature accessories. The brief includes "this is an AI Self character" so downstream agents know.
Stage 3 (Storyboard)
The Visual Planner adds consistency notes to every scene to keep the character's signature accessories and styling consistent (the colored glasses appear in every shot, the jacket reads the same across scenes, etc.).
Stage 4 (Image prompts)
The Image Prompter uses a character_reference block with face.preserve_original: true to lock identity tightly. This is stronger than normal avatar reference locking — appropriate because the character's identity is the brand.
Stage 6 (Compliance — TikTok specifically)
The AI Label Trick is especially important because the content visibly reads as AI. Without the label, TikTok's moderation flags it more aggressively.
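The Stage 4 identity lock above can be sketched as a prompt fragment. Only `character_reference` and `face.preserve_original` come from this pipeline; the other field names and values are illustrative assumptions, not the actual schema:

```python
# Sketch of a Stage 4 image-prompt fragment for an AI Self character.
# `character_reference` / `face.preserve_original` are from the pipeline;
# every other key here is a hypothetical placeholder.

def build_character_reference(ref_image: str) -> dict:
    """Build an identity-locking reference block for an AI Self character."""
    return {
        "character_reference": {
            "image": ref_image,              # canonical portrait of the character
            "face": {
                "preserve_original": True,   # hard identity lock: keep the exact face
            },
            # Hypothetical extras: signature styling repeated in every scene
            "signature_items": ["tinted orange glasses", "bomber jacket"],
        }
    }

prompt_block = build_character_reference("refs/ai_self_portrait.png")
```

The point of `preserve_original: true` is that it locks harder than normal avatar referencing; for an AI Self character, the face drifting between videos would break the brand.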

Avoiding generic "wellness influencer" look

A common pitfall with AI Self characters: they default to a generic, easy-to-generate appearance — same hair color and style as 10,000 other generated faces, no distinctive features, vague styling.

To avoid this:

  • Specific signature pieces — distinctively colored glasses, a particular jacket pattern, a unique haircut
  • Modern styling cues — trendy details that read as "current" rather than timeless / generic
  • Distinct color palette — the character has a recurring color in their wardrobe

The Image Prompter builds these into every scene's prompt so they show up consistently.
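A minimal sketch of that injection step, assuming the Image Prompter works on plain prompt strings (the function name and the signature string itself are hypothetical):

```python
# Sketch: fold the character's signature pieces into every scene prompt so
# they appear consistently. The styling string is a made-up example.

SIGNATURE = "tinted orange glasses, patterned bomber jacket, teal accent color"

def with_signature(scene_prompt: str) -> str:
    """Append the signature styling so it shows up in every shot."""
    return f"{scene_prompt}, {SIGNATURE}"

scene = with_signature("medium shot, character at a kitchen counter, morning light")
```

Appending the same styling clause to every scene is what keeps the character recognizable shot-to-shot, rather than relying on the model to remember it.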

Voice cloning

The standard pipeline uses Veo 3.1's built-in text-to-speech. The avatar's voice is generated by Veo from the dialogue line.

Voice cloning replaces Veo's TTS with a cloned voice — typically of a real person whose voice you've captured. The visuals are still Veo-generated, but the audio is the cloned voice.

When voice cloning matters

  • You have a creator with a distinctive voice and want their voice on AI-generated visuals
  • The client provided a voice sample they want consistent across all content
  • The brand has a voice talent contract that needs to be honored
  • You need a specific accent / pitch / energy that Veo TTS can't reliably produce

What changes in the pipeline

Stage 2 (Script)
Unchanged. The Script Writer writes the dialogue the same way.
Stage 3 (Storyboard)
Speaking profile still guides gesture intensity and pacing. Visuals are unchanged.
Stage 4 (Image prompts)
Unchanged. Image prompts are about visuals.
Stage 4 (Video prompts)
The Veo Prompter replaces DELIVERY instructions with a cloned voice ID reference. Veo generates the visuals but takes the audio from the cloned voice instead of generating speech.
Stage 5 (Generation)
The Generation Runner produces visuals via Veo, and pairs them with the cloned voice. The two streams sync in post-production (or via Veo's audio-track ingestion if your Veo install supports it).
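The Stage 4 swap can be sketched as a small transform on the Veo prompt. The source only says a cloned voice ID replaces the DELIVERY instructions; the exact key names below are assumptions about the prompt shape, not Veo's real schema:

```python
# Sketch of the voice-cloning change at Stage 4: drop the TTS delivery
# directions and reference a cloned voice instead. Key names are hypothetical.

def apply_cloned_voice(veo_prompt: dict, voice_id: str) -> dict:
    """Return a copy of the prompt with DELIVERY replaced by a voice reference."""
    prompt = dict(veo_prompt)             # don't mutate the original prompt
    prompt.pop("delivery", None)          # remove Veo TTS delivery instructions
    prompt["audio"] = {
        "source": "cloned_voice",         # signal: don't generate speech
        "voice_id": voice_id,
    }
    return prompt

base = {"scene": "talking-head, kitchen, natural light", "delivery": "warm, upbeat"}
cloned = apply_cloned_voice(base, "voice_abc123")
```

Keeping the transform non-destructive means the same storyboard output can still be rendered with default Veo TTS if the cloned-voice run fails.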

What stays external to the pipeline

The voice clone itself is set up outside the pipeline:

  1. You record voice samples of the target person
  2. A voice-cloning tool (Eleven Labs, Resemble AI, or similar) ingests the samples and produces a voice model
  3. The cloned voice has an ID that the pipeline references when generating

The pipeline doesn't manage voice clone training — that's a separate setup the user (or the voice-clone service) handles.
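As an illustration of step 3, here is how the pipeline might reference the voice ID against a cloning service, using ElevenLabs' text-to-speech HTTP endpoint as the example. We only assemble the request (no network call); check the service's current API reference before relying on exact paths, headers, or payload fields:

```python
# Sketch: reference a cloned voice by ID against a TTS service.
# Endpoint shape is based on ElevenLabs' public HTTP API; verify against
# their current docs. No request is actually sent here.

def build_tts_request(voice_id: str, text: str, api_key: str) -> dict:
    """Assemble an HTTP request synthesizing `text` in the cloned voice."""
    return {
        "method": "POST",
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "json": {"text": text},
    }

req = build_tts_request("voice_abc123", "Here's the part nobody tells you.", "API_KEY")
```

The only coupling point the pipeline needs is that `voice_id` string; everything upstream of it (sample recording, model training) stays with the voice-clone service.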

Mouth-sync considerations

When using a cloned voice, you have an extra constraint: the mouth movement in the Veo clip must sync to the cloned voice's pacing.

Two approaches:

Pre-render the audio, then generate the visuals to match
Generate the cloned voice audio first. Pass the audio timing to Veo as a constraint. Veo produces visuals that sync to the audio. Best fit when supported.
Generate visuals first, then time-stretch the audio to match
Veo produces visuals with its default TTS timing. In post, replace the TTS audio with cloned voice audio, time-stretching as needed. Looser but works in any pipeline.

The right approach depends on what your Veo installation supports.
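The second approach above can be sketched with ffmpeg's `atempo` audio filter. We only construct the command here (file names are placeholders, and nothing is executed); note that a single `atempo` instance accepts roughly 0.5–2.0, so larger stretches need chained instances:

```python
# Sketch of approach 2: time-stretch the cloned-voice audio to match the
# Veo clip's duration using ffmpeg's `atempo` filter. Command is built,
# not run; file names are placeholders.

def atempo_command(audio_dur: float, video_dur: float,
                   src: str = "cloned_voice.wav",
                   dst: str = "synced_voice.wav") -> list[str]:
    """ffmpeg command that stretches `src` to last `video_dur` seconds."""
    ratio = audio_dur / video_dur     # >1 speeds audio up, <1 slows it down
    # One atempo instance handles ~0.5-2.0; chain instances for bigger ratios.
    return ["ffmpeg", "-i", src, "-filter:a", f"atempo={ratio:.4f}", dst]

cmd = atempo_command(audio_dur=10.0, video_dur=8.0)   # 10s voice -> 8s clip
```

Small ratios (within a few percent of 1.0) are usually inaudible; if the computed ratio is far from 1.0, the script pacing is off and it's better to regenerate the clip than to stretch the voice.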

Combining AI Self with voice cloning

These two variants stack. You can have:

  • An AI Self character with a cloned voice (an AI-generated face + a real person's cloned voice)
  • A standard avatar with a cloned voice (account avatar's face + voice talent's cloned voice)

Each combination has its own tradeoffs. Most workflows use neither — standard pipeline, real account avatars with Veo TTS.

Pick a variant when the brief actually requires it. Don't add complexity for its own sake.

Other pipeline variants you might encounter

Less common but worth knowing:

Storytelling / confessional format
Same pipeline mechanics; structurally the script is longer and the visuals are tighter. No different from a standard from-scratch workflow.
Quick-take CTA format
15-second clips, single scene, direct CTA. Same pipeline, just a much shorter script.
Multi-character scenes
Two avatars in the same scene. Adds complexity to the image prompt (managing two character references) and Veo prompt (managing two speakers). Rarely worth the complication.
B-Roll Recipe format
Recipe demo where the background is the recipe action and the avatar narrates as voiceover. Combines PiP format with per-step sequences.

When you're ready

Next: avoid-ai-writing in Practice — using the avoid-ai-writing skill to clean up scripts that read as AI.