Step 5: Audio

Configure voice (style, pacing, language, accent) and background music for your video.

Intermediate · Creator · 4 min read · Updated 2026-06-12

What is the Audio step?

The Audio step controls everything you hear in your video: the spoken narration and the background music. Toggle voice on or off, then choose a language, an accent (English only), a delivery style, and pacing. Next, add AI-generated background music by describing it, and balance the music volume against the voice. This step is optional: the defaults produce a naturally voiced video with subtle music.

Enable Voice
Toggle spoken audio on or off. Off creates a silent or music-only video.
Voice Source
Character-Tied uses a voice matching your selected character. This is the default.
Language
The spoken language: English, Spanish, French, German, Italian, Portuguese, Hindi, Japanese, Mandarin Chinese, Korean, Arabic, or Russian.
Accent
For English, choose an accent variant (e.g., American). The accent option is hidden for other languages.
Delivery Style
The tone and energy: Conversational, Energetic, Calm, Authoritative, or Whispery.
Pacing
Speaking speed: Slow, Normal, or Fast.
Music
AI-generated background music. Describe the music you want in the prompt box, or switch it off for no music.
Volume Mix
Balance between background music and speech. Default is 30% music — optimized for speech clarity.
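The options above can be pictured as a single settings record. The sketch below is purely illustrative: this article does not describe an API, so the class name, field names, and the `problems` helper are all hypothetical, and the defaults for language, delivery style, and pacing are assumptions (only voice on, Character-Tied, and 30% music are stated as defaults here).

```python
from dataclasses import dataclass
from typing import List, Optional

# Option values copied from the list above.
LANGUAGES = {
    "English", "Spanish", "French", "German", "Italian", "Portuguese",
    "Hindi", "Japanese", "Mandarin Chinese", "Korean", "Arabic", "Russian",
}
DELIVERY_STYLES = {"Conversational", "Energetic", "Calm", "Authoritative", "Whispery"}
PACING_OPTIONS = {"Slow", "Normal", "Fast"}


@dataclass
class AudioSettings:
    """Hypothetical model of the Audio step; not a real product API."""
    enable_voice: bool = True               # documented default: voice on
    voice_source: str = "Character-Tied"    # documented default
    language: str = "English"               # assumed default
    accent: Optional[str] = None            # only meaningful for English
    delivery_style: str = "Conversational"  # assumed default
    pacing: str = "Normal"                  # assumed default
    music_enabled: bool = True
    music_prompt: str = ""                  # empty prompt: subtle fallback music
    music_volume_pct: int = 30              # documented default

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty means consistent."""
        found = []
        if self.language not in LANGUAGES:
            found.append(f"unsupported language: {self.language}")
        if self.accent is not None and self.language != "English":
            found.append("accent applies to English only")
        if self.delivery_style not in DELIVERY_STYLES:
            found.append(f"unknown delivery style: {self.delivery_style}")
        if self.pacing not in PACING_OPTIONS:
            found.append(f"unknown pacing: {self.pacing}")
        if not 0 <= self.music_volume_pct <= 100:
            found.append("music volume must be between 0 and 100")
        return found
```

With these assumptions, `AudioSettings().problems()` returns an empty list, while `AudioSettings(language="French", accent="American").problems()` reports that the accent applies to English only, mirroring how the Accent option is hidden for other languages.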

Configuring Audio

  1. Set the voice

    Leave Enable Voice on (default) for spoken audio tied to your character, or toggle it off for a music-only video.

  2. Choose language and accent

    Pick the spoken language from 12 options. For English you can also pick an accent.

  3. Pick delivery style and pacing

    Choose the tone (Conversational, Energetic, Calm, Authoritative, Whispery) and speed (Slow, Normal, Fast).

  4. Describe your music

    In the Add Music & Mix section, describe the background music you want (e.g., 'upbeat acoustic with light percussion'), or turn music off.

  5. Balance the mix

    Use the volume sliders to mix music against speech. The 30% default works well; nudge louder for social content, quieter for educational content.
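To get a feel for what a music percentage means relative to speech, the sketch below converts a slider value to decibels. It assumes the slider is a linear amplitude scale with speech at full level, which this article does not confirm; the function name `music_gain_db` is hypothetical.

```python
import math


def music_gain_db(music_pct: float) -> float:
    """Convert a music level in percent to gain in dB relative to full level.

    Assumption: the slider is a linear amplitude scale (100% = 0 dB).
    """
    if not 0 <= music_pct <= 100:
        raise ValueError("music_pct must be between 0 and 100")
    if music_pct == 0:
        return float("-inf")  # music fully muted
    return 20 * math.log10(music_pct / 100)


# Under this assumption the 30% default sits roughly 10.5 dB below full level.
print(round(music_gain_db(30), 1))  # -10.5
```

Under the same assumption, halving the slider value lowers the music by about 6 dB, so small nudges for social or educational content change the balance noticeably without burying the voice.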

Note

Voice and music are generated natively by the AI video models as part of the video — there's no separate audio pipeline, so the mix is coherent in one pass.

Tip

Leaving the music prompt empty is fine — the AI falls back to subtle background music matched to your volume setting.
