Step 5: Audio

Configure voice (style, pacing, language, accent) and background music for your video.

Intermediate | Creator | 4 min read | Updated 2026-06-12

What is the Audio step?

The Audio step controls everything you hear in your video — both the spoken narration and the background music. Toggle voice on or off, choose a language, accent (for English), delivery style, and pacing. Then add AI-generated background music by describing it, and balance the music volume against the voice. This step is optional — defaults produce a natural voiced video with subtle music.

Enable Voice: Toggle spoken audio on or off. Off creates a silent or music-only video.
Voice Source: Character-Tied uses a voice matching your selected character. This is the default.
Language: The spoken language: English, Spanish, French, German, Italian, Portuguese, Hindi, Japanese, Mandarin Chinese, Korean, Arabic, or Russian.
Accent: For English, choose an accent variant (e.g., American). The accent option is hidden for other languages.
Delivery Style: The tone and energy: Conversational, Energetic, Calm, Authoritative, or Whispery.
Pacing: Speaking speed: Slow, Normal, or Fast.
Music: AI-generated background music. Describe the music you want in the prompt box, or switch it off for no music.
Volume Mix: Balance between background music and speech. Default is 30% music — optimized for speech clarity.
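
As a rough illustration, the options above can be written down as one settings object. This is a minimal sketch, not the product's API: the class name `AudioSettings`, its field names, and the `validate` helper are hypothetical. The defaults for language, delivery style, and pacing are assumptions. Only the voice-on, Character-Tied, and 30% music defaults come from this guide.

```python
from dataclasses import dataclass
from typing import List, Optional

# Option lists copied from the guide above.
LANGUAGES = (
    "English", "Spanish", "French", "German", "Italian", "Portuguese",
    "Hindi", "Japanese", "Mandarin Chinese", "Korean", "Arabic", "Russian",
)
DELIVERY_STYLES = ("Conversational", "Energetic", "Calm", "Authoritative", "Whispery")
PACINGS = ("Slow", "Normal", "Fast")


@dataclass
class AudioSettings:
    """Hypothetical container for the Audio step options (not a product API)."""

    enable_voice: bool = True               # documented default: voice on
    voice_source: str = "Character-Tied"    # documented default
    language: str = "English"               # assumed default
    accent: Optional[str] = None            # only meaningful for English
    delivery_style: str = "Conversational"  # assumed default
    pacing: str = "Normal"                  # assumed default
    music_enabled: bool = True              # music can be switched off
    music_prompt: str = ""                  # empty prompt falls back to subtle music
    music_volume_pct: int = 30              # documented default: 30% music

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the settings are consistent."""
        problems: List[str] = []
        if self.language not in LANGUAGES:
            problems.append(f"unsupported language: {self.language}")
        if self.accent is not None and self.language != "English":
            problems.append("accent applies to English only")
        if self.delivery_style not in DELIVERY_STYLES:
            problems.append(f"unknown delivery style: {self.delivery_style}")
        if self.pacing not in PACINGS:
            problems.append(f"unknown pacing: {self.pacing}")
        if not 0 <= self.music_volume_pct <= 100:
            problems.append("music volume must be between 0 and 100")
        return problems


settings = AudioSettings(
    language="Spanish",
    delivery_style="Calm",
    music_prompt="upbeat acoustic with light percussion",
)
print(settings.validate())  # prints []
```

Setting an accent together with a non-English language is reported as a problem, mirroring how the accent option is hidden for other languages.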

Configuring Audio

1. Set the voice
Leave Enable Voice on (the default) for spoken audio tied to your character, or toggle it off for a silent or music-only video.
2. Choose language and accent
Pick the spoken language from the 12 options. For English, you can also pick an accent.
3. Pick delivery style and pacing
Choose the tone (Conversational, Energetic, Calm, Authoritative, or Whispery) and the speed (Slow, Normal, or Fast).
4. Describe your music
In the Add Music & Mix section, describe the background music you want (e.g., 'upbeat acoustic with light percussion'), or turn music off.
5. Balance the mix
Use the volume sliders to mix music against speech. The 30% music default works well for most videos; nudge the music louder for social content and quieter for educational content, where speech clarity matters most.
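
To get a feel for what a music percentage means in loudness terms, the sketch below converts it to a gain in decibels. It assumes the percentage acts as a linear amplitude scale on the music track. This guide does not say how the slider maps to loudness, so treat the numbers as illustrative only; `music_gain_db` is a hypothetical helper.

```python
import math


def music_gain_db(music_pct: float) -> float:
    """Convert a music level in percent to a gain in dB.

    Assumption: the percentage is a linear amplitude factor
    (100% = unchanged, 0% = silent). The real mapping is not documented.
    """
    if not 0 <= music_pct <= 100:
        raise ValueError("music_pct must be between 0 and 100")
    if music_pct == 0:
        return float("-inf")  # silence has no finite dB value
    return 20 * math.log10(music_pct / 100)


print(round(music_gain_db(30), 1))  # prints -10.5
```

Under this assumption, the 30% default places the music roughly 10 dB below its full level, which is consistent with the guide's emphasis on speech clarity.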

Note

Voice and music are generated natively by the AI video models as part of the video — there's no separate audio pipeline, so the mix is coherent in one pass.

Tip

Leaving the music prompt empty is fine — the AI falls back to subtle background music matched to your volume setting.

Related Guides

Step 4: Concept & Script

Workflow Best Practices
