IndexTTS2 Voice Cloning with Emotion Control
Emotion Control
text to speech
voice cloning
Nodes & Models
LoadAudio
Note
WorkflowGraphics
IndexTTS2Simple
IndexTTS2SaveAudio
Voice cloning that goes beyond matching a speaker. IndexTTS2 takes a reference audio clip, copies the voice, and lets you control how the output sounds emotionally: calm, excited, tense. No re-recording needed.
Upload a short reference clip of the voice you want to clone, type your script, and run. The output is a high-quality WAV file in that voice. Want to push further? Swap in a second audio clip to drive the emotional tone, or use a text or vector-based emotion input to dial in exactly the feeling you need.
How do you use IndexTTS2 for voice cloning?
Upload a reference audio clip of the voice you want to clone, type your script into the text field, and run. IndexTTS2 outputs a WAV file in that voice. For emotion control, connect a second audio clip or an emotion vector to shape the delivery.
Reference audio (voice to clone)
This is the voice. A clean 5 to 30 second clip of the target speaker works well. Cleaner audio with less background noise gives the model a stronger signal to clone from. Avoid clips with music, reverb, or multiple speakers.
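If you want to confirm a clip falls in the 5 to 30 second range before uploading it, a quick local check along these lines may help. This is a minimal sketch using Python's standard `wave` module, so it assumes the clip is an uncompressed PCM WAV file. The function names and the idea of a pre-upload check are illustrative and are not part of the workflow itself. It checks length only, and cannot tell you whether the clip is noisy.

```python
import wave

# Bounds taken from the guidance above; adjust to taste.
MIN_SECONDS = 5.0
MAX_SECONDS = 30.0


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_in_recommended_range(seconds, low=MIN_SECONDS, high=MAX_SECONDS):
    """True if the duration falls inside the suggested reference-clip range."""
    return low <= seconds <= high
```

For example, `is_in_recommended_range(clip_duration_seconds("reference.wav"))` returns `True` for a 12 second clip, where `reference.wav` is a placeholder file name.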
Text
Your script goes here. There is no hard length limit, but shorter passages tend to produce cleaner results. If you have a long script, break it into sections and run them separately.
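One way to break a long script into sections is to group whole sentences under a character budget, so no section ends mid-sentence. The sketch below is an illustration only: the `split_script` name and the 300-character default are assumptions, since the source gives no specific section length. Each returned chunk would then be pasted into the text field and run on its own.

```python
import re


def split_script(text, max_chars=300):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    The 300-character default is an arbitrary starting point, not a
    documented limit.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries rather than at a fixed character count keeps each section a natural unit of speech, which should make the pieces easier to join afterwards.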
Emotion control (three options)
The builder's note lays out the three modes clearly:
No emotion input connected: the output matches the emotional tone already present in the reference clip. This is the fastest path.
Emotion audio connected: upload a second audio clip that carries the emotional tone you want. The model clones the voice from the first clip and the feeling from the second.
Emotion vector or Emotion From Text node: dial in emotion precisely without needing a second recording. Useful when you want to specify delivery without hunting for a matching audio sample.
Start with the reference-only mode. Add emotion control once you have the base voice sounding right.
FP16
Leave this off. According to the builder's note, off produces better-quality output.
Output
WAV at 320k, pcm16 by default. Ready to drop into any audio editor or video timeline.
What is IndexTTS2 good for?
It is built for anyone who needs a specific voice without access to that speaker. Narration, character dialogue, dubbing, and content localization are all strong fits. The emotion control makes it useful beyond basic voice cloning.
If you are producing a short film, game, or animated series and need consistent character voices across many lines, this is a much faster path than recording sessions. Clone the voice once, script as many lines as you need.
Localization and dubbing work well too. Clone the original speaker's voice, write the translated script, and the output keeps the voice consistent across languages.
Where it has limits: very short reference clips or noisy recordings produce weaker clones. The model needs a clean, reasonably long sample to capture the full character of a voice. For highly expressive or singing voices, results vary.
FAQ
How long does the reference audio clip need to be for IndexTTS2?
A clean clip of 5 to 30 seconds is the practical range. Longer clips give the model more to work with, but quality of the recording matters more than length. One clear sentence from a quiet environment often beats 60 seconds of noisy audio.
Can I control the emotion of the cloned voice in IndexTTS2?
Yes, in three ways. Leave emotion inputs disconnected and the model mirrors the tone from the reference clip. Connect a second audio clip to transfer emotion from that recording. Or connect an Emotion Vector or Emotion From Text node to specify delivery without a second audio file.
What file format does IndexTTS2 output?
WAV by default, at 320k and pcm16. This is uncompressed audio ready for editing. You can convert it downstream to MP3 or any other format in your audio editor.
Is IndexTTS2 good for long-form narration?
It works, but break long scripts into shorter segments and run them separately. Shorter passages produce more consistent pacing and intonation. Stitch the clips together in post.
How do I run IndexTTS2 voice cloning online?
You can run it on Floyo. No installation, no setup. Open the workflow in your browser, upload your reference audio, type your script, and hit run.
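For the long-form case above, where segments are generated separately and stitched in post, an audio editor is the usual tool. If you would rather join the files with a script, the following is a minimal sketch using Python's standard `wave` module. It assumes every segment is a PCM WAV saved with the same channel count, sample width, and sample rate, which should hold when all segments come from the same workflow settings. The `stitch_wavs` name and the file names in the usage note are hypothetical.

```python
import wave


def stitch_wavs(input_paths, output_path):
    """Concatenate PCM WAV files that share channels, sample width, and rate."""
    if not input_paths:
        raise ValueError("no input files given")

    expected = None  # (channels, sample width in bytes, frame rate)
    frames = []
    for path in input_paths:
        with wave.open(path, "rb") as clip:
            fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
            if expected is None:
                expected = fmt
            elif fmt != expected:
                raise ValueError(f"{path} has format {fmt}, expected {expected}")
            frames.append(clip.readframes(clip.getnframes()))

    # Write only after every input has been read and validated.
    with wave.open(output_path, "wb") as out:
        out.setnchannels(expected[0])
        out.setsampwidth(expected[1])
        out.setframerate(expected[2])
        for chunk in frames:
            out.writeframes(chunk)
```

A call such as `stitch_wavs(["part1.wav", "part2.wav"], "narration.wav")` joins the segments back to back with no gap. The sketch holds all audio in memory and adds no pauses or crossfades between segments, so an editor remains the better choice when timing needs adjusting.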

