IndexTTS2 Voice Cloning with Emotion Control

Emotion Control

text to speech

voice cloning

Generates in about 1 min 2 secs

floyoofficial

Nodes & Models

ComfyUI Official

LoadAudio

Note

WorkflowGraphics

IndexTTS2Simple

IndexTTS2SaveAudio

Voice cloning that goes beyond matching a speaker. IndexTTS2 takes a reference audio clip, copies the voice, and lets you control how the output sounds emotionally: calm, excited, tense. No re-recording needed.

Upload a short reference clip of the voice you want to clone, type your script, and run. The output is a high-quality WAV file in that voice. Want to push further? Swap in a second audio clip to drive the emotional tone, or use a text or vector-based emotion input to dial in exactly the feeling you need.

How do you use IndexTTS2 for voice cloning?

Upload a reference audio clip of the voice you want to clone, type your script into the text field, and run. IndexTTS2 outputs a WAV file in that voice. For emotion control, connect a second audio clip or an emotion vector to shape the delivery.

Reference audio (voice to clone)

This is the voice. A clean 5 to 30 second clip of the target speaker works well. Cleaner audio with less background noise gives the model a stronger signal to clone from. Avoid clips with music, reverb, or multiple speakers.
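If you want to check a reference clip before uploading it, the duration guideline above can be tested locally. The following is a minimal sketch using Python's standard `wave` module. It assumes the clip is an uncompressed PCM WAV file, and the function names and constants are illustrative, not part of the workflow.

```python
import wave

# Bounds taken from the 5 to 30 second guideline in the text.
MIN_SECONDS = 5.0
MAX_SECONDS = 30.0


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def in_recommended_range(seconds):
    """True if the duration falls inside the suggested 5 to 30 second range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

This only checks length. It says nothing about background noise, music, or reverb, which still need a listen.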

Text

Your script goes here. There is no hard length limit, but shorter passages tend to produce cleaner results. If you have a long script, break it into sections and run them separately.
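One way to break a long script into sections is to group whole sentences until a character budget is reached. The sketch below is an illustration only: the `split_script` helper and the 300-character default are assumptions, not values recommended by the workflow, so tune the limit to whatever segment length sounds best for your voice.

```python
import re


def split_script(script, max_chars=300):
    """Group sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-sentence. max_chars=300 is an arbitrary illustrative default.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be pasted into the text field and run as its own generation.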

Emotion control (three options)

The builder's note lays out the three modes clearly:

No emotion input connected: the output matches the emotional tone already present in the reference clip. This is the fastest path.

Emotion audio connected: upload a second audio clip that carries the emotional tone you want. The model clones the voice from the first clip and the feeling from the second.

Emotion vector or Emotion From Text node: dial in emotion precisely without needing a second recording. Useful when you want to specify delivery without hunting for a matching audio sample.

Start with the reference-only mode. Add emotion control once you have the base voice sounding right.
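The three modes can be summarized as a small decision helper. This is a hypothetical sketch for reasoning about which inputs to connect. The function and its parameter names are invented for illustration and do not correspond to any node's actual API. Because the text does not say what happens when several emotion inputs are connected at once, the sketch treats that case as an error rather than guessing a precedence.

```python
def emotion_mode(emotion_audio=None, emotion_vector=None, emotion_text=None):
    """Name the emotion mode implied by which inputs are provided.

    Hypothetical helper: argument names are illustrative only.
    """
    has_audio = emotion_audio is not None
    # Emotion vector and Emotion From Text are presented as one option.
    has_spec = emotion_vector is not None or emotion_text is not None

    if has_audio and has_spec:
        # Assumption: the source does not define precedence, so refuse.
        raise ValueError("connect either emotion audio or a vector/text input")
    if has_audio:
        return "emotion_audio"   # voice from clip 1, feeling from clip 2
    if has_spec:
        return "emotion_spec"    # emotion set by vector or text
    return "reference"           # tone mirrors the reference clip


# Usage: nothing connected means the reference-only mode.
print(emotion_mode())
```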

FP16

Leave this off. According to the builder's note, disabling FP16 produces better-quality output.

Output

WAV at 320k, pcm16 by default. Ready to drop into any audio editor or video timeline.

What is IndexTTS2 good for?

It is built for anyone who needs a specific voice without access to that speaker. Narration, character dialogue, dubbing, and content localization are all strong fits. The emotion control makes it useful beyond basic voice cloning.

If you are producing a short film, game, or animated series and need consistent character voices across many lines, this is a much faster path than recording sessions. Clone the voice once, script as many lines as you need.

Localization and dubbing work well too. Clone the original speaker's voice, write the translated script, and the output keeps the voice consistent across languages.

Where it has limits: very short reference clips or noisy recordings produce weaker clones. The model needs a clean, reasonably long sample to capture the full character of a voice. For highly expressive or singing voices, results vary.

FAQ

How long does the reference audio clip need to be for IndexTTS2?

A clean clip of 5 to 30 seconds is the practical range. Longer clips give the model more to work with, but quality of the recording matters more than length. One clear sentence from a quiet environment often beats 60 seconds of noisy audio.

Can I control the emotion of the cloned voice in IndexTTS2?

Yes, in three ways. Leave emotion inputs disconnected and the model mirrors the tone from the reference clip. Connect a second audio clip to transfer emotion from that recording. Or connect an Emotion Vector or Emotion From Text node to specify delivery without a second audio file.

What file format does IndexTTS2 output?

WAV by default, at 320k and pcm16. This is uncompressed audio ready for editing. You can convert it downstream to MP3 or any other format in your audio editor.

Is IndexTTS2 good for long-form narration?

It works, but break long scripts into shorter segments and run them separately. Shorter passages produce more consistent pacing and intonation. Stitch the clips together in post.
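If you would rather stitch the segments with a script than in an audio editor, a minimal sketch with Python's standard `wave` module looks like this. It assumes every segment is a PCM WAV with the same channel count, sample width, and sample rate, which should hold when all segments come from the same workflow settings. The `concat_wavs` name and the file names in the usage note are illustrative.

```python
import wave


def concat_wavs(segment_paths, output_path):
    """Concatenate PCM WAV files that share one format into a single file."""
    fmt = None
    frames = []
    for path in segment_paths:
        with wave.open(path, "rb") as wav:
            this_fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                # Mixing formats would corrupt the output, so stop early.
                raise ValueError(f"{path} has format {this_fmt}, expected {fmt}")
            frames.append(wav.readframes(wav.getnframes()))
    if fmt is None:
        raise ValueError("no segments given")

    with wave.open(output_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        for chunk in frames:
            out.writeframes(chunk)
```

For example, `concat_wavs(["part1.wav", "part2.wav"], "narration.wav")` joins two segments back to back with no gap. Add silence or crossfades in an editor if the cuts sound abrupt.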

How do I run IndexTTS2 voice cloning online?

You can run it on Floyo. No installation, no setup. Open the workflow in your browser, upload your reference audio, type your script, and hit run.
