COMMUNITY PAGE

Run Chatterbox on Floyo

AI AUDIO GENERATION

Run Chatterbox on Floyo

Resemble AI's open-source TTS that outperforms ElevenLabs in blind tests. Zero-shot voice cloning from 5 seconds, emotion exaggeration control, paralinguistic tags, and 23 languages. MIT licensed.

Run Resemble AI's Chatterbox through ComfyUI in your browser. No API key, no installs, no local GPU.

Models

0.5B / Multilingual / Turbo

Languages

23

Voice Cloning

Zero-shot from 5 seconds

License

MIT (open source)

Try Chatterbox Now → Browse All Models

No installation. Runs in browser. Updated April 2026.

What do you get?

Chatterbox is Resemble AI's family of open-source TTS models. It comes in three variants: the original 0.5B (high quality with emotion control), Multilingual (23 languages with voice cloning), and Turbo (350M parameters with single-step decoding for maximum speed). It outperformed ElevenLabs with a 63.75% preference rate in blind A/B tests and is the first open-source TTS with emotion exaggeration control. It offers zero-shot voice cloning from 5 seconds of audio, paralinguistic tags for [laugh], [cough], and [sigh], and built-in PerTh watermarking. With over 1 million downloads on HuggingFace, it is MIT licensed and available as ComfyUI nodes on Floyo.

CHATTERBOX WORKFLOWS ON FLOYO

Chatterbox Text to Speech

Voice Changer Using TTS Audio Suite

What is Chatterbox?

Chatterbox is Resemble AI's family of open-source text-to-speech models, first released in 2025. Built by a three-person team, it achieved what most TTS developers considered unlikely: an open-source model that outperforms ElevenLabs in blind evaluation tests. Over 1 million downloads on HuggingFace and 11,000+ GitHub stars confirm it resonated with the developer community.

The family includes three models. Chatterbox 0.5B is the original: high-quality TTS with emotion exaggeration control and zero-shot voice cloning. Chatterbox Multilingual extends support to 23 languages with voice cloning in each. Chatterbox Turbo is the speed variant: 350M parameters with a distilled single-step decoder that reduces generation from 10 diffusion steps to one.

In blind A/B testing conducted through Podonos, Chatterbox achieved a 63.75% user preference rate against ElevenLabs. Both systems received identical text inputs and 7-20 second voice reference clips with no prompt engineering. The test evaluated zero-shot performance, which is the hardest scenario for any TTS model.

Two features set Chatterbox apart from other open-source TTS. First: emotion exaggeration control. A single parameter lets you dial expressiveness from monotone to dramatically emotional. No other open-source TTS has this. Second: paralinguistic tags. Write [laugh], [cough], [chuckle], [sigh] in your text and the model renders them naturally. Turbo supports these natively.

On Floyo, Chatterbox runs through native ComfyUI nodes on H100 NVL GPUs. Two workflows are available: direct text-to-speech, and a voice changer pipeline using the TTS Audio Suite. No model downloads, no Python environment, no local GPU required.

What are Chatterbox's technical specifications?

Chatterbox is a family of three models: the original 0.5B with emotion control, Multilingual with 23 languages, and Turbo with 350M parameters and single-step decoding. All share zero-shot voice cloning from 5 seconds, built-in PerTh watermarking, and MIT licensing. The 0.5B backbone was trained on 500k hours of curated speech data.

Developer: Resemble AI
Chatterbox 0.5B: Original model, high quality, emotion exaggeration control, zero-shot cloning
Chatterbox Multilingual: 23 languages with voice cloning in each
Chatterbox Turbo: 350M parameters, single-step decoder, native paralinguistic tags
Training Data: 500k hours of curated speech
Voice Cloning: Zero-shot from 5 seconds of audio (7-20 seconds for best results)
Emotion Control: Emotion exaggeration parameter (monotone to dramatically expressive)
Paralinguistic Tags: [laugh], [cough], [chuckle], [sigh], and more (native in Turbo)
Languages: 23 (Multilingual variant); English primary (0.5B, Turbo)
Inference Speed: Faster than real time (Turbo: single-step decoding)
VRAM Required: 5-7GB (runs on RTX 3060 and above)
Watermarking: Built-in PerTh neural watermark (imperceptible)
Blind Test vs ElevenLabs: 63.75% user preference for Chatterbox (Podonos evaluation)
Downloads: 1M+ on HuggingFace, 11k+ GitHub stars
License: MIT License (full commercial rights)
ComfyUI Access: Native support on Floyo (2 workflows)

What can you create with Chatterbox?

Chatterbox covers expressive voiceovers, voice cloning, voice changing, character dialogue with emotion control, multilingual narration, podcast production, e-learning content, game NPC voices, and voice agent audio. The emotion exaggeration parameter and paralinguistic tags make it suited for character-driven content where flat, neutral TTS would sound robotic.

Emotion Exaggeration: A single parameter controls expressiveness from flat monotone to dramatically emotional. First open-source TTS with this feature. Use cases: character voices, dramatic narration, game dialogue.

Voice Cloning: Zero-shot cloning from 5 seconds of reference audio. Captures timbre, accent, and speaking style. Works in all 23 languages. Use cases: brand voice, character consistency, personalized content.

Voice Changing: Transform existing audio into a different voice while preserving the content and timing. Uses the TTS Audio Suite pipeline. Use cases: dubbing, privacy, character adaptation, content repurposing.

Paralinguistic Tags: Write [laugh], [cough], [chuckle], [sigh] in your text. The model renders them naturally at that exact point. Native in Turbo. Use cases: audiobooks, animated characters, social content.

23 Languages: Multilingual TTS with voice cloning in each language. English primary; Spanish, French, Mandarin, and more in expanding support. Use cases: localized content, international marketing, multilingual apps.

Pipeline Integration: Chain with video models in ComfyUI. Generate video with Wan 2.7 or Hailuo, add narration with Chatterbox in the same workflow, or use the voice changer to revoice existing content. Use cases: video production, content localization, dubbing pipelines.

What are Chatterbox's key features?

Chatterbox's feature set is built around two ideas: give developers the same quality as paid TTS APIs, and give them controls that paid APIs don't offer. Emotion exaggeration and paralinguistic tags are capabilities that ElevenLabs doesn't provide at all. The MIT license means zero per-word costs.

Emotion Exaggeration Control

First open-source TTS with this feature. A single parameter (0.0 to 1.0) controls how expressive the voice is. At 0.0, delivery is flat and monotone. At 1.0, it is dramatically emotional. This is different from emotion presets (happy/sad/angry) because you control intensity, not category. A character can be slightly nervous (0.3) or extremely nervous (0.9) with the same text.
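As a concrete illustration of how the dial behaves (these helpers are hypothetical, not part of the Chatterbox API; the buckets paraphrase the descriptions above):

```python
def clamp_exaggeration(value: float) -> float:
    """Clamp an emotion-exaggeration request into the valid 0.0-1.0 range."""
    return max(0.0, min(1.0, value))

def describe_exaggeration(value: float) -> str:
    """Rough human-readable bucket for a clamped exaggeration value
    (illustrative thresholds, not defined by the model)."""
    v = clamp_exaggeration(value)
    if v < 0.2:
        return "flat / monotone"
    if v < 0.6:
        return "moderately expressive"
    return "dramatically emotional"
```

The key point is that the value is continuous: 0.3 and 0.9 both express "nervous," at different intensities, with identical input text.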

Zero-Shot Voice Cloning

Provide 5 seconds of reference audio. The model captures the speaker's timbre, accent, and speaking characteristics without fine-tuning. Longer samples (7-20 seconds) improve accuracy. Works in all 23 supported languages. The cloned voice responds to emotion exaggeration and paralinguistic tags while maintaining its identity.
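Since 5 seconds is the floor and 7-20 seconds the sweet spot, a hypothetical preflight check on a reference WAV can catch bad clips before generation (stdlib only; the thresholds come from the numbers above):

```python
import wave

MIN_SECONDS = 5.0          # minimum for zero-shot cloning
BEST_RANGE = (7.0, 20.0)   # range reported to give the best results

def check_reference_clip(path: str) -> str:
    """Return 'too_short', 'ok', or 'ideal' for a WAV reference clip,
    based on its duration (frames / sample rate)."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if duration < MIN_SECONDS:
        return "too_short"
    if BEST_RANGE[0] <= duration <= BEST_RANGE[1]:
        return "ideal"
    return "ok"
```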

Paralinguistic Tags

Write [laugh], [cough], [chuckle], [sigh] directly in your text. The model renders these sounds naturally at that exact point in the speech. In the Turbo variant, these tags are native (not bolted on). This eliminates the need to splice sound effects into TTS output manually.
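A hypothetical pre-validation step (not part of the model) can scan input text for bracketed tags and flag any outside the set listed above, so typos like [shrug] don't end up spoken literally:

```python
import re

# Tags named on this page; the model's full supported set may be larger.
SUPPORTED_TAGS = {"laugh", "cough", "chuckle", "sigh"}

def find_unsupported_tags(text: str) -> list[str]:
    """Return bracketed lowercase tags in `text` that are not in the
    known-supported set."""
    tags = re.findall(r"\[([a-z]+)\]", text)
    return [t for t in tags if t not in SUPPORTED_TAGS]
```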

Turbo: Single-Step Decoding

Chatterbox Turbo uses a distilled 350M parameter architecture that reduces the speech-token-to-mel decoder from 10 diffusion steps to one. This makes it one of the fastest open-source TTS models available. Designed for real-time voice agents and interactive applications where latency is critical.

Built-in Watermarking

All generated audio includes Resemble AI's PerTh neural watermark. The watermark is imperceptible to listeners but detectable by verification tools. This enables tracking of synthetic audio without degrading quality. Important for responsible deployment in contexts where synthetic voice detection matters.

MIT License

Fully open source. Zero per-word, per-minute, or per-character costs from the model license. You can deploy locally, use on Floyo, integrate into commercial products, or modify the code. The same license covers all three variants (0.5B, Multilingual, Turbo). Resemble AI also offers hosted enterprise services for teams that want managed infrastructure.

How does Chatterbox compare to other TTS models?

Chatterbox is the strongest open-source TTS for developers who need both quality and zero API costs. It outperforms ElevenLabs in blind tests. Fish Audio S2 leads on inline emotion tags (1,500+) and language count (80+). MiniMax Speech 2.8 HD leads on broadcast fidelity and arena rankings. VibeVoice leads on long-form multi-speaker (90 min, 4 speakers). Chatterbox's edge: emotion exaggeration, paralinguistic tags, and MIT licensing with no per-word costs.

Chatterbox: exaggeration dial + paralinguistic tags; 5-second zero-shot cloning; open source (MIT); 23 languages
Fish Audio S2: 1,500+ free-form tags; 10-30 second cloning; open source (research license); 80+ languages
MiniMax Speech 2.8 HD: 7 emotion modes + interjections; 5-second cloning; closed (API only); 40+ languages
ElevenLabs: style presets; instant cloning; closed (API only); 32+ languages

Source: Resemble AI official documentation, Podonos blind evaluation results, HuggingFace model cards, GitHub repository, and third-party benchmark comparisons as of April 2026.

How does Chatterbox work?

Chatterbox uses a 0.5B parameter backbone trained on 500k hours of curated speech data. The architecture generates speech tokens from text input, then converts those tokens to mel spectrograms through a diffusion decoder (10 steps in original, 1 step in Turbo). A vocoder converts the mel spectrogram to the final audio waveform. Voice cloning works by encoding the reference audio into a speaker embedding that conditions the generation.

The emotion exaggeration parameter modifies the conditioning signal during generation. At low values, the model produces flat, neutral delivery. At high values, it amplifies pitch variation, timing dynamics, and emphasis patterns. This is a continuous control, not a categorical selection, so you get precise authoring of emotional intensity.

Chatterbox Turbo distills the 10-step diffusion decoder into a single-step decoder. The student model learns to approximate the full 10-step output in one forward pass. This dramatically reduces latency while retaining audio fidelity. The 350M parameter count (smaller than the 0.5B original) further reduces compute requirements.
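The three stages described above can be sketched schematically. The stub functions below stand in for the real networks; names, shapes, and the "refinement" arithmetic are illustrative only, but the stage order and the 10-step vs 1-step contrast match the architecture as described:

```python
def text_to_speech_tokens(text: str) -> list[int]:
    """Stage 1 (stub): the backbone maps text, plus speaker and emotion
    conditioning, to a sequence of discrete speech tokens."""
    return [ord(c) % 256 for c in text]

def tokens_to_mel(tokens: list[int], steps: int) -> list[float]:
    """Stage 2 (stub): the diffusion decoder (10 steps in the original,
    1 distilled step in Turbo) turns speech tokens into a mel spectrogram."""
    mel = [float(t) for t in tokens]
    for _ in range(steps):          # each step refines the mel estimate
        mel = [m * 0.5 for m in mel]
    return mel

def mel_to_waveform(mel: list[float]) -> list[float]:
    """Stage 3 (stub): a vocoder converts the mel spectrogram to samples."""
    return mel

def generate(text: str, turbo: bool = False) -> list[float]:
    """Full pipeline: Turbo collapses the decoder to a single step."""
    steps = 1 if turbo else 10
    return mel_to_waveform(tokens_to_mel(text_to_speech_tokens(text), steps))
```

The distillation win is visible in the structure: Turbo does one decoder pass where the original does ten, which is where most of the latency reduction comes from.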

On Floyo, Chatterbox runs through native ComfyUI nodes on H100 NVL GPUs. The text-to-speech workflow takes your text and voice settings and generates audio. The voice changer workflow uses the TTS Audio Suite to transform existing audio into a different voice. Both workflows output audio files that can be chained with video generation nodes in the same ComfyUI pipeline.

Frequently Asked Questions

Common questions about running Chatterbox on Floyo.

Is Chatterbox free to use on Floyo?

You can start on Floyo's free plan. To continue using the service beyond the free tier, upgrade to a paid Floyo plan. Chatterbox itself is open source under the MIT License, so there is no additional API cost beyond your Floyo plan and zero per-word charges.

How do I run Chatterbox without installing anything?

Open Floyo in your browser, search "Chatterbox" in the template library, and pick a workflow. Click Run, write your text, optionally upload a voice reference, and generate. Floyo handles the GPU, ComfyUI environment, and model weights. No local install, no Python setup.

Who made Chatterbox?

Resemble AI, built by a three-person team. The company also offers hosted Chatterbox services for enterprise clients. Model weights are on HuggingFace (ResembleAI/chatterbox, ResembleAI/chatterbox-multilingual, ResembleAI/chatterbox-turbo). Over 1 million downloads and 11,000+ GitHub stars.

What is emotion exaggeration control?

A single parameter (0.0 to 1.0) that controls how expressive the voice is. At 0.0, delivery is flat. At 1.0, it is dramatically emotional. This is not a preset like "happy" or "sad." It is a continuous dial that controls intensity across all emotional dimensions. No other open-source TTS has this feature.

How does Chatterbox compare to ElevenLabs?

In blind A/B testing, Chatterbox was preferred 63.75% of the time over ElevenLabs. Chatterbox offers emotion exaggeration control and paralinguistic tags that ElevenLabs does not. Chatterbox is MIT licensed with zero per-word costs. ElevenLabs has a more polished consumer interface, broader language support, and a larger voice library. For developers building products, Chatterbox gives you more control for less money.

Can I combine Chatterbox with video models in one workflow?

Yes. Floyo runs ComfyUI, which lets you chain multiple models. Generate video with Wan 2.7 or Hailuo, add narration with Chatterbox, or use the voice changer to revoice existing content. All in one pipeline, all in your browser.

Can I use Chatterbox output commercially?

Yes. Chatterbox is released under the MIT License, which grants full commercial usage rights. You can use generated audio in products, marketing, client work, apps, games, and any other commercial context without additional licensing or per-word fees.

Can Chatterbox change an existing voice?

Yes. The "Voice Changer Using TTS Audio Suite" workflow on Floyo lets you transform existing audio into a different voice. Upload source audio, provide a target voice reference, and the pipeline revoices the content while preserving timing and content. Useful for dubbing, privacy, and content repurposing.

Try Chatterbox on Floyo

Open-source TTS that outperforms ElevenLabs. Emotion exaggeration, voice cloning, paralinguistic tags, 23 languages, MIT licensed. Run it in your browser.

Try Chatterbox Now → Browse All Models

Related Reading

Film and Animation Workflows on Floyo

Setting Up an AI Production Pipeline for Your Studio

Top AI Models on Floyo

Last updated: April 2026. Specs from Resemble AI official documentation, GitHub repository (resemble-ai/chatterbox), HuggingFace model cards, Podonos blind evaluation results, and third-party reviews.

Chatterbox Text to Speech

Chatterbox

TTS

Text to speech workflow using Chatterbox

Voice Changer using TTS Audio Suite (ChatterBox)

audio

Audio2Audio

Chatterbox

tts

TTS Audio Suite

voice conversion

Convert any voice to match a target speaker using ChatterBox TTS. Upload source and narrator audio, run it, get back a converted MP3. No voice training needed.
