LongCat AudioDiT 3.5B - TTS, Voice Clone, Multi-Sp

longcat

text to audio

tts

Generates in about 1 min 49 secs

floyoofficial

Nodes & Models

ComfyUI Official

LongCatTTS

LoadAudio

MarkdownNote

PreviewAudio

NormalizeAudioLoudness

Reroute

LongCatMultiSpeakerTTS

LongCatVoiceCloneTTS

SaveAudioMP3

Three text-to-speech modes in one workflow, all powered by LongCat AudioDiT 3.5B.

Generate speech from text. Clone a voice from a 3 to 15 second audio sample. Or write dialogue between multiple speakers using simple speaker tags. Same model, three nodes, three modes.

Type your text, pick your mode, hit run. Models auto-download on first use.

How do you use LongCat AudioDiT 3.5B for text to speech?

Pick the node that matches what you need: LongCatTTS for basic text to speech, LongCatVoiceCloneTTS for cloning from a reference audio clip, or LongCatMultiSpeakerTTS for dialogue. Type your text, leave steps and guidance at the node defaults (16 steps, or 25 for voice cloning, with guidance at 4), and hit run. The model auto-downloads on first run.

Text input: What you want spoken. Plain text for basic TTS. For multi-speaker mode, use [speaker_1]:, [speaker_2]: tags at the start of each line. The model handles punctuation and pacing on its own.

Reference audio (voice cloning): 3 to 15 seconds works best. Cleaner source audio means a cleaner clone. The workflow includes optional Whisper transcription nodes (bypassed by default). Enable them to auto-generate the transcript from your reference, or paste the transcript in manually for sharper results.

Steps: Defaults: 16 for basic TTS, 25 for voice clone, 16 for multi-speaker. Range: 4 to 64. Want faster generation? Drop to 4 to 8. Want cleaner audio at higher cost? Push to 32. Past 32 the quality gains flatten out.

Guidance strength: Default: 4.0. Range: 0 to 10. Higher values stick closer to the text and reference, lower values give more variation. The catch: above 7 the output starts sounding compressed.

Guidance method: Two options. Use cfg for basic TTS (it's the LongCatTTS default). Use apg for voice cloning and multi-speaker (defaults on those nodes). APG holds onto reference voice characteristics better.

Model variant: Default: LongCat-AudioDiT-3.5B-bf16, recommended at ~12GB VRAM. Switch to fp8 (~8-12GB) if VRAM is tight, or fp32 (~20GB) for maximum quality on heavy hardware.

Attention: Default: auto. Set to sage_attention for faster generation on supported GPUs, or flash_attention for another speed bump. Leave on auto if you're not sure.

Keep model loaded: Default: true. Caches the model and offloads to CPU between runs for fast follow-up generations. Set to false if you want VRAM freed after each run. The tradeoff: longer load time on the next run.
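To keep the per-mode defaults and ranges above straight, here is a minimal Python sketch that collects them in one place and checks a set of values before you type them into a node. It is not part of the workflow or any LongCat API: the `TTSSettings` class, `make_settings`, and `warnings_for` are hypothetical names made up for illustration, and only the numbers and node names come from the notes above.

```python
from dataclasses import dataclass

# Ranges quoted in the parameter notes above.
STEPS_RANGE = (4, 64)
GUIDANCE_RANGE = (0.0, 10.0)

# Per-mode defaults from the notes above. Keys reuse the node names;
# the dictionary itself is illustrative, not a real API.
MODE_DEFAULTS = {
    "LongCatTTS": {"steps": 16, "guidance_method": "cfg"},
    "LongCatVoiceCloneTTS": {"steps": 25, "guidance_method": "apg"},
    "LongCatMultiSpeakerTTS": {"steps": 16, "guidance_method": "apg"},
}


@dataclass
class TTSSettings:  # hypothetical container, for illustration only
    mode: str
    steps: int
    guidance_strength: float
    guidance_method: str


def make_settings(mode, steps=None, guidance_strength=4.0, guidance_method=None):
    """Fill in the mode's defaults and reject out-of-range values."""
    if mode not in MODE_DEFAULTS:
        raise ValueError(f"unknown mode: {mode!r}")
    defaults = MODE_DEFAULTS[mode]
    steps = defaults["steps"] if steps is None else steps
    guidance_method = guidance_method or defaults["guidance_method"]
    if not STEPS_RANGE[0] <= steps <= STEPS_RANGE[1]:
        raise ValueError(f"steps must be between 4 and 64, got {steps}")
    if not GUIDANCE_RANGE[0] <= guidance_strength <= GUIDANCE_RANGE[1]:
        raise ValueError(f"guidance must be between 0 and 10, got {guidance_strength}")
    if guidance_method not in ("cfg", "apg"):
        raise ValueError(f"guidance method must be cfg or apg, got {guidance_method!r}")
    return TTSSettings(mode, steps, guidance_strength, guidance_method)


def warnings_for(settings):
    """Return soft warnings for values that are legal but past the useful range."""
    notes = []
    if settings.steps > 32:
        notes.append("steps above 32: quality gains flatten out")
    if settings.guidance_strength > 7:
        notes.append("guidance above 7: output may start sounding compressed")
    return notes
```

For example, `make_settings("LongCatVoiceCloneTTS")` fills in 25 steps and apg, while `make_settings("LongCatTTS", steps=2)` raises a `ValueError` because 2 is below the documented minimum of 4.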

What is LongCat AudioDiT 3.5B good for?

Generating natural speech for content production: voiceovers, audiobook drafts, podcast intros, narration tracks, dialogue for video, and prototype audio for storyboards. Voice cloning lets you reuse a single reference voice across a project. Multi-speaker mode handles back-and-forth conversation with consistent voices for each tagged speaker.

The voice cloning is the standout. Drop in a 5-second clip of someone speaking and the model produces new lines in that same voice. Useful for character consistency across episodes, ADR-style replacement on existing tracks, or synthetic narration that matches an existing brand voice.

Multi-speaker mode is rare in open-source TTS. Most models force you to generate each speaker's lines separately and stitch them later. LongCat handles a full conversation in one pass with stable characterization for each tagged speaker.

When to skip: producing a hero voiceover for a major commercial release? Studio TTS like ElevenLabs gives more polish. Need real-time TTS for an app? This is diffusion-based and runs slower than streaming options.

FAQ

What's the difference between cfg and apg guidance in LongCat AudioDiT? Both control how strictly the model follows your input. CFG is tuned for plain text-to-speech and works well for the basic LongCatTTS node. APG holds onto reference voice characteristics more tightly, which matters when you're cloning a voice or running multi-speaker dialogue. Use the defaults each node ships with unless you have a reason to override.

How long should my reference audio be for LongCat AudioDiT voice cloning? 3 to 15 seconds. Around 5 to 8 seconds of clean, single-speaker audio gives the best clone quality. Longer references don't help and can actually hurt if they include background noise or other speakers. Adding a transcript via the prompt_text input improves quality, so paste it in or enable the Whisper auto-transcribe nodes.
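If you want to check a reference clip against the 3 to 15 second window before uploading it, a rough sketch using only Python's standard `wave` module could look like this. It assumes the clip is an uncompressed PCM WAV file (MP3 and other formats would need a different reader), and the function names are hypothetical helpers, not part of the workflow.

```python
import wave

# Recommended window for reference audio, from the answer above.
MIN_SECONDS = 3.0
MAX_SECONDS = 15.0


def wav_duration_seconds(path):
    """Duration of an uncompressed WAV file: frame count / sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_length(path):
    """Return (duration, status), where status is 'too short', 'ok' or 'too long'."""
    duration = wav_duration_seconds(path)
    if duration < MIN_SECONDS:
        return duration, "too short"
    if duration > MAX_SECONDS:
        return duration, "too long"
    return duration, "ok"
```

This only measures length. It says nothing about background noise or extra speakers, which matter just as much for clone quality.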

Can LongCat AudioDiT generate dialogue between multiple speakers? Yes. Use the LongCatMultiSpeakerTTS node and tag each line with [speaker_1]:, [speaker_2]:, and so on. Connect a reference audio for each speaker. The model produces a single audio file with each speaker keeping a stable voice across their lines, including pacing and turn-taking.
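A quick way to sanity-check a dialogue script before pasting it into the node is to confirm every line starts with a speaker tag and to list which speakers need a reference clip. The sketch below is a plain Python illustration of the tag format described above; the regular expression reflects the [speaker_N]: pattern as written here, and the node's actual parser may be more or less strict.

```python
import re

# Matches "[speaker_1]: some text" at the start of a line (assumed format).
TAG_PATTERN = re.compile(r"^\[(speaker_\d+)\]:\s*(.*)$")


def parse_dialogue(script):
    """Split a tagged script into (speaker, text) turns, skipping blank lines."""
    turns = []
    for number, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        match = TAG_PATTERN.match(line)
        if match is None:
            raise ValueError(f"line {number} does not start with a [speaker_N]: tag")
        speaker, text = match.groups()
        if not text:
            raise ValueError(f"line {number} has a tag but no text")
        turns.append((speaker, text))
    return turns


def speakers_needing_reference(turns):
    """Unique speakers in order of first appearance; each needs a reference clip."""
    seen = []
    for speaker, _ in turns:
        if speaker not in seen:
            seen.append(speaker)
    return seen


script = """[speaker_1]: Did you get the files I sent?
[speaker_2]: Just now. Give me a minute to look.
[speaker_1]: No rush."""

turns = parse_dialogue(script)
print(speakers_needing_reference(turns))
```

A line without a tag raises a `ValueError` naming the line number, which is easier to fix up front than after a generation run.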

How much VRAM does LongCat AudioDiT 3.5B need? The bf16 model needs around 12GB of VRAM and is the recommended default. The fp8 variant runs in 8-12GB for tighter setups. The fp32 variant needs ~20GB and is overkill for most uses. All three auto-download on first run.
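The VRAM guidance above boils down to a simple rule of thumb, sketched here in Python. The thresholds are the approximate figures quoted above treated as hard cutoffs, which is a simplification; the short labels "bf16", "fp8", and "fp32" stand in for the full variant names, and `pick_variant` is a hypothetical helper.

```python
def pick_variant(vram_gb, prefer_max_quality=False):
    """Suggest a model variant from available VRAM in GB, or None if too little.

    Thresholds follow the approximate figures in the text:
    fp32 ~20GB, bf16 ~12GB (recommended default), fp8 ~8-12GB.
    """
    if prefer_max_quality and vram_gb >= 20:
        return "fp32"
    if vram_gb >= 12:
        return "bf16"
    if vram_gb >= 8:
        return "fp8"
    return None  # below the quoted range for any variant


for vram in (6, 10, 16, 24):
    print(vram, pick_variant(vram))
```

Note that fp32 is only suggested when explicitly requested, since bf16 is the recommended default even on larger cards.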

How do you run LongCat AudioDiT TTS online? You can run it through Floyo. No installation, no setup. Open the workflow in your browser, type your text or upload reference audio, and hit run.
