LongCat AudioDiT 3.5B - TTS, Voice Clone, Multi-Sp
longcat
text to audio
tts
Nodes & Models
LongCatTTS
LoadAudio
MarkdownNote
PreviewAudio
NormalizeAudioLoudness
Reroute
LongCatMultiSpeakerTTS
LongCatVoiceCloneTTS
SaveAudioMP3
Three text-to-speech modes in one workflow, all powered by LongCat AudioDiT 3.5B.
Generate speech from text. Clone a voice from a 3 to 15 second audio sample. Or write dialogue between multiple speakers using simple speaker tags. Same model, three nodes, three modes.
Type your text, pick your mode, hit run. Models auto-download on first use.
How do you use LongCat AudioDiT 3.5B for text to speech?
Pick the node that matches what you need: LongCatTTS for basic text to speech, LongCatVoiceCloneTTS for cloning from a reference audio, or LongCatMultiSpeakerTTS for dialogue. Type your text, leave steps and guidance at the node's defaults (16 steps for basic TTS and multi-speaker, 25 for voice clone, guidance 4), and hit run. The model auto-downloads on first run.
Text input: What you want spoken. Plain text for basic TTS. For multi-speaker mode, use [speaker_1]:, [speaker_2]: tags at the start of each line. The model handles punctuation and pacing on its own.
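Want to sanity-check a dialogue script before pasting it in? Here is a minimal sketch that checks the tag format described above. The helper name parse_dialogue and the strict one-tag-per-line rule are assumptions for illustration, not part of the LongCat nodes, and the node itself may be more forgiving.

```python
import re

# Assumed tag shape: "[speaker_<number>]:" at the start of a line.
TAG = re.compile(r"^\[speaker_(\d+)\]:\s*(.*)$")


def parse_dialogue(script: str) -> list[tuple[int, str]]:
    """Split a tagged script into (speaker_number, text) pairs.

    Hypothetical helper: raises ValueError on any non-blank line that
    does not start with a speaker tag or has no text after the tag.
    """
    turns = []
    for lineno, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped
        match = TAG.match(line)
        if match is None:
            raise ValueError(f"line {lineno} has no [speaker_N]: tag: {line!r}")
        speaker, text = int(match.group(1)), match.group(2).strip()
        if not text:
            raise ValueError(f"line {lineno} has a tag but no text")
        turns.append((speaker, text))
    return turns


script = """[speaker_1]: Did the render finish?
[speaker_2]: Just now. Want to hear it?"""

turns = parse_dialogue(script)
speakers = sorted({speaker for speaker, _ in turns})
```

The speakers list tells you how many distinct voices the script uses, which is handy when wiring up reference audio for each one.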
Reference audio (voice cloning): 3 to 15 seconds works best. Cleaner source audio means a cleaner clone. The workflow includes optional Whisper transcription nodes (bypassed by default). Enable them to auto-generate the transcript from your reference, or paste the transcript in manually for sharper results.
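If you want to check a clip's length before uploading it, a rough sketch using Python's standard wave module follows. It only reads uncompressed WAV files, and the function names are made up for this example. The 3 and 15 second limits are the recommendation above, not a hard limit enforced by the node.

```python
import wave

# Recommended reference length from the text above, in seconds.
MIN_SECONDS, MAX_SECONDS = 3.0, 15.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_length(seconds: float) -> str:
    """Classify a clip length against the recommended 3 to 15 second range."""
    if seconds < MIN_SECONDS:
        return "too short"
    if seconds > MAX_SECONDS:
        return "too long"
    return "ok"
```

Call check_reference_length(wav_duration_seconds("my_clip.wav")) with your own file path. For MP3 or other formats you would need a different reader.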
Steps: Defaults are 16 for basic TTS, 25 for voice clone, and 16 for multi-speaker. Range: 4 to 64. Want faster generation? Drop to 4 to 8. Want cleaner audio at higher cost? Push to 32. Past 32 the quality gains flatten out.
Guidance strength: Default is 4.0. Range: 0 to 10. Higher values stick closer to the text and reference, lower values give more variation. The catch: above 7 the output starts sounding compressed.
Guidance method: Two options. Use cfg for basic TTS (it's the LongCatTTS default). Use apg for voice cloning and multi-speaker (the default on those nodes). APG holds onto reference voice characteristics better.
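The defaults and ranges for steps, guidance strength, and guidance method can be summed up in a small lookup, sketched below. The Settings class and its field names are hypothetical and may not match the widget names on the actual nodes. The numbers come straight from the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Field names are illustrative, not the nodes' real input names.
    steps: int
    guidance_strength: float
    guidance_method: str  # "cfg" or "apg"


# Per-node defaults as listed in the text above.
DEFAULTS = {
    "LongCatTTS": Settings(16, 4.0, "cfg"),
    "LongCatVoiceCloneTTS": Settings(25, 4.0, "apg"),
    "LongCatMultiSpeakerTTS": Settings(16, 4.0, "apg"),
}


def check(settings: Settings) -> list[str]:
    """Return notes about the settings; an empty list means nothing to flag."""
    notes = []
    if not 4 <= settings.steps <= 64:
        notes.append("steps outside the 4 to 64 range")
    elif settings.steps > 32:
        notes.append("steps above 32: quality gains flatten out")
    if not 0 <= settings.guidance_strength <= 10:
        notes.append("guidance strength outside the 0 to 10 range")
    elif settings.guidance_strength > 7:
        notes.append("guidance above 7: output may sound compressed")
    if settings.guidance_method not in ("cfg", "apg"):
        notes.append("guidance method must be cfg or apg")
    return notes
```

Running check on any of the DEFAULTS entries returns an empty list. Pushing steps to 48 or guidance to 8 returns a note instead of an error, since both are still inside the allowed range.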
Model variant: Default is LongCat-AudioDiT-3.5B-bf16, recommended at ~12GB VRAM. Switch to fp8 (~8-12GB) if VRAM is tight, or fp32 (~20GB) for maximum quality on heavy hardware.
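As a rule of thumb, the variant choice can be written as a tiny function. This is a sketch only: the thresholds are the approximate VRAM figures quoted above, the function name is hypothetical, and real memory use will vary with text length and whatever else is loaded on the GPU.

```python
from typing import Optional


def pick_variant(vram_gb: float, prefer_max_quality: bool = False) -> Optional[str]:
    """Suggest a precision variant from available VRAM in GB.

    Returns "fp32", "bf16", "fp8", or None if even fp8 is unlikely to fit.
    Thresholds are approximations taken from the text above.
    """
    if prefer_max_quality and vram_gb >= 20:
        return "fp32"
    if vram_gb >= 12:
        return "bf16"  # the recommended default
    if vram_gb >= 8:
        return "fp8"
    return None
```

Note that bf16 stays the suggestion even on a 24GB card unless you explicitly ask for maximum quality, which matches the recommendation to treat bf16 as the default.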
Attention: Default is auto. Set to sage_attention for faster generation on supported GPUs, or flash_attention for another speed bump. Leave on auto if you're not sure.
Keep model loaded: Default is true. Caches the model and offloads to CPU between runs for fast follow-up generations. Set to false if you want VRAM freed after each run. The tradeoff: longer load time on the next run.
What is LongCat AudioDiT 3.5B good for?
Generating natural speech for content production: voiceovers, audiobook drafts, podcast intros, narration tracks, dialogue for video, and prototype audio for storyboards. Voice cloning lets you reuse a single reference voice across a project. Multi-speaker mode handles back-and-forth conversation with consistent voices for each tagged speaker.
The voice cloning is the standout. Drop in a 5-second clip of someone speaking and the model produces new lines in that same voice. Useful for character consistency across episodes, ADR-style replacement on existing tracks, or synthetic narration that matches an existing brand voice.
Multi-speaker mode is rare in open-source TTS. Most models force you to generate each speaker's lines separately and stitch them later. LongCat handles a full conversation in one pass with stable characterization for each tagged speaker.
When to skip: producing a hero voiceover for a major commercial release? Studio TTS like ElevenLabs gives more polish. Need real-time TTS for an app? This is diffusion-based and runs slower than streaming options.
FAQ
What's the difference between cfg and apg guidance in LongCat AudioDiT?
Both control how strictly the model follows your input. CFG is tuned for plain text-to-speech and works well for the basic LongCatTTS node. APG holds onto reference voice characteristics more tightly, which matters when you're cloning a voice or running multi-speaker dialogue. Use the defaults each node ships with unless you have a reason to override.
How long should my reference audio be for LongCat AudioDiT voice cloning?
3 to 15 seconds. Around 5 to 8 seconds of clean, single-speaker audio gives the best clone quality. Longer references don't help and can actually hurt if they include background noise or other speakers. Adding a transcript via the prompt_text input improves quality, so paste it in or enable the Whisper auto-transcribe nodes.
Can LongCat AudioDiT generate dialogue between multiple speakers?
Yes. Use the LongCatMultiSpeakerTTS node and tag each line with [speaker_1]:, [speaker_2]:, and so on. Connect a reference audio clip for each speaker. The model produces a single audio file in which each speaker keeps a stable voice across their lines, with pacing and turn-taking handled for you.
How much VRAM does LongCat AudioDiT 3.5B need?
The bf16 model needs around 12GB of VRAM and is the recommended default. The fp8 variant runs in 8-12GB for tighter setups. The fp32 variant needs ~20GB and is overkill for most uses. All three auto-download on first run.
How do you run LongCat AudioDiT TTS online?
You can run LongCat AudioDiT TTS online through Floyo. No installation, no setup. Open the workflow in your browser, type your text or upload reference audio, and hit run.

