floyo logo
Pricing
Create with Alibaba Happy Horse model now! Try here 👉

COMMUNITY PAGE

AI VIDEO GENERATION

Run Vidu Q3 on Floyo

ShengShu's world model that generates 16-second 1080p video with native synchronized audio in one pass. #1 on the Artificial Analysis benchmark at launch. Multi-shot sequencing, camera control, lip-sync, and reference-based consistency.

Run ShengShu's Vidu Q3 through ComfyUI in your browser. No API key, no installs, no local GPU.

Resolution

1080p @ 24fps

Duration

Up to 16 seconds

Audio

Native sync (dialogue + SFX + music)

Architecture

U-ViT (World Model)

No installation. Runs in browser. Updated April 2026.

What you get

Vidu Q3 is ShengShu Technology's world-model-based video generator that ranked #1 on the Artificial Analysis benchmark at launch (January 2026). It generates up to 16 seconds of 1080p video with native synchronized audio (dialogue, sound effects, background music) in a single pass. Built on the U-ViT (Universal Vision Transformer) architecture from ShengShu's Foundation World Model, it supports text-to-video, image-to-video, and reference-to-video with multi-entity consistency from 1-4 reference images, plus multi-shot sequencing with automatic camera switching, 6 cinematic VFX types, and multilingual lip-sync. Available as ComfyUI API nodes on Floyo.

VIDU Q3 WORKFLOWS ON FLOYO

Vidu Q3 for Text to Video

Vidu Q3 for Image to Video

What is Vidu Q3?

Vidu Q3 is a video generation model from ShengShu Technology (also known as Shengshu or Vidu AI), released in late January 2026. It ranked #1 on the Artificial Analysis Video Arena at launch and currently holds an Elo rating of 1220-1244, placing it #2 globally behind HappyHorse 1.0. It generates up to 16 seconds of 1080p video with synchronized audio (dialogue, sound effects, background music) in one forward pass.

The core advancement over previous Vidu versions (Q1, Q2) is the "all-in-one" generation approach. Earlier models required separate workflows for visual generation and audio post-production. Vidu Q3 generates video and audio together. The model understands the physics of sound and light simultaneously, so a rainy street scene automatically includes rain acoustics, ambient traffic, and atmospheric audio without you specifying them.

The 16-second duration is meaningful. Most competitor models cap at 5-10 seconds. With 16 seconds, you can fit a complete narrative arc: scene establishment, action, and resolution. Multi-shot sequencing within that window adds camera cuts, pans, zooms, and transitions. The model handles these automatically based on narrative understanding, or you can direct them manually through your prompt.
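When directing shots manually through the prompt, it helps to keep the shot list structured. A minimal sketch of composing such a prompt in Python; the "Shot N:" convention here is an illustration, not a documented Vidu Q3 prompt syntax:

```python
# Sketch: compose a multi-shot prompt as a numbered shot list.
# The "Shot N:" labeling is illustrative, not an official Vidu Q3 format.

def build_multishot_prompt(scene: str, shots: list[str]) -> str:
    """Join a scene description and an ordered shot list into one prompt."""
    lines = [scene.strip()]
    for i, shot in enumerate(shots, start=1):
        lines.append(f"Shot {i}: {shot.strip()}")
    return " ".join(lines)

prompt = build_multishot_prompt(
    "A rainy neon street in Tokyo at night, cinematic lighting.",
    [
        "wide establishing shot, slow push-in",
        "close-up on a pedestrian's umbrella, rain SFX prominent",
        "overhead drone pan as traffic passes, music swells",
    ],
)
print(prompt)
```

Keeping scene description and shots separate makes it easy to reuse the same scene across variations while swapping out individual shots.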

In April 2026, ShengShu released Vidu Q3 Reference-to-Video, which accepts 1-4 reference images to maintain character consistency across generations. This addresses the biggest production gap in AI video: making the same character look the same in every scene. The model also added 6 types of cinematic visual effects (particle systems, fluid simulation, dynamic motion, camera movement, transitions, lighting).

On Floyo, Vidu Q3 runs through ComfyUI API nodes. Two workflows cover text-to-video and image-to-video. Write a prompt or upload a reference image, and get a complete video with audio in 2-5 minutes. No audio post-production needed.

What are Vidu Q3's technical specifications?

Vidu Q3 uses ShengShu's proprietary U-ViT (Universal Vision Transformer) architecture, part of the Foundation World Model framework. It generates 1080p video at 24fps for up to 16 seconds with native synchronized audio (dialogue, SFX, music) in a single forward pass. Supports text-to-video, image-to-video, and reference-to-video (1-4 references). Multilingual lip-sync in English, Chinese, and Japanese.

Developer: ShengShu Technology (Shengshu / Vidu AI)
Architecture: U-ViT (Universal Vision Transformer), Foundation World Model
Resolution: 360p, 480p, 720p, 1080p
Frame Rate: 24fps
Duration: Up to 16 seconds per generation
Audio: Native synchronized (dialogue + SFX + background music) in one pass
Lip-Sync Languages: English, Chinese, Japanese
Modes: T2V, I2V, Reference-to-Video (1-4 reference images)
Multi-Shot: Yes (automatic camera switching based on narrative)
VFX Types: 6 (particle systems, fluid, dynamic motion, camera, transitions, lighting)
Aspect Ratios: 16:9, 9:16, 1:1, and more
Reference Consistency: Multi-entity from 1-4 reference images (characters, props, styles)
Prompt Enhancer: Built-in LLM for automatic prompt improvement (toggleable)
Arena Ranking: #1 at launch (Artificial Analysis), Elo 1220-1244 (#2 globally)
Generation Time: 2-5 minutes per clip
ComfyUI Access: API-based nodes on Floyo (2 workflows)
Release Date: January 30, 2026 (Q3) / April 13, 2026 (Reference-to-Video)
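The duration and frame-rate specs above imply a fixed frame budget per clip, which matters when estimating generation time or planning cuts. A trivial sketch of the arithmetic:

```python
# Sketch: frame budget implied by the spec table (24 fps, up to 16 s).
FPS = 24
MAX_SECONDS = 16

def frame_count(seconds: float, fps: int = FPS) -> int:
    """Number of frames a clip of the given length contains at this fps."""
    return int(seconds * fps)

# A full-length Vidu Q3 clip is 16 s x 24 fps = 384 frames.
print(frame_count(MAX_SECONDS))
```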

What can you create with Vidu Q3?

Vidu Q3 covers text-to-video, image-to-video, reference-based character-consistent video, multi-shot narrative sequences, product demos with voiceover, animated series episodes, ad creatives with dialogue, cinematic VFX shots, and architectural walkthroughs. All modes produce synchronized audio. The 16-second duration supports complete narrative arcs without manual stitching.

Text-to-Video: Generate 16-second 1080p video with synchronized dialogue, SFX, and music from a text prompt, with multi-shot camera control and automatic transitions. Use cases: short films, ads, social content, product demos.

Image-to-Video: Animate a still image into cinematic video with synchronized audio. Preserves composition while adding natural motion, lighting, and camera movement. Use cases: product animation, concept art to motion, photo stories.

Reference-to-Video: Upload 1-4 reference images (characters, props, costumes, styles). The model generates video maintaining visual consistency for every referenced element. Use cases: animated series, brand characters, consistent campaigns.

Native Audio-Video: Lip-synced dialogue, ambient soundscapes, Foley SFX, and background music generated alongside video in one pass, with no post-dubbing. Use cases: talking head content, narrative ads, immersive scenes.

Cinematic VFX: 6 VFX types: particle systems, fluid simulation, dynamic motion, camera movement, scene transitions, and lighting effects. Use cases: visual effects, stylized content, cinematic sequences.

Pipeline Integration: Chain with other models in ComfyUI. Generate a character with Nano Banana, animate with Vidu Q3, add custom narration with Fish Audio S2, upscale with Topaz. Use cases: multi-model production pipelines.

How does Vidu Q3 compare to other video models?

Vidu Q3 leads on native audio-video duration (16 seconds) and narrative storytelling. HappyHorse 1.0 currently holds #1 on the Artificial Analysis Arena with higher Elo. Kling 3.0 Omni offers 4K at 60fps. Seedance 2.0 supports 12-file multi-modal input. Wan 2.7 leads on open-source flexibility. Vidu Q3's edge: longest native audio-video clips, multi-entity reference consistency, and world-model reasoning.

Vidu Q3: 16-second duration; native audio (dialogue + SFX + music); reference input from 1-4 images (multi-entity); Arena Elo 1,220-1,244.

HappyHorse 1.0: 15-second duration; native audio with 7-language lip-sync; subject reference input; Arena Elo 1,392 (#1).

Kling 3.0 Omni: 15-second duration; native audio (5+ languages); Elements reference system; Arena Elo N/A.

Wan 2.7: 5-10 second duration; no native audio; reference image input; Arena Elo N/A.

Source: Artificial Analysis Video Arena, SuperCLUE Reference-to-Video leaderboard, ShengShu Technology official announcements, WaveSpeedAI documentation, Atlas Cloud blog, and third-party reviews as of April 2026.

How does Vidu Q3 work?

Vidu Q3 uses ShengShu's proprietary U-ViT (Universal Vision Transformer) architecture, part of the Foundation World Model framework. The model generates video and audio simultaneously by reasoning about motion, sound, and narrative context as a unified prediction task. Text, visual, and audio signals are processed together, which is why the output is synchronized by default.

For text-to-video, your prompt feeds through an optional prompt enhancer, then the U-ViT generates frames and audio tokens together. The model plans camera movements, shot transitions, and audio events as part of the same sequence. For image-to-video, the input image anchors the first frame while the model generates motion, camera work, and audio around it.

Reference-to-Video uses multi-reference fusion. You upload 1-4 images that define characters, props, or styles. The model extracts identity features from each reference and maintains them throughout the generated video. This is different from single-image animation: multiple separate references merge into one coherent scene while each entity keeps its distinct visual identity.
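Since reference-to-video accepts only 1-4 images, it is worth validating the count before submitting a job. A sketch of building such a request payload; the field names ("model", "mode", "reference_images") are hypothetical placeholders, not a documented ShengShu API schema, and only the 1-4 image limit comes from the model's published capability:

```python
# Sketch of a reference-to-video request payload.
# Field names are hypothetical placeholders, not an official API schema.

def build_r2v_payload(prompt: str, reference_images: list[str]) -> dict:
    """Build a request dict, enforcing the published 1-4 reference limit."""
    if not 1 <= len(reference_images) <= 4:
        raise ValueError("Vidu Q3 reference-to-video accepts 1-4 reference images")
    return {
        "model": "vidu-q3",
        "mode": "reference-to-video",
        "prompt": prompt,
        "reference_images": reference_images,  # e.g. URLs or base64 strings
    }

payload = build_r2v_payload(
    "The knight and her wolf cross a snowy bridge at dusk.",
    ["knight.png", "wolf.png"],
)
```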

On Floyo, Vidu Q3 runs through ComfyUI API nodes. Your prompt and any reference images are sent to ShengShu's inference servers. The complete video with synchronized audio returns to your ComfyUI canvas as an MP4. You can chain Vidu Q3 with other models: upscale with Topaz, add custom narration with Fish Audio S2, or watermark with Orion 4D.
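Under the hood, ComfyUI jobs are queued by POSTing a node graph to the server's /prompt endpoint. A minimal sketch of that flow; the endpoint and payload shape follow ComfyUI's HTTP API, but the node class name "ViduQ3TextToVideo" and its inputs are hypothetical stand-ins for Floyo's API nodes:

```python
import json
import urllib.request

# Sketch: queue a workflow on a ComfyUI server via its /prompt endpoint.
# The node class "ViduQ3TextToVideo" and its input names are hypothetical.

def build_workflow(prompt_text: str) -> dict:
    """Build a one-node ComfyUI graph keyed by node id."""
    return {
        "1": {
            "class_type": "ViduQ3TextToVideo",  # hypothetical API node
            "inputs": {"prompt": prompt_text, "resolution": "1080p", "duration": 16},
        }
    }

def submit(workflow: dict, host: str = "127.0.0.1:8188") -> bytes:
    """POST the graph; the JSON response carries the queued prompt_id."""
    req = urllib.request.Request(
        f"http://{host}/prompt",
        data=json.dumps({"prompt": workflow}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

wf = build_workflow("A lighthouse at dawn, waves crashing, ambient sea audio.")
# submit(wf)  # uncomment when a ComfyUI server is reachable
```

On Floyo you never touch this endpoint directly, but the same graph-of-nodes structure is what lets Vidu Q3 chain with upscalers and audio models in one pipeline.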

Fair warning: Vidu Q3 is API-based, not open source. Generation runs on ShengShu's servers and takes 2-5 minutes per clip depending on resolution and duration. The baked audio is convenient but may not match your licensed music library; for brand-critical work, you may want to replace the generated audio with your own assets in post. Fingers and fine details can still get mushy in some generations. API pricing applies through your Floyo API Wallet.

Frequently Asked Questions

Common questions about running Vidu Q3 on Floyo.

Is Vidu Q3 free to use on Floyo?

You can start with Floyo's free pricing plan. Floyo gives $0.25 in free API credits on signup. To continue using the service beyond the free tier, upgrade your Floyo pricing plan. Vidu Q3 runs as an API node, so generation costs come from your API Wallet (separate from your plan's GPU time).

How do I run Vidu Q3 without installing anything?

Open Floyo in your browser, search "Vidu Q3" in the template library, and pick the text-to-video or image-to-video workflow. Click Run, write your prompt or upload an image, and generate. Floyo handles the ComfyUI environment and API connection. No local install, no Python setup, no API key management.

Who made Vidu Q3?

ShengShu Technology, an AI company based in Singapore specializing in multimodal generative AI. Founded by Dr. Zhu Jun. The Vidu Q3 model launched January 30, 2026 and ranked #1 globally on the Artificial Analysis benchmark at launch. ShengShu demonstrated the world's first AI-powered animated series production at SXSW 2026. Vidu is integrated into Alibaba Cloud Model Studio.

Does Vidu Q3 generate audio with video?

Yes. Dialogue with lip-sync, environmental sound effects, and background music are generated in the same pass as the video. No post-dubbing needed. The audio sits around -14 to -12 LUFS, comfortable for social platforms. You can prompt for mood ("gentle electronic," "minimal percussion") or let the model match audio to visuals automatically.
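If you need to verify a clip's loudness yourself, ffmpeg's loudnorm filter reports integrated loudness (the "input_i" field, in LUFS) as a JSON block on stderr when run as `ffmpeg -i clip.mp4 -af loudnorm=print_format=json -f null -`. A sketch of parsing that output; the sample string below is fabricated example data mimicking loudnorm's format:

```python
import json

# Sketch: read integrated loudness (LUFS) from ffmpeg loudnorm JSON output.
# Assumes the JSON block has already been isolated from ffmpeg's stderr.

def integrated_lufs(loudnorm_json: str) -> float:
    """Return the measured integrated loudness ("input_i") in LUFS."""
    return float(json.loads(loudnorm_json)["input_i"])

# Fabricated sample mimicking loudnorm's print_format=json output.
sample = '{"input_i": "-13.2", "input_tp": "-1.1", "input_lra": "6.4", "input_thresh": "-23.6"}'
lufs = integrated_lufs(sample)
assert -14 <= lufs <= -12  # within the range typical of Vidu Q3 output
```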

How does Vidu Q3 compare to HappyHorse 1.0?

HappyHorse 1.0 currently ranks higher on the Artificial Analysis Arena (Elo 1,392 vs Vidu Q3's 1,220-1,244) and supports 7-language lip-sync. Vidu Q3 generates 16-second clips (vs 15 seconds), supports multi-entity reference consistency from 1-4 images, and offers 6 built-in VFX types. Vidu Q3 is available on Floyo now. HappyHorse is coming soon.

Can I combine Vidu Q3 with other AI models in one workflow?

Yes. Floyo runs ComfyUI, which lets you chain multiple models. Generate a character with Nano Banana, animate with Vidu Q3, add custom narration with Fish Audio S2, upscale with Topaz Video AI. Or use Vidu Q3 for the video and replace the baked audio with a licensed track. All in one pipeline.

Can I use Vidu Q3 output commercially?

Check ShengShu's terms of service for commercial usage details. The generated audio includes music and sound effects that may have specific rights considerations. For brand-critical work, many creators replace the baked audio with their own licensed assets in post-production.

How long does Vidu Q3 take to generate?

Typical generation takes 2-5 minutes depending on resolution and duration. A 16-second 1080p clip takes longer than a 4-second 480p clip. Some API providers offer off-peak pricing for non-urgent jobs that process within 48 hours at reduced cost.

Try Vidu Q3 on Floyo

16-second 1080p video with native audio, multi-shot sequencing, reference consistency, and cinematic VFX. Run it in your browser.

Try Vidu Q3 Now → Browse All Models

Related Reading

Film and Animation Workflows on Floyo

AI Ad Creatives for Social and Web

Top AI Models on Floyo

Last updated: April 2026. Specs from ShengShu Technology official announcements (PRNewswire), Vidu official website (vidu.com), Artificial Analysis Video Arena, SuperCLUE Reference-to-Video leaderboard, WaveSpeedAI documentation, Atlas Cloud blog, Novita AI documentation, and third-party reviews.

Vidu Q3 for Image to Video

Animation

Image2Video

Vidu Q3

Bring your images to life

Vidu Q3 for Text to Video

Text2Video

Videography

Vidu Q3

Create good videos with Vidu Q3
