
COMMUNITY PAGE
Run Vidu Q3 on Floyo
Home / Model / Vidu Q3 on Floyo
AI VIDEO GENERATION
Run Vidu Q3 on Floyo
ShengShu's world model that generates 16-second 1080p video with native synchronized audio in one pass. #1 on the Artificial Analysis benchmark at launch. Multi-shot sequencing, camera control, lip-sync, and reference-based consistency.
Run ShengShu's Vidu Q3 through ComfyUI in your browser. No API key, no installs, no local GPU.
Resolution: 1080p @ 24fps | Duration: Up to 16 seconds | Audio: Native sync (dialogue + SFX + music) | Architecture: U-ViT (World Model)
No installation. Runs in browser. Updated April 2026.






What do you get?
Vidu Q3 is ShengShu Technology's world-model-based video generator, ranked #1 on the Artificial Analysis benchmark at launch (January 2026). It generates up to 16 seconds of 1080p video with native synchronized audio (dialogue, sound effects, background music) in a single pass, and is built on the U-ViT (Universal Vision Transformer) architecture from ShengShu's Foundation World Model. It supports text-to-video, image-to-video, and reference-to-video with multi-entity consistency from 1-4 reference images, plus multi-shot sequencing with automatic camera switching, 6 cinematic VFX types, and multilingual lip-sync. Available as ComfyUI API nodes on Floyo.
VIDU Q3 WORKFLOWS ON FLOYO
What is Vidu Q3?
Vidu Q3 is a video generation model from ShengShu Technology (also known as Shengshu or Vidu AI), released in late January 2026. It ranked #1 on the Artificial Analysis Video Arena at launch and currently holds an Elo rating of 1,220-1,244, placing it #2 globally behind HappyHorse 1.0. It generates up to 16 seconds of 1080p video with synchronized audio (dialogue, sound effects, background music) in one forward pass.
The core advancement over previous Vidu versions (Q1, Q2) is the "all-in-one" generation approach. Earlier models required separate workflows for visual generation and audio post-production. Vidu Q3 generates video and audio together. The model understands the physics of sound and light simultaneously, so a rainy street scene automatically includes rain acoustics, ambient traffic, and atmospheric audio without you specifying them.
The 16-second duration is meaningful. Most competitor models cap at 5-10 seconds. With 16 seconds, you can fit a complete narrative arc: scene establishment, action, and resolution. Multi-shot sequencing within that window adds camera cuts, pans, zooms, and transitions. The model handles these automatically based on narrative understanding, or you can direct them manually through your prompt.
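Manual shot direction is just structured prompt text. A minimal sketch of one way to assemble it, where the "Shot N:" phrasing is an assumed convention for illustration, not official Vidu Q3 prompt syntax:

```python
# Illustrative only: the "Shot N:" phrasing is an assumed convention,
# not documented Vidu Q3 prompt syntax.
def build_multishot_prompt(scene: str, shots: list[str]) -> str:
    """Assemble a scene description plus numbered shot directions."""
    lines = [scene + "."]
    for i, shot in enumerate(shots, start=1):
        lines.append(f"Shot {i}: {shot}.")
    return " ".join(lines)

prompt = build_multishot_prompt(
    "A rainy neon-lit street at night, cinematic lighting",
    [
        "wide establishing shot, slow push-in",
        "close-up on a figure opening an umbrella",
        "low-angle tracking shot as they walk away, rain swelling",
    ],
)
print(prompt)
```

Whether you spell shots out like this or describe the scene loosely, the model fills in transitions itself.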
In April 2026, ShengShu released Vidu Q3 Reference-to-Video, which accepts 1-4 reference images to maintain character consistency across generations. This addresses the biggest production gap in AI video: making the same character look the same in every scene. The model also added 6 types of cinematic visual effects (particle systems, fluid simulation, dynamic motion, camera movement, transitions, lighting).
On Floyo, Vidu Q3 runs through ComfyUI API nodes. Two workflows cover text-to-video and image-to-video. Write a prompt or upload a reference image, and get a complete video with audio in 2-5 minutes. No audio post-production needed.
What are Vidu Q3's technical specifications?
Vidu Q3 uses ShengShu's proprietary U-ViT (Universal Vision Transformer) architecture, part of the Foundation World Model framework. It generates 1080p video at 24fps for up to 16 seconds with native synchronized audio (dialogue, SFX, music) in a single forward pass. Supports text-to-video, image-to-video, and reference-to-video (1-4 references). Multilingual lip-sync in English, Chinese, and Japanese.
| Spec | Details |
|---|---|
| Developer | ShengShu Technology (Shengshu / Vidu AI) |
| Architecture | U-ViT (Universal Vision Transformer), Foundation World Model |
| Resolution | 360p, 480p, 720p, 1080p |
| Frame Rate | 24fps |
| Duration | Up to 16 seconds per generation |
| Audio | Native synchronized (dialogue + SFX + background music) in one pass |
| Lip-Sync Languages | English, Chinese, Japanese |
| Modes | T2V, I2V, Reference-to-Video (1-4 reference images) |
| Multi-Shot | Yes (automatic camera switching based on narrative) |
| VFX Types | 6 (particle systems, fluid, dynamic motion, camera, transitions, lighting) |
| Aspect Ratios | 16:9, 9:16, 1:1, and more |
| Reference Consistency | Multi-entity from 1-4 reference images (characters, props, styles) |
| Prompt Enhancer | Built-in LLM for automatic prompt improvement (toggleable) |
| Arena Ranking | #1 at launch (Artificial Analysis), Elo 1220-1244 (#2 globally) |
| Generation Time | 2-5 minutes per clip |
| ComfyUI Access | API-based nodes on Floyo (2 workflows) |
| Release Date | January 30, 2026 (Q3) / April 13, 2026 (Reference-to-Video) |
What can you create with Vidu Q3?
Vidu Q3 covers text-to-video, image-to-video, reference-based character-consistent video, multi-shot narrative sequences, product demos with voiceover, animated series episodes, ad creatives with dialogue, cinematic VFX shots, and architectural walkthroughs. All modes produce synchronized audio. The 16-second duration supports complete narrative arcs without manual stitching.
| Capability | What It Does | Use Case |
|---|---|---|
| Text-to-Video | Generate 16-second 1080p video with synchronized dialogue, SFX, and music from a text prompt. Multi-shot camera control and automatic transitions. | Short films, ads, social content, product demos |
| Image-to-Video | Animate a still image into cinematic video with synchronized audio. Preserves composition while adding natural motion, lighting, and camera movement. | Product animation, concept art to motion, photo stories |
| Reference-to-Video | Upload 1-4 reference images (characters, props, costumes, styles). The model generates video maintaining visual consistency for every referenced element. | Animated series, brand characters, consistent campaigns |
| Native Audio-Video | Lip-synced dialogue, ambient soundscapes, Foley SFX, and background music generated alongside video in one pass. No post-dubbing. | Talking head content, narrative ads, immersive scenes |
| Cinematic VFX | 6 VFX types: particle systems, fluid simulation, dynamic motion, camera movement, scene transitions, and lighting effects. | Visual effects, stylized content, cinematic sequences |
| Pipeline Integration | Chain with other models in ComfyUI. Generate a character with Nano Banana, animate with Vidu Q3, add custom narration with Fish Audio S2, upscale with Topaz. | Multi-model production pipelines |
How does Vidu Q3 compare to other video models?
Vidu Q3 leads on native audio-video duration (16 seconds) and narrative storytelling. HappyHorse 1.0 currently holds #1 on the Artificial Analysis Arena with higher Elo. Kling 3.0 Omni offers 4K at 60fps. Seedance 2.0 supports 12-file multi-modal input. Wan 2.7 leads on open-source flexibility. Vidu Q3's edge: longest native audio-video clips, multi-entity reference consistency, and world-model reasoning.
| Model | Duration | Native Audio | Reference Input | Arena Elo |
|---|---|---|---|---|
| Vidu Q3 | 16 seconds | Yes (dialogue + SFX + music) | 1-4 images (multi-entity) | 1,220-1,244 |
| HappyHorse 1.0 | 15 seconds | Yes (7-lang lip-sync) | Subject reference | 1,392 (#1) |
| Kling 3.0 Omni | 15 seconds | Yes (5+ lang) | Elements system | N/A |
| Wan 2.7 | 5-10 seconds | No | Reference images | N/A |
Source: Artificial Analysis Video Arena, SuperCLUE Reference-to-Video leaderboard, ShengShu Technology official announcements, WaveSpeedAI documentation, Atlas Cloud blog, and third-party reviews as of April 2026.
How does Vidu Q3 work?
Vidu Q3 uses ShengShu's proprietary U-ViT (Universal Vision Transformer) architecture, part of the Foundation World Model framework. The model generates video and audio simultaneously by reasoning about motion, sound, and narrative context as a unified prediction task. Text, visual, and audio signals are processed together, which is why the output is synchronized by default.
For text-to-video, your prompt feeds through an optional prompt enhancer, then the U-ViT generates frames and audio tokens together. The model plans camera movements, shot transitions, and audio events as part of the same sequence. For image-to-video, the input image anchors the first frame while the model generates motion, camera work, and audio around it.
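Conceptually, a text-to-video job reduces to a single request. A hedged sketch of the payload shape, where the field names are assumptions for illustration (on Floyo these appear as widget fields on the API node, and the real names may differ):

```python
# Hypothetical payload shape for a Vidu Q3 text-to-video job.
# Field names are assumptions for illustration, not a documented schema.
payload = {
    "model": "vidu-q3",
    "mode": "text-to-video",
    "prompt": "A lighthouse in a storm; waves crash as thunder rolls.",
    "resolution": "1080p",    # supported: 360p, 480p, 720p, 1080p
    "duration": 16,           # seconds, up to 16 per generation
    "aspect_ratio": "16:9",
    "prompt_enhancer": True,  # built-in LLM rewrite, toggleable
    "audio": True,            # dialogue + SFX + music in the same pass
}
```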
Reference-to-Video uses multi-reference fusion. You upload 1-4 images that define characters, props, or styles. The model extracts identity features from each reference and maintains them throughout the generated video. This is different from single-image animation: multiple separate references merge into one coherent scene while each entity keeps its distinct visual identity.
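The 1-4 reference limit is worth checking before you submit. A hypothetical client-side guard (the API enforces its own limits; this helper is illustrative):

```python
def validate_references(refs: list[str]) -> list[str]:
    """Reference-to-Video accepts 1-4 reference images; reject anything else.
    Hypothetical client-side check, shown for illustration."""
    if not 1 <= len(refs) <= 4:
        raise ValueError("Reference-to-Video requires 1-4 reference images")
    return refs

# Two references: one character, one prop, merged into a single scene.
ok = validate_references(["hero.png", "prop_sword.png"])
```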
On Floyo, Vidu Q3 runs through ComfyUI API nodes. Your prompt and any reference images are sent to ShengShu's inference servers. The complete video with synchronized audio returns to your ComfyUI canvas as an MP4. You can chain Vidu Q3 with other models: upscale with Topaz, add custom narration with Fish Audio S2, or watermark with Orion 4D.
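The chaining idea is simple: each node's output becomes the next node's input. A toy sketch of that data flow with stand-in stages (the real stages are API nodes on the Floyo canvas, not Python lambdas):

```python
from functools import reduce

def run_pipeline(stages, initial):
    """Feed each stage's output into the next, like chained ComfyUI nodes."""
    return reduce(lambda data, stage: stage(data), stages, initial)

# Stand-in stages mirroring the pipeline described above.
stages = [
    lambda d: {**d, "character": "nano-banana output"},   # generate character
    lambda d: {**d, "video": "vidu-q3 16s clip"},         # animate
    lambda d: {**d, "narration": "fish-audio-s2 track"},  # custom narration
    lambda d: {**d, "video": "topaz upscaled"},           # upscale
]
result = run_pipeline(stages, {"prompt": "brand mascot demo"})
```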
Fair warning: Vidu Q3 is API-based, not open source. Inference runs on ShengShu's servers, and generation takes 2-5 minutes per clip depending on resolution and duration. The baked audio is convenient but may not match your licensed music library; for brand-critical work, you may want to replace the generated audio with your own assets in post. Fingers and fine details can still get mushy in some generations. API pricing applies through your Floyo API Wallet.
Frequently Asked Questions
Common questions about running Vidu Q3 on Floyo.
How much does it cost to run Vidu Q3 on Floyo?
You can start with Floyo's free pricing plan: Floyo gives $0.25 in free API credits on signup. To continue beyond the free tier, upgrade your Floyo pricing plan. Vidu Q3 runs as an API node, so generation costs come from your API Wallet (separate from your plan's GPU time).
How do I run Vidu Q3 in my browser?
Open Floyo in your browser, search "Vidu Q3" in the template library, and pick the text-to-video or image-to-video workflow. Click Run, write your prompt or upload an image, and generate. Floyo handles the ComfyUI environment and API connection. No local install, no Python setup, no API key management.
Who makes Vidu Q3?
ShengShu Technology, an AI company based in Singapore specializing in multimodal generative AI, founded by Dr. Zhu Jun. The Vidu Q3 model launched January 30, 2026 and ranked #1 globally on the Artificial Analysis benchmark at launch. ShengShu demonstrated the world's first AI-powered animated series production at SXSW 2026, and Vidu is integrated into Alibaba Cloud Model Studio.
Does Vidu Q3 generate audio?
Yes. Dialogue with lip-sync, environmental sound effects, and background music are generated in the same pass as the video. No post-dubbing needed. The audio sits around -14 to -12 LUFS, comfortable for social platforms. You can prompt for mood ("gentle electronic," "minimal percussion") or let the model match audio to visuals automatically.
How does Vidu Q3 compare to HappyHorse 1.0?
HappyHorse 1.0 currently ranks higher on the Artificial Analysis Arena (Elo 1,392 vs Vidu Q3's 1,220-1,244) and supports 7-language lip-sync. Vidu Q3 generates 16-second clips (vs 15 seconds), supports multi-entity reference consistency from 1-4 images, and offers 6 built-in VFX types. Vidu Q3 is available on Floyo now; HappyHorse is coming soon.
Can I chain Vidu Q3 with other models?
Yes. Floyo runs ComfyUI, which lets you chain multiple models. Generate a character with Nano Banana, animate with Vidu Q3, add custom narration with Fish Audio S2, upscale with Topaz Video AI. Or use Vidu Q3 for the video and replace the baked audio with a licensed track. All in one pipeline.
Can I use Vidu Q3 output commercially?
Check ShengShu's terms of service for commercial usage details. The generated audio includes music and sound effects that may have specific rights considerations. For brand-critical work, many creators replace the baked audio with their own licensed assets in post-production.
How long does generation take?
Typical generation takes 2-5 minutes depending on resolution and duration. A 16-second 1080p clip takes longer than a 4-second 480p clip. Some API providers offer off-peak pricing for non-urgent jobs that process within 48 hours at reduced cost.
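Because jobs take minutes, scripted use of any video API usually means polling a job status until it finishes. A generic sketch (the status strings and callable are stand-ins, not a documented Floyo or ShengShu interface):

```python
import time

def poll_until_done(get_status, timeout_s=600, interval_s=10):
    """Poll a job-status callable until it reports 'done' or times out.
    get_status stands in for a real status request; the status strings
    here ('running', 'done', 'failed') are assumptions for illustration."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == "done":
            return status
        if status == "failed":
            raise RuntimeError("generation failed")
        time.sleep(interval_s)
    raise TimeoutError("job did not finish within timeout")

# Simulated job: 'running' twice, then 'done'.
states = iter(["running", "running", "done"])
result = poll_until_done(lambda: next(states), timeout_s=5, interval_s=0)
```

A 10-second interval is a reasonable default for clips that finish in 2-5 minutes.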
Try Vidu Q3 on Floyo
16-second 1080p video with native audio, multi-shot sequencing, reference consistency, and cinematic VFX. Run it in your browser.
Try Vidu Q3 Now → Browse All Models
Related Reading
Film and Animation Workflows on Floyo
AI Ad Creatives for Social and Web
Last updated: April 2026. Specs from ShengShu Technology official announcements (PRNewswire), Vidu official website (vidu.com), Artificial Analysis Video Arena, SuperCLUE Reference-to-Video leaderboard, WaveSpeedAI documentation, Atlas Cloud blog, Novita AI documentation, and third-party reviews.
Vidu Q3 for Image to Video
Bring your images to life
Vidu Q3 for Text to Video
Generate cinematic videos from text prompts

