
COMMUNITY PAGE

Run HappyHorse on Floyo


AI VIDEO GENERATION

Run HappyHorse 1.0 on Floyo

#1 on the Artificial Analysis Video Arena. 15B parameter unified Transformer that generates 1080p video with synchronized audio in a single forward pass. Multi-shot sequencing, lip-sync in 7 languages, and cinematic shallow depth-of-field.

Run Alibaba's HappyHorse 1.0 through ComfyUI in your browser. No API key, no installs, no local GPU.

Resolution: Native 1080p

Arena Ranking: #1 Elo (T2V + I2V)

Duration: Up to 15 seconds

Audio: Joint audio-video (7 languages)

Run now on Floyo →

Browse All Models

No installation. Runs in browser. Updated April 2026.

What do you get?

HappyHorse 1.0 is Alibaba Token Hub's 15B parameter video generation model, which took the #1 spot on the Artificial Analysis Video Arena in both Text-to-Video (Elo 1,333) and Image-to-Video (Elo 1,392) within days of its arena debut. It uses a unified 40-layer self-attention Transformer that generates video and synchronized audio (lip-sync, ambient sound, Foley) in a single forward pass with no cross-attention. Supports T2V, I2V, S2V (Subject-to-Video), V2V editing, and SV2V (Subject+Video-to-Video). Up to 15 seconds of 1080p multi-shot video. Lip-sync in 7 languages. Coming soon as a ComfyUI API node on Floyo.

What is HappyHorse 1.0?

HappyHorse 1.0 is a 15B parameter video generation model from Alibaba's ATH (Alibaba Token Hub) business unit. It appeared anonymously on the Artificial Analysis Video Arena on April 7, 2026, and within days took the #1 spot in both Text-to-Video (Elo 1,333) and Image-to-Video (Elo 1,392). Alibaba officially claimed the model on April 10, 2026. The API launched April 27, 2026 through fal as the first official provider.

The project was led by Zhang Di, a 15-year AI industry veteran who previously served as VP at Kuaishou and was the technical architect of Kling AI. It sits under Alibaba's Taotian Future Life Lab, which is consumer-facing (e-commerce, advertising, entertainment), not enterprise cloud. This points toward product integration with Taobao, Tmall, and Alibaba's social commerce ecosystem.

HappyHorse generates video and audio jointly in a single forward pass. This is not a video model with a TTS system pasted on. Text, image, video, and audio tokens are placed in one flat sequence and attend to each other via standard self-attention. The result is synchronized lip-sync, ambient soundscapes, and emotionally expressive vocal performances that are generated together with the visual content.

The model excels at cinematic output. Wide-aperture shallow depth-of-field, atmospheric lighting, refined texture and detail, and rich spatial depth are consistent strengths. Multi-shot consistency maintains stable character positioning across frequent camera cuts. High-speed dynamic action (motorcycle chases, racing circuits, night sequences) renders with physical plausibility.

On Floyo, HappyHorse 1.0 will run through ComfyUI API nodes. You will be able to chain it with other models in the same workflow: generate with HappyHorse, upscale with Topaz, add custom narration with Fish Audio S2. The ComfyUI integration is coming soon.

What are HappyHorse 1.0's technical specifications?

HappyHorse 1.0 is a 15B parameter unified single-stream Transformer with 40 self-attention layers (8 modality-specific + 32 shared). It generates 1080p video at up to 15 seconds with synchronized audio in a single forward pass. No cross-attention. Lip-sync in 7 languages. Inference takes about 38 seconds for a 1080p clip on a single H100. Supports 720p and 1080p at 16:9, 9:16, 1:1, 4:3, and 3:4 aspect ratios.
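
At that speed, a single H100 turns out roughly 95 clips per hour. A quick back-of-envelope check (the 38-second figure is from Alibaba's published materials; the rest is simple arithmetic):

```python
# Back-of-envelope throughput from the stated ~38 s per 1080p clip on one H100.
SECONDS_PER_CLIP = 38                      # published inference time
clips_per_hour = 3600 / SECONDS_PER_CLIP   # ~94.7 clips per GPU-hour
footage_seconds = clips_per_hour * 15      # at the 15 s maximum duration
print(f"{clips_per_hour:.0f} clips/hour, ~{footage_seconds / 60:.0f} min of footage")
```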

Developer: Alibaba Token Hub (ATH) / Taotian Future Life Lab
Led By: Zhang Di (former VP at Kuaishou, technical architect of Kling AI)
Architecture: Unified 40-layer single-stream self-attention Transformer (no cross-attention)
Parameters: 15 billion
Layer Structure: Layers 1-4 and 37-40 are modality-specific projections; layers 5-36 are fully shared across all modalities
Resolution: 720p or 1080p
Duration: Up to 15 seconds per generation
Aspect Ratios: 16:9, 9:16, 1:1, 4:3, 3:4
Audio: Joint audio-video generation (lip-sync, ambient, Foley) in one forward pass
Lip-Sync Languages: 7 (Mandarin, Cantonese, English, Japanese, Korean, German, French)
Modes: T2V, I2V, S2V (Subject-to-Video), V2V (Video Edit), SV2V (Subject+Video-to-Video)
Multi-Shot: Yes (multiple camera cuts with character consistency)
Inference Speed: ~38 seconds per 1080p clip (single H100)
Arena Rankings: #1 T2V (Elo 1,333) and #1 I2V (Elo 1,392) on Artificial Analysis
Sample Count: 14,099 arena samples (±6 Elo confidence interval)
Open Source: Planned (base model, distilled model, super-res model, inference code)
ComfyUI Access: API-based nodes (coming soon to Floyo)
Release Date: April 7, 2026 (arena debut) / April 10, 2026 (official reveal) / April 27, 2026 (API launch)

What can you create with HappyHorse 1.0?

HappyHorse 1.0 covers text-to-video, image-to-video, subject-to-video (character insertion), video-to-video editing, and subject+video-to-video (character replacement in existing footage). All modes produce synchronized audio. The model is designed for advertising, short-form video production, and social media marketing with near-live-action visual quality.

Text-to-Video: Generate 1080p video with synchronized audio from a text prompt. Multi-shot sequencing with camera cuts and character consistency. Use cases: ads, short films, social content, product demos.

Image-to-Video: Animate a still image into cinematic video with synchronized audio. Preserves the source image composition while adding natural motion. Use cases: photo animation, product showcases, hero content.

Subject-to-Video: Insert a specific subject from a reference image into generated video while preserving their appearance and identity. Use cases: brand ambassadors, character content, personalized ads.

Video Editing (V2V): Modify existing video while preserving original structure and motion. Style transfer, element replacement, lighting adjustments. Use cases: post-production, footage restyling, client revisions.

Subject Replacement (SV2V): Replace or insert a subject from a reference image into existing video. Preserves the original video's motion, composition, and unaffected regions. Use cases: talent swaps, product placement, localized variations.

Joint Audio Generation: Lip-synced dialogue in 7 languages, ambient soundscapes, Foley sounds, and emotionally expressive vocals. Generated in the same forward pass as video. Use cases: talking-head content, music videos, immersive scenes.
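
Once the fal endpoint schema is public, calling one of these modes from Python should look roughly like the sketch below. It uses fal's real `fal_client.subscribe` call, but the endpoint ID and argument names are assumptions for illustration, not the published schema:

```python
# Hedged sketch of a HappyHorse text-to-video call via fal's Python client.
# Requires `pip install fal-client` and a FAL_KEY environment variable.
import fal_client

result = fal_client.subscribe(
    "fal-ai/happyhorse/v1/text-to-video",  # assumed endpoint ID, not confirmed
    arguments={
        "prompt": (
            "Night street motorcycle chase, shallow depth-of-field, "
            "two camera cuts, engine roar and rain ambience"
        ),
        "resolution": "1080p",     # assumed parameter; model offers 720p/1080p
        "aspect_ratio": "16:9",    # assumed; 16:9, 9:16, 1:1, 4:3, 3:4 supported
        "duration_seconds": 15,    # assumed; clips run up to 15 s
    },
)
print(result["video"]["url"])      # assumed response shape
```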

What are HappyHorse 1.0's key features?

HappyHorse 1.0's feature set is designed for professional video production. The unified Transformer generates video and audio together. The cinematic pipeline handles shallow depth-of-field, atmospheric lighting, and multi-shot consistency. The 5-mode system (T2V, I2V, S2V, V2V, SV2V) covers the full production workflow from creation to editing.

Unified Single-Stream Transformer

HappyHorse does not use cross-attention to connect text and video. Text, image, video, and audio tokens sit in one flat sequence and attend to each other via standard self-attention. The 40-layer "sandwich" architecture uses 8 modality-specific projection layers (4 at the start, 4 at the end) and 32 fully shared layers in between. This is why audio and video are synchronized by default rather than aligned after the fact.
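
As an intuition aid, here is a minimal PyTorch sketch of that sandwich layout. The 4 + 32 + 4 layer split follows the published description; the dimensions, projection design, and everything else are illustrative assumptions, not Alibaba's implementation:

```python
# Toy "sandwich" Transformer: per-modality projections in and out,
# one shared self-attention stack in the middle, no cross-attention.
import torch
import torch.nn as nn

D = 256  # hidden size (assumed; the real model's width is unpublished)

class SandwichTransformer(nn.Module):
    def __init__(self, modalities=("text", "image", "video", "audio")):
        super().__init__()
        # Layers 1-4 and 37-40: modality-specific projection stacks.
        self.proj_in = nn.ModuleDict(
            {m: nn.Sequential(*[nn.Linear(D, D) for _ in range(4)]) for m in modalities}
        )
        self.proj_out = nn.ModuleDict(
            {m: nn.Sequential(*[nn.Linear(D, D) for _ in range(4)]) for m in modalities}
        )
        # Layers 5-36: 32 blocks shared by every modality.
        block = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.shared = nn.TransformerEncoder(block, num_layers=32)

    def forward(self, tokens: dict) -> dict:
        # Project each modality, then concatenate into ONE flat sequence so
        # audio, video, image, and text tokens attend to each other directly.
        parts = {m: self.proj_in[m](x) for m, x in tokens.items()}
        seq = self.shared(torch.cat(list(parts.values()), dim=1))
        out, start = {}, 0
        for m, x in parts.items():
            out[m] = self.proj_out[m](seq[:, start:start + x.shape[1]])
            start += x.shape[1]
        return out

model = SandwichTransformer()
batch = {m: torch.randn(1, 8, D) for m in ("text", "video", "audio")}
outputs = model(batch)  # each modality's output has attended to all others
```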

Joint Audio-Video Generation

Audio is not post-dubbed. Lip-synced dialogue, ambient soundscapes, Foley sounds, and emotionally expressive vocal performances are generated in the same forward pass as the video. Lip-sync supports 7 languages: Mandarin Chinese, Cantonese, English, Japanese, Korean, German, and French. The audio understands what is happening visually and matches it, including environmental acoustics.

Cinematic Depth-of-Field

The model excels at wide-aperture, shallow depth-of-field cinematography. Atmospheric visual language, fine-grained image texture, and rich spatial depth and visual layering are consistent strengths. This gives output a "shot on cinema glass" look rather than the flat, evenly-lit quality common in AI-generated video.

Multi-Shot Consistency

The model maintains stable character positioning across frequent camera cut transitions within a single generation. This makes it suited for short dramas with camera movement and emotional atmosphere: suspenseful confrontation scenes, romance narratives, and dialogue-driven sequences. A 15-second output can feel like an edited multi-shot sequence rather than one continuous clip.

High-Speed Action

The model handles physical simulation for high-speed dynamic scenes: street motorcycle chases, high-speed tracking shots on racing circuits, and night-time motorbike sequences. Objects interact with physically convincing behavior, and camera motion (tracking, crane, dolly) follows the action naturally. This extends the model's use beyond talking-head content into action-oriented production.

5-Mode Production System

T2V (text-to-video), I2V (image-to-video), S2V (subject-to-video), V2V (video editing), and SV2V (subject+video-to-video) cover the full lifecycle. Generate from scratch, animate a still, insert a character, edit existing footage, or swap a subject in an existing clip. All five modes share the same unified architecture and produce synchronized audio.

How does HappyHorse 1.0 compare to other video models?

HappyHorse 1.0 holds #1 on the Artificial Analysis Video Arena in both T2V and I2V with 14,099 samples and a tight ±6 Elo confidence interval. Seedance 2.0 leads on multi-modal reference input (12 files). Kling 3.0 Omni offers 4K at 60fps with multi-shot storyboarding. Wan 2.7 leads on open-source flexibility. HappyHorse's edge: highest arena ranking, joint audio-video in one pass, and cinematic depth-of-field quality.

HappyHorse 1.0: 1080p, joint audio with 7-language lip-sync, unified single-stream Transformer, I2V Elo 1,392 (#1)
Seedance 2.0: 2K, audio in 8+ languages, Dual-Branch DiT, I2V Elo 1,269
Kling 3.0 Omni: 4K at 60fps, audio in 5+ languages, DiT + 3D spatiotemporal, no I2V Elo listed
Wan 2.7: up to 4K, no audio, DiT + thinking mode, no I2V Elo listed

Source: Artificial Analysis Video Arena (April 2026), fal API documentation, Alibaba ATH official materials, CNBC and Bloomberg reporting, and third-party benchmark comparisons.

How does HappyHorse 1.0 work?

HappyHorse 1.0 uses a unified 40-layer self-attention Transformer with 15 billion parameters. Text, image, video, and audio tokens are placed in one flat sequence. The first 4 and last 4 layers are modality-specific projections (separate learned projections for each token type). The middle 32 layers are fully shared: all modalities attend to each other via standard self-attention with no cross-attention gating.

This single-stream design is what makes joint audio-video generation possible. Because audio and video tokens share the same attention space, the model naturally learns the relationship between visual events and their sounds. A door slamming in the video produces a corresponding sound. A character speaking produces lip-synced audio in the matching language. The synchronization is learned, not engineered.

The cinematic quality comes from training data curation. The model was trained on data selected for cinematographic qualities: shallow depth-of-field, atmospheric lighting, consistent color grading, and professional framing. This is why output looks like footage from a cinema camera rather than a phone camera, even when the prompt is simple.

On Floyo, HappyHorse 1.0 will run through ComfyUI API nodes. Your prompt and any reference images are sent to inference servers, and the generated video with synchronized audio streams back to your ComfyUI canvas. You will be able to chain HappyHorse with local processing nodes in the same workflow: upscale, color grade, add subtitles, or combine with other model outputs.
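
To make that round trip concrete, here is a hedged sketch of "generate remotely, post-process locally" outside ComfyUI. `fal_client.submit` and `requests` are real APIs; the endpoint ID and response shape are assumptions, and ffmpeg stands in for whatever local node (upscale, grade, subtitle) you would chain on the canvas:

```python
# Hedged sketch: remote generation via fal, then a local post-processing step.
import subprocess

import fal_client
import requests

# 1. Queue an image-to-video job (endpoint ID and arguments are assumed).
handle = fal_client.submit(
    "fal-ai/happyhorse/v1/image-to-video",
    arguments={
        "image_url": "https://example.com/still.jpg",  # placeholder input
        "prompt": "slow dolly-in, soft rain ambience",
    },
)
result = handle.get()  # block until the queued job finishes

# 2. Download the generated clip (response shape assumed).
with open("happyhorse_clip.mp4", "wb") as f:
    f.write(requests.get(result["video"]["url"], timeout=120).content)

# 3. Local post-processing, e.g. a simple 2x Lanczos upscale with ffmpeg.
subprocess.run(
    ["ffmpeg", "-y", "-i", "happyhorse_clip.mp4",
     "-vf", "scale=iw*2:ih*2:flags=lanczos", "-c:a", "copy",
     "happyhorse_clip_2x.mp4"],
    check=True,
)
```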

Note: HappyHorse 1.0 is API-based. The model appeared on April 7, 2026, was revealed by Alibaba on April 10, and the fal API launched April 27. Open-source weights are planned (base model, distilled model, super-resolution model, inference code) but not yet released. Content filtering is active on the API. On Floyo, the ComfyUI integration is coming soon.

Frequently Asked Questions

Common questions about running HappyHorse 1.0 on Floyo.

Is HappyHorse 1.0 free to use on Floyo?

You can start with Floyo's free plan, which includes $0.25 in API credits on signup. To continue beyond the free tier, upgrade your Floyo pricing plan. HappyHorse 1.0 runs as an API node, so generation costs come from your API Wallet (separate from your plan's GPU time).

How do I run HappyHorse 1.0 without installing anything?

Once available on Floyo, open the platform in your browser, find a HappyHorse workflow (search "HappyHorse" in the template library), and click Run. Write your prompt, optionally upload reference images, and generate. Floyo handles the ComfyUI environment and API connection. No local install, no Python setup.

Who made HappyHorse 1.0?

Alibaba's ATH (Alibaba Token Hub) business unit, through its Taotian Future Life Lab. The project is led by Zhang Di, former VP at Kuaishou and technical architect of Kling AI. HappyHorse appeared anonymously on the Artificial Analysis Video Arena on April 7, 2026. Alibaba officially claimed it on April 10. The fal API launched April 27.

Does HappyHorse generate audio with video?

Yes. Audio and video are generated in a single forward pass. The unified Transformer produces lip-synced dialogue in 7 languages (Mandarin, Cantonese, English, Japanese, Korean, German, French), ambient soundscapes, Foley sounds, and emotionally expressive vocals. This is not audio pasted onto silent video.

How does HappyHorse compare to Seedance 2.0?

HappyHorse ranks higher on the Artificial Analysis Video Arena (Elo 1,392 I2V vs Seedance's 1,269). Seedance 2.0 leads on multi-modal reference input (12 files per generation) and supports 2K resolution. HappyHorse leads on cinematic depth-of-field quality and joint audio-video synchronization from a single unified architecture. Both will be available on Floyo.

Can I combine HappyHorse with other AI models in one workflow?

Yes. That is the advantage of running HappyHorse through ComfyUI on Floyo. Generate with HappyHorse, upscale with Topaz Video AI, add custom narration with Fish Audio S2 or Chatterbox, or composite multiple outputs together. All in one pipeline, all in your browser.

Will HappyHorse be open source?

Alibaba has stated that the base model, distilled model, super-resolution model, and inference code are planned for open-source release. As of April 2026, the GitHub and Model Hub links are not yet live and no public weights are available. The API is live through fal as the first official provider.

When will HappyHorse 1.0 be available on Floyo?

HappyHorse 1.0 is coming soon to Floyo as a ComfyUI API node. The fal API launched April 27, 2026, and Floyo is working on the integration. Check back for updates or sign up to be notified when the workflow goes live.

HappyHorse 1.0 is Coming to Floyo

#1 on the Artificial Analysis Video Arena. 1080p with joint audio-video, multi-shot sequencing, cinematic depth-of-field, and 7-language lip-sync. Run it in your browser.

Related Reading

Film and Animation Workflows on Floyo

AI Ad Creatives for Social and Web

Top AI Models on Floyo

Last updated: April 2026. Specs from Alibaba Token Hub official materials, fal API documentation, Artificial Analysis Video Arena rankings, CNBC reporting (April 10, 2026), Bloomberg reporting, and third-party architecture analysis.

Happy Horse 1.0 Reference to Video

Tags: character design, consistency, happy horse, image to video, reference to video, video generation

Turn up to 9 reference images plus a prompt into a 5-second video with Happy Horse 1.0. Keep characters, products, and style consistent across the shot.

Happy Horse 1.0 - Image to Video

Tags: consistency, film production, happy horse, image to video, product photography, video generation

Animate a still image with Happy Horse 1.0. Upload a frame, describe the motion you want, get a 5-second clip with stable physics and consistent details.

Happy Horse 1.0 - Text to Video

Tags: animation, film production, happy horse, text to video, video generation

Generate cinematic video with synchronized audio from a text prompt using Alibaba's Happy Horse 1.0. Pick resolution, aspect ratio, and clip length up to 15s.

Happy Horse 1.0 Video Editing

Tags: consistency, film production, happy horse, style transfer, vid2vid, video generation

Edit any video with Happy Horse 1.0 by uploading up to 5 reference images. Swap backgrounds, change subjects, or shift style. Original motion stays intact.
