
HappyHorse 1.0: The AI Video Model That Came From Nowhere

In early April 2026, a model called HappyHorse-1.0 appeared on the Artificial Analysis Video Arena leaderboard under a pseudonymous identity. Within hours it sat at #1 in both text-to-video and image-to-video rankings, crushing previously dominant models by margins that hadn't been seen in the arena's history.

Then, days later, it vanished from the public leaderboard entirely.

What the Leaderboard Data Shows

Artificial Analysis runs a blind voting arena: users see two videos from the same prompt, pick the better one without knowing which model produced it, and those votes feed into an Elo rating system. This is the same methodology used in chess rankings — no self-reported benchmarks, no cherry-picked demos.
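The vote-to-rating pipeline can be sketched as a standard Elo update: each blind vote nudges both models' ratings, with the size of the nudge proportional to how surprising the outcome was. The K-factor of 32 is an assumption for illustration; Artificial Analysis does not publish its exact parameters.

```python
def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Apply one blind-vote result to a pair of Elo ratings.

    The winner gains (and the loser loses) k * (actual - expected),
    so upsets move ratings more than expected results do.
    K=32 is an illustrative choice, not the arena's documented value.
    """
    # Logistic expected score for model A given the rating difference.
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

With two equally rated models, a single win moves each rating by exactly half the K-factor; as one model pulls ahead, further expected wins move the ratings less and less.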

HappyHorse-1.0's peak positions:

Category                    | Elo Score | Rank | Gap
Text-to-Video (no audio)    | 1333      | #1   | +60 over Seedance 2.0
Image-to-Video (no audio)   | 1392      | #1   | +37 over Seedance 2.0
Text-to-Video (with audio)  | 1205      | #2   | -14 behind Seedance 2.0
Image-to-Video (with audio) | 1161      | #2   | -1 behind Seedance 2.0

A 60-point Elo gap in the T2V category means HappyHorse would win roughly 58-59% of head-to-head blind matchups against Seedance 2.0. This is not noise — it represents a meaningful quality difference as perceived by human voters.
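The 58-59% figure follows directly from the standard Elo expected-score formula, which maps a rating gap to a predicted win probability:

```python
def elo_expected_score(gap: float) -> float:
    """Expected win rate for the higher-rated model given an Elo gap.

    This is the standard logistic Elo formula: a 400-point gap
    corresponds to roughly 10:1 odds.
    """
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))

# A +60 gap maps to about a 58.5% expected win rate;
# a zero gap is exactly 50/50.
```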

Claimed Architecture

Everything in this section comes from happyhorses.io and happyhorse-ai.com. None of it has been independently verified.

The site describes a 40-layer, single-stream self-attention transformer with 15 billion parameters. The design eliminates cross-attention entirely:

  • First 4 layers: Modality-specific projections (text, image, video, audio)
  • Middle 32 layers: Shared parameters across all modalities
  • Last 4 layers: Modality-specific output projections

Text tokens, reference image latents, and noisy video/audio tokens are jointly denoised within a single token sequence. The model reportedly needs only 8 denoising steps with no Classifier-Free Guidance — a significant departure from the 20-50 step diffusion processes used by most competitors.
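The claimed 4 / 32 / 4 layer split can be sketched structurally. Everything here is an assumption reconstructed from the site's description — the function and role names are illustrative, not a verified implementation:

```python
# Modalities the site claims share one token sequence.
MODALITIES = ("text", "image", "video", "audio")

def build_layer_plan(total_layers: int = 40, io_layers: int = 4):
    """Return a list of layer roles matching the claimed layout:
    modality-specific projections at both ends, a shared
    single-stream self-attention trunk in the middle.
    """
    shared = total_layers - 2 * io_layers  # 40 - 2*4 = 32 shared layers
    return (
        [f"in_proj[{'+'.join(MODALITIES)}]"] * io_layers
        + ["shared_self_attn"] * shared
        + [f"out_proj[{'+'.join(MODALITIES)}]"] * io_layers
    )
```

The point of the sketch is the parameter-sharing claim: only 8 of the 40 layers would be modality-specific, so the bulk of the 15B parameters would sit in a trunk that attends over text, image, video, and audio tokens jointly, with no separate cross-attention path between streams.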

Multilingual Audio-Video Generation

The site claims native support for six languages: Chinese, English, Japanese, Korean, German, and French. A secondary marketing site adds Cantonese as a seventh language with "ultra-low WER lip-sync."

The joint audio-video generation means dialogue, ambient sounds, and Foley effects are produced alongside video frames in a single pass — no post-production dubbing pipeline required.

What's Not Available

As of April 8, 2026:

  • No downloadable weights: Both GitHub and Model Hub links say "Coming Soon"
  • No public API: No endpoint, no documented pricing, no SLA
  • No paper: No technical report has been published
  • No team identity: Artificial Analysis described the submission as "pseudonymous"

The site states "Base model, distilled model, super-resolution model, and inference code — all released." The links contradict this claim.

Why It Matters

Even without accessible weights, HappyHorse-1.0 demonstrated three things:

  1. Single-stream architectures can reach SOTA: Eliminating cross-attention and multi-stream pipelines was widely seen as trading quality for simplicity. HappyHorse's ranking suggests it may be an upgrade instead.

  2. 8-step inference is viable for top-quality output: This has major implications for inference cost and throughput at scale.

  3. Anonymous drops work: The "rank first, reveal later" strategy — previously seen with Pony Alpha / GLM-5 — is becoming a pattern in the Chinese AI ecosystem.
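The inference-cost implication of point 2 is easy to quantify. The comparison below is back-of-envelope: the 50-step CFG baseline is an assumed typical competitor, and it assumes equal per-pass cost across models:

```python
def forward_passes(steps: int, cfg: bool) -> int:
    """Total model forward passes for one sampling run.

    Classifier-Free Guidance requires two passes per step
    (conditional + unconditional); without CFG, one pass suffices.
    """
    return steps * (2 if cfg else 1)

happyhorse = forward_passes(8, cfg=False)   # 8 passes (claimed)
baseline = forward_passes(50, cfg=True)     # 100 passes (assumed typical)
speedup = baseline / happyhorse             # 12.5x fewer forward passes
```

Under those assumptions, an 8-step CFG-free sampler does over an order of magnitude fewer forward passes per clip than a 50-step CFG sampler, which is why the claim matters for serving cost and throughput.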

For teams building video generation pipelines, the practical leaderboard starts at position #3 (SkyReels V4) since neither HappyHorse nor Seedance 2.0 offers public API access. But if HappyHorse releases weights, the calculus changes overnight.