Testing Alibaba’s HappyHorse 1.0: All‑in‑One Audio‑Video AI That Edits Itself

Alibaba’s HappyHorse 1.0, a native multimodal video generation model launched on April 27, combines audio‑video synthesis and editing in a single platform and tops several AI video benchmarks. It offers low‑cost per‑second pricing and demonstrates strong scene understanding across a series of prompt‑driven examples, though minor glitches such as occasional text artifacts remain.

Machine Heart

On April 27, Alibaba’s ATH team released HappyHorse 1.0, a native multimodal video generation model that delivers audio‑video creation and editing within a single platform.

The architecture enables “one‑pot” output: users can generate videos from text, from images, or from a mix of multiple reference images, and then edit the results without re‑shooting.

In external evaluations, HappyHorse 1.0 topped the Artificial Analysis leaderboard, taking first place in both text‑to‑video and image‑to‑video categories and pushing Seedance 2.0 to second. On the Arena ranking, it ranked first for video editing and second for text‑ and image‑to‑video generation.

Pricing is positioned as cost‑effective: 720p generation costs ¥0.9 per second and 1080p costs ¥1.6 per second; a professional monthly subscription with limited‑time discounts reduces these rates to ¥0.44 and ¥0.78 per second, respectively.
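To make the per‑second rates concrete, here is a minimal cost sketch. The rates come from the article; the helper function and clip lengths are illustrative, not part of any official SDK.

```python
# Per-second rates (in yuan) as listed in the article.
RATES = {
    "720p": {"standard": 0.9, "pro_discount": 0.44},
    "1080p": {"standard": 1.6, "pro_discount": 0.78},
}

def clip_cost(resolution: str, seconds: float, tier: str = "standard") -> float:
    """Return the cost in yuan for a clip of the given length and tier."""
    return round(RATES[resolution][tier] * seconds, 2)

# A 15-second clip (the model's maximum length, per the article):
print(clip_cost("720p", 15))                  # 13.5
print(clip_cost("1080p", 15))                 # 24.0
print(clip_cost("720p", 15, "pro_discount"))  # 6.6
```

So even at the standard tier, a maximum‑length 1080p clip costs ¥24, which is the basis for the "cost‑effective" positioning.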

The model is publicly accessible: professional creators and enterprise customers can use it via the HappyHorse website or Alibaba Cloud Bailei platform, while general users can try it in the Qianwen app.

Video Generation Capabilities

HappyHorse 1.0 supports three generation modes: pure text‑to‑video, image‑to‑video, and multi‑image reference video. Simple prompts produce complex scenes. For example, the prompt “A cyclist racing through a narrow alley, handheld camera feel, dynamic motion blur, realistic shadows, intense pacing” yields a smooth, naturally cut sequence of a cyclist navigating a tight lane.
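The three modes and the 3–15 second duration range described in this section can be captured in a request sketch. HappyHorse’s actual API schema is not documented in the article, so the field names and the helper below are assumptions for illustration only.

```python
import json

# Generation modes named in the article; the key strings are assumptions.
MODES = {"text-to-video", "image-to-video", "multi-image-reference"}

def build_generation_request(prompt: str, mode: str = "text-to-video",
                             seconds: int = 5, resolution: str = "720p") -> str:
    """Assemble a JSON body for a hypothetical generation endpoint."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    # The article states supported durations run from 3 to 15 seconds.
    if not 3 <= seconds <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    payload = {
        "mode": mode,
        "prompt": prompt,
        "duration_seconds": seconds,
        "resolution": resolution,
    }
    return json.dumps(payload)

body = build_generation_request(
    "A cyclist racing through a narrow alley, handheld camera feel",
    seconds=8,
)
```

The duration check mirrors the model’s stated 3–15 second range; everything else is a placeholder for whatever interface the HappyHorse site or Bailei platform actually exposes.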

Another prompt describing “High‑speed chase thriller reimagined as a hamster in a toy car pursuing a rolling cheese wheel through a kitchen obstacle course, low‑angle ground‑level pursuits, quick‑cut jumps, barrel rolls, triumphant slow‑mo finish with confetti explosions” generates a vivid, cartoon‑style chase with coherent camera moves.

When asked to create a one‑minute basketball advertisement (“Make a professional ad for basketball”), the model produces continuous dribbling, jumping, and shooting actions, automatically inserting slow‑motion highlights and a placeholder for a brand logo, closely resembling a commercial‑grade clip.

In a K‑pop girl‑group MV scenario (“K‑pop girl group MV. Futuristic studio, five members in pink‑white outfits, energetic dance, wide‑angle dolly to close‑up wink, ending with a group freeze amid rain of silver sequins”), HappyHorse 1.0 synchronizes choreography, camera transitions, and lighting to deliver a polished music‑video style output.

A more challenging test mixes humans and robots in a soccer match (“Soccer of the future, mixing people and robots. Fragment from a 2026 cinematic movie”). The resulting video shows seamless ball handling, dribbling, and a final goal, with multiple subjects moving cooperatively in a single shot.

The model also handles arbitrary durations from 3 to 15 seconds, automatically adjusting shot composition based on the requested length.

Stylistic flexibility is demonstrated with a tiny‑city micro‑landscape prompt (“tiny city built on a desk, small cars moving, camera fly‑through, playful, crisp detail”), producing accurate perspective, depth of field, and smooth camera trajectories.

Image‑to‑Video Performance

Using a 3×3 grid of Beijing travel photos, HappyHorse 1.0 generated a travel vlog where each frame preserves the original subjects, composition, clothing, expressions, and locations, while adding natural handheld camera shake and subtle motion.

Minor issues appear, such as garbled text in the final frame of some videos.

Video Editing Capabilities

HappyHorse 1.0 supports one‑sentence edits. Replacing a cat with a golden retriever retains the tail wagging, the sofa background, and the camera cuts, even preserving the sunglasses on the animal. Adding a stylish blonde model walking out of a convenience store while a car passes demonstrates the model’s ability to insert new elements that respect spatial logic, camera angle, and lighting.

Style conversion from anime to photorealistic rendering proceeds without noticeable artifacts or distortion of characters and motions.

Conclusion

HappyHorse 1.0 showcases strong fundamentals in visual quality, realistic character rendering, and fluid camera motion, addressing core challenges that content creators face daily. By integrating generation and editing, it marks a significant step forward, though occasional bugs indicate the model is still evolving.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
