How JD’s Open‑Source JoyAI‑Echo Overcomes the Three Biggest Long‑Video Generation Challenges
JoyAI-Echo, JD's newly open-sourced long-video generation framework, tackles character inconsistency, voice instability, and slow rendering with a cross-modal memory bank, memory-driven training that uses DMD for a roughly 7.5× speedup, a conversational Director Agent, and real-time super-resolution. It reports leading benchmark scores and high user preference.
Current AI short‑video tools produce high‑quality clips, but extending generation to minute‑level videos exposes three critical problems: (1) the same character looks different across consecutive shots, (2) the speaker’s timbre fluctuates or changes abruptly, and (3) generation speed is prohibitively slow, often taking minutes for a single result.
JoyAI‑Echo addresses these issues with four concrete technical innovations:
Cross-modal audio-video memory bank: a dedicated store that continuously writes and retrieves each character's visual features and voice timbre throughout multi-shot generation, so that a five-minute video keeps a consistent identity, appearance, and sound.
Memory-driven training pipeline: combines Supervised Fine-Tuning (SFT), cross-modal Reinforcement Learning from Human Feedback (RLHF), and Distribution Matching Distillation (DMD). DMD alone contributes roughly a 7.5× speedup, which the announcement describes as turning "half-day" generation into near-instant output.
Director Agent: an interactive "director assistant" that parses a natural-language request into a script, characters, scenes, and shots. Users can then edit specific segments through dialogue, which triggers re-generation of only those segments rather than the entire video.
Lightweight real-time super-resolution: supports two upscaling modes, 736×1280 → 1152×1920 and 736×1280 → 1472×2560, delivering high-definition frames even under streaming-latency constraints.
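The memory bank and the Director Agent's partial re-generation can be pictured with a small Python sketch. This is a toy illustration under stated assumptions, not JoyAI-Echo's implementation: the names `MemoryBank`, `Shot`, `generate_shot`, and `regenerate`, the character "Lin", and the use of short float lists in place of real embeddings are all hypothetical. The sketch only shows the idea that every shot reads the same stored character reference, and that an edit re-runs only the affected shots.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CharacterMemory:
    """Reference features for one character (float lists stand in for embeddings)."""
    visual: List[float]
    timbre: List[float]


@dataclass
class Shot:
    index: int
    character: str
    prompt: str
    version: int = 0  # incremented each time the shot is (re)generated


@dataclass
class MemoryBank:
    """Toy cross-modal store keyed by character name (hypothetical design)."""
    entries: Dict[str, CharacterMemory] = field(default_factory=dict)

    def write(self, name: str, visual: List[float], timbre: List[float]) -> None:
        # Keep the first reference written so later shots cannot overwrite it.
        self.entries.setdefault(name, CharacterMemory(visual, timbre))

    def read(self, name: str) -> CharacterMemory:
        return self.entries[name]


def generate_shot(shot: Shot, bank: MemoryBank) -> dict:
    """Placeholder for a real generator: records the memory the shot was conditioned on."""
    memory = bank.read(shot.character)
    shot.version += 1
    return {
        "shot": shot.index,
        "version": shot.version,
        "visual_ref": memory.visual,
        "timbre_ref": memory.timbre,
    }


def regenerate(shots: List[Shot], bank: MemoryBank, edited: Set[int]) -> List[dict]:
    """Re-run only the shots whose index is in `edited`; other shots are untouched."""
    return [generate_shot(s, bank) for s in shots if s.index in edited]


# Usage: three shots share one stored character reference; only shot 1 is edited.
bank = MemoryBank()
bank.write("Lin", visual=[0.1, 0.9], timbre=[0.4, 0.2])
shots = [Shot(i, "Lin", f"shot {i}") for i in range(3)]
first_pass = [generate_shot(s, bank) for s in shots]
redone = regenerate(shots, bank, edited={1})
```

After the edit, only shot 1 has been generated a second time, and it is conditioned on the same visual and timbre reference as the first pass, which is the property the memory bank is meant to provide across shots.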
To evaluate performance, the research team built a dedicated benchmark of 100 stories and 3,000 shots. Across metrics such as cross-shot consistency, overall video quality, text-video alignment, and speech content accuracy, JoyAI-Echo achieved leading results, with speech accuracy reaching 0.8646. In a user-preference survey, JoyAI-Echo was favored by 81.7% of respondents on audio quality, 80.6% on prompt adherence, 63.6% on visual aesthetics, and 59.4% on IP consistency. The framework also outperformed competitors on visual aesthetics and prompt compliance in portrait short-video tasks.
The release opens new possibilities for virtual story creation, digital-human content production, rapid brand-marketing video iteration, film pre-visualization and storyboard generation, interactive educational material, and game cutscene creation. All code, model weights, and documentation are open source and available on GitHub (https://github.com/jd-opensource/JoyAI-Echo) and the project homepage, inviting developers to experiment with and extend the system.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.
