How JoyAI‑Echo Overcomes Forgetting in Minute‑Long Video Generation

JoyAI‑Echo introduces a cross‑modal audio‑visual memory bank, a three‑stage post‑training pipeline, and a Director Agent to enable consistent, high‑quality, real‑time generation of minute‑level videos, achieving up to 7.5× inference speedup and state‑of‑the‑art benchmark scores.

SuanNi

Generating minute‑level continuous narrative video is plagued by character inconsistency, voice mismatch, degraded quality, and slow speed.

Memory solves forgetting

JoyAI‑Echo builds a Cross‑Modal Audio‑Visual Memory Bank. Each slot pairs visual memory (appearance, expression) with audio memory (voice timbre) for the same historical event. During generation, the model retrieves relevant slots, using the first three slots as anchors and the latest four as context, ensuring consistency over 5‑minute videos.
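The anchor-plus-recent selection described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the `MemorySlot` layout and the `select_slots` helper are hypothetical names, the bank is assumed to be ordered chronologically, and the relevance-based retrieval step is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemorySlot:
    """One historical event with paired memories (hypothetical layout)."""
    shot_index: int
    visual: List[float]  # stand-in for appearance/expression features
    audio: List[float]   # stand-in for voice-timbre features


def select_slots(bank: List[MemorySlot], n_anchor: int = 3,
                 n_recent: int = 4) -> List[MemorySlot]:
    """Return the first n_anchor slots plus the latest n_recent, without duplicates."""
    if len(bank) <= n_anchor + n_recent:
        # Short histories fit entirely; slicing both ends would duplicate slots.
        return list(bank)
    return bank[:n_anchor] + bank[-n_recent:]


bank = [MemorySlot(i, [float(i)], [float(i)]) for i in range(10)]
selected = select_slots(bank)
print([s.shot_index for s in selected])  # [0, 1, 2, 6, 7, 8, 9]
```

Keeping the earliest slots fixed gives the model a stable identity reference, while the most recent slots carry short-range continuity; the conditioning size stays constant however long the video grows.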

Attention design: In the audio branch, the first 70% of Transformer layers mask memory tokens so the model focuses on the current speech; the last 30% open interaction with memory. Cross‑modal interaction uses slot‑aligned masks, so slot i attends only to its counterpart in the other modality, which prevents mismatched face‑voice pairing.
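Both masking rules can be expressed as simple boolean predicates. The sketch below is an assumption-laden illustration: the function names, the 20-layer depth, and the rounding rule used to place the 70% boundary are hypothetical, and a real model would apply these masks inside its attention kernels.

```python
from typing import List


def memory_open(layer: int, num_layers: int, masked_fraction: float = 0.7) -> bool:
    """True if this audio-branch layer may attend to memory tokens."""
    # Assumed boundary rule: round the masked share to the nearest layer.
    return layer >= round(num_layers * masked_fraction)


def slot_aligned_mask(num_slots: int) -> List[List[bool]]:
    """mask[i][j] is True only when audio slot i may attend to visual slot j."""
    return [[i == j for j in range(num_slots)] for i in range(num_slots)]


num_layers = 20  # hypothetical depth
open_layers = [l for l in range(num_layers) if memory_open(l, num_layers)]
print(open_layers)  # [14, 15, 16, 17, 18, 19]

mask = slot_aligned_mask(3)
for row in mask:
    print(row)
```

The diagonal mask is what keeps a voice from being paired with the wrong face: off-diagonal entries are never allowed to exchange information.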

Training tricks: memory‑length‑aware loss weighting strengthens supervision on memory slots, and audio‑to‑video gradient amplification multiplies the gradients by a factor that grows from 2× to 6×, tightening lip‑speech coupling.
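The article gives only the endpoints of the amplification factor (2× and 6×), not what it grows over or the shape of the curve. The sketch below assumes a linear ramp over training steps; `amplification_factor` and `amplify` are hypothetical helpers for illustration.

```python
from typing import List


def amplification_factor(step: int, total_steps: int,
                         start: float = 2.0, end: float = 6.0) -> float:
    """Linearly interpolate the gradient multiplier (assumed schedule)."""
    if total_steps <= 1:
        return end
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * progress


def amplify(grads: List[float], factor: float) -> List[float]:
    """Scale audio-to-video gradients by the current factor."""
    return [g * factor for g in grads]


print(amplification_factor(0, 101))    # 2.0
print(amplification_factor(50, 101))   # 4.0
print(amplification_factor(100, 101))  # 6.0
print(amplify([1.0, -0.5], amplification_factor(50, 101)))  # [4.0, -2.0]
```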

Data construction

The team collected millions of long‑form videos and clustered them by identity, building a corpus in which each identity appears under diverse lighting, clothing, expressions, and backgrounds. A four‑step pipeline (global identity clustering, scene grouping, local role assignment, diversity filtering) yields over 1 M unique identities, each linked to multiple high‑quality clips.

Post‑training pipeline

Memory‑aware SFT: Single‑shot videos are treated as zero‑memory multi‑shot cases; multi‑shot data are sampled during fine‑tuning, and resolution is progressively increased from 480p to 720p.

Cross‑modal RLHF (OmniNFT): Addresses three pitfalls of naïve multimodal RL (misaligned video/audio rewards, gradient leakage, and uniform credit assignment) by routing rewards per modality, cutting video gradients in shallow audio layers, and weighting the loss on attention‑identified speech regions.

Memory‑aware DMD distillation: Compresses a multi‑step teacher into an 8‑step student while sharing the same memory conditions, using EMA‑smoothed optimizer momentum and a 1:0.5 video‑to‑audio loss weight. Memory inputs are degraded during distillation to improve robustness.
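Two details of the distillation stage lend themselves to a short sketch: the 1:0.5 video-to-audio loss weighting and the degradation of memory inputs. The weights come from the text; the degradation method is not described, so the random dropout plus Gaussian noise in `degrade_memory` is purely an assumed stand-in, and all function names are hypothetical.

```python
import random
from typing import List

VIDEO_WEIGHT = 1.0  # from the stated 1:0.5 video-to-audio ratio
AUDIO_WEIGHT = 0.5


def combined_loss(video_loss: float, audio_loss: float) -> float:
    """Weighted sum of the per-modality distillation losses."""
    return VIDEO_WEIGHT * video_loss + AUDIO_WEIGHT * audio_loss


def degrade_memory(features: List[float], noise_std: float,
                   drop_prob: float, rng: random.Random) -> List[float]:
    """Assumed degradation: zero some entries, add Gaussian noise to the rest."""
    out = []
    for x in features:
        if rng.random() < drop_prob:
            out.append(0.0)
        else:
            out.append(x + rng.gauss(0.0, noise_std))
    return out


print(combined_loss(2.0, 1.0))  # 2.5

rng = random.Random(0)
noisy = degrade_memory([1.0, 2.0, 3.0, 4.0], noise_std=0.1, drop_prob=0.25, rng=rng)
print(len(noisy))  # 4
```

Training the student on imperfect memory is a plausible way to keep it stable at inference time, when memory slots come from its own earlier, imperfect outputs.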

Director Agent for interactive creation

The Agent expands vague user intent into a script, role cards, scene cards, and a shot plan. It retrieves relevant memory entries, invokes JoyAI‑Echo, and writes the results back to a history manager. Fixed memory encodes identity, appearance, and timbre; dynamic memory selects KOK (key‑frame‑to‑key‑shot) pairs for narrative continuity. Users can iteratively review and edit individual shots; the Agent updates only the affected slots, enabling a “watch‑and‑revise” workflow.
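The selective update behind this workflow can be illustrated with a toy history store. Everything here is hypothetical: `HistoryManager`, `revise`, and `fake_generate` are illustrative names, and the generator is a string stub standing in for a call to the video model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HistoryManager:
    """Hypothetical store mapping each shot id to its generated result."""
    results: Dict[str, str] = field(default_factory=dict)
    regenerated: List[str] = field(default_factory=list)


def revise(history: HistoryManager, edits: Dict[str, str],
           generate: Callable[[str], str]) -> None:
    """Regenerate only the shots named in `edits`; leave every other entry as-is."""
    for shot_id, new_prompt in edits.items():
        history.results[shot_id] = generate(new_prompt)
        history.regenerated.append(shot_id)


def fake_generate(prompt: str) -> str:
    # Stand-in for a call to the video model.
    return f"clip<{prompt}>"


history = HistoryManager({"shot1": "clip<intro>",
                          "shot2": "clip<chase>",
                          "shot3": "clip<ending>"})
revise(history, {"shot2": "chase at night"}, fake_generate)
print(history.results["shot2"])  # clip<chase at night>
print(history.regenerated)       # ['shot2']
```

Because untouched shots keep their stored results, an edit costs one regeneration rather than a full re-render of the story.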

Audio‑visual super‑resolution

A joint super‑resolution module treats up‑sampling as conditional generation: given low‑resolution video and coarse audio latents, a single diffusion step produces high‑resolution video and refined audio. Two scales are supported: 736×1280 → 1152×1920 (1K) and 736×1280 → 1472×2560 (2K), using the same architecture and distillation pipeline.

Performance evaluation

A benchmark of 100 stories (3,000 shots, 241 frames each at 25 fps) measures cross‑shot consistency, video quality, text alignment, and speech accuracy. A blind pairwise user study shows that JoyAI‑Echo outperforms Happy Oyster’s director mode and the short‑video specialist Wan 2.6, especially in audio quality and prompt adherence (>80% preference).
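The benchmark figures imply the length of each story. The short calculation below works it out, assuming the 3,000 shots are spread evenly across the 100 stories (the article does not state the distribution).

```python
stories = 100
shots = 3000
frames_per_shot = 241
fps = 25

shots_per_story = shots // stories                      # assumes an even split
seconds_per_shot = frames_per_shot / fps
seconds_per_story = shots_per_story * seconds_per_shot

print(shots_per_story)                  # 30
print(f"{seconds_per_shot:.2f}")        # 9.64
print(f"{seconds_per_story:.1f}")       # 289.2
print(f"{seconds_per_story / 60:.2f}")  # 4.82
```

Under that assumption each story runs just under five minutes, which matches the 5‑minute consistency horizon mentioned for the memory bank.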

Quantitative metrics: ViCLIP similarity 0.8026, Self‑CIDS 0.7793, and speech consistency 0.8129 (all the highest among compared methods); aesthetic quality 0.5679, imaging quality 0.7058, CLIP text‑alignment 0.2658, and speech content accuracy 0.8646. Compared to baselines, Self‑CIDS improves by 0.0302 and speech consistency by 0.0184.

Inference is reported to be up to 7.5× faster: an 8‑step student generates 720p video and audio latents, which are then passed through the super‑resolution module in a single forward pass, achieving real‑time generation of minute‑level videos.

Code and model weights are publicly released on Hugging Face and GitHub.
