How JD’s JoyStreamer Achieves Smooth Long‑Form, Free‑Form Digital Human Live Streams

This article details how JD's JoyStreamer and JoyStreamer-Flash models overcome weak text control, multimodal signal conflict, and identity drift to enable long-duration, free-state, real-time interactive digital-human video generation. The models surpass current SOTA systems in benchmark scores and reach 30 FPS inference for e-commerce live streaming.


At the 2026 GTC conference, the industry consensus was that AI is entering the agent era, yet most intelligent agents still lack a "flexible" physical embodiment. Existing digital-human models struggle with weak textual command control, multimodal signal conflicts, and limited generation length.

JD’s JoyStreamer and JoyStreamer‑Flash large models address these pain points. By improving textual command strength, resolving multimodal conflicts, and supporting long‑duration, free‑state, real‑time interaction, the models achieve performance that surpasses current SOTA solutions. The results are documented in arXiv papers (https://arxiv.org/pdf/2602.00702, https://arxiv.org/abs/2512.11423) and a technical homepage (https://joystreamer.github.io/).

The core technical innovations include a dual-teacher distribution-matching distillation (DMD) stage after pre-training. One teacher specializes in audio (lip sync and rhythm) while the other leverages a video foundation model for textual control. This separation lets the digital-human model inherit strong text controllability without requiring additional training data.
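
As a rough illustration of the idea, the sketch below combines distribution-matching gradients from two teacher score models. The function and model names (audio_teacher, text_teacher, fake_score) are hypothetical, and the noising step is deliberately simplified rather than taken from the paper.

```python
import torch

def dual_teacher_dmd_grad(x_gen, t, audio_teacher, text_teacher, fake_score,
                          audio_cond, text_cond, w_audio=0.5, w_text=0.5):
    """Hedged sketch of a dual-teacher DMD update (all names are assumptions).

    Each teacher supplies a 'real' score for the noised generated sample,
    while a jointly trained 'fake' score model tracks the student's own
    distribution; their difference is the distribution-matching gradient.
    """
    noise = torch.randn_like(x_gen)
    x_t = x_gen + t * noise  # simplified forward noising, not the paper's schedule

    with torch.no_grad():
        s_audio = audio_teacher(x_t, t, audio_cond)  # lip-sync/rhythm specialist
        s_text = text_teacher(x_t, t, text_cond)     # text-controllable video model
        s_fake = fake_score(x_t, t)                  # tracks the student distribution

    # Weighted sum of per-teacher DMD gradients applied to the student output.
    return w_audio * (s_fake - s_audio) + w_text * (s_fake - s_text)
```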

To mitigate multimodal control conflicts, the authors introduce a dynamic classifier-free guidance (CFG) modulation strategy. Analysis of diffusion video generation shows that coarse motion is formed in the early high-noise steps, while fine lip-sync detail emerges in the later low-noise steps. The model therefore prioritizes textual motion instructions early, then hands control to audio for precise lip synchronization, preventing signal interference.
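
A minimal sketch of such a schedule is below; the linear ramps and the specific guidance scales are illustrative assumptions, not values from the paper.

```python
def dynamic_cfg_scales(step, total_steps, text_max=7.5, audio_max=4.0):
    """Schedule guidance strengths over denoising steps (illustrative values).

    Early (high-noise) steps emphasize text guidance so coarse motion follows
    the instruction; later (low-noise) steps shift weight to audio guidance
    for precise lip sync.
    """
    progress = step / max(total_steps - 1, 1)  # 0.0 -> 1.0 across sampling
    text_scale = text_max * (1.0 - progress)   # strong early, fades out
    audio_scale = audio_max * progress         # weak early, ramps up
    return text_scale, audio_scale

def guided_noise_pred(eps_uncond, eps_text, eps_audio, step, total_steps):
    """Combine unconditional, text-, and audio-conditioned predictions."""
    s_text, s_audio = dynamic_cfg_scales(step, total_steps)
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)
            + s_audio * (eps_audio - eps_uncond))
```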

Long‑video generation faces the "identity drift" problem, where the avatar’s face or clothing changes over time. JoyStreamer solves this with a history‑frame encoding module (FramePack) combined with a pseudo‑last‑frame strategy: during inference, a reference image is repeatedly injected as a virtual last frame, anchoring the avatar’s identity and enabling stable generation of videos longer than 30 seconds.
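
The sketch below shows the shape of this trick, assuming a FramePack-style compressor pack_fn and a single reference image; all names here are hypothetical.

```python
import torch

def build_history_context(history_frames, reference_image, pack_fn):
    """Sketch of the pseudo-last-frame trick (names are assumptions).

    history_frames: recently generated frames, compressed by a FramePack-style
    encoder into a fixed-size context. reference_image is re-injected as a
    virtual 'last frame' on every chunk, so identity cues never decay over
    long rollouts.
    """
    packed_history = pack_fn(history_frames)    # compressed history tokens
    pseudo_last = reference_image.unsqueeze(0)  # treat reference as the last frame
    return torch.cat([packed_history, pseudo_last], dim=0)
```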

Subjective GSB evaluations, where GSB = (Good + Same) / (Bad + Same) over pairwise human judgments, compare JoyStreamer (Ours) with leading closed-source SOTA models; a score above 1.0 means JoyStreamer wins or ties more often than it loses or ties. JoyStreamer shows significant advantages in text compliance, lip-sync accuracy, ID retention, and video quality, achieving overall GSB scores of 1.36 (vs. omnihuman-1.5) and 1.73 (vs. KlingAvatar2.0).
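
For concreteness, a GSB score can be computed from pairwise vote counts as follows (the vote counts in the example are made up):

```python
def gsb_score(good, same, bad):
    """GSB = (Good + Same) / (Bad + Same); above 1.0 means the evaluated
    model is preferred more often than the baseline in pairwise judgments."""
    return (good + same) / (bad + same)

# Illustrative only: 50 wins, 30 ties, 20 losses -> GSB of 1.6.
print(gsb_score(good=50, same=30, bad=20))
```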

For inference speed, JoyStreamer-Flash converts the bidirectional diffusion model into a causal, autoregressive model using CausVid and Self-Forcing techniques, then applies 4-step sampling, a KV cache, and multi-GPU parallelism to reach 30 FPS generation.
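
A simplified view of this streaming loop, with an entirely hypothetical model API, might look like:

```python
def stream_video(model, audio_stream, text_cond, chunk_frames=8, num_steps=4):
    """Illustrative autoregressive rollout (the model API is an assumption).

    A causal (single-direction) student denoises each chunk in only 4 steps,
    reusing the attention KV cache from previous chunks instead of re-encoding
    the whole history, which is what makes 30 FPS streaming feasible.
    """
    kv_cache = model.init_cache()
    for audio_chunk in audio_stream:
        latents = model.sample_noise(chunk_frames)
        for step in range(num_steps):  # few-step distilled sampler
            latents = model.denoise(latents, step, audio_chunk, text_cond,
                                    cache=kv_cache)
        kv_cache = model.update_cache(kv_cache, latents)  # extend causal context
        yield model.decode(latents)    # emit frames immediately
```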

Additional innovations, such as progressive step guidance, motion-condition injection, and an infinite-RoPE cache reset, further enable real-time, unlimited-length, high-fidelity digital-human video generation with superior visual quality, temporal consistency, and lip sync.
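
The cache-reset idea, in particular, can be sketched as follows; the trimming policy and parameter names are assumptions, since the paper's exact mechanism isn't spelled out here.

```python
def maybe_reset_positions(kv_cache, next_pos, max_pos, keep_recent=256):
    """Sketch of a RoPE cache reset for unbounded streaming (an assumption).

    RoPE positions cannot grow forever without drifting outside the range
    seen in training, so once next_pos nears max_pos the cache is trimmed
    to the most recent entries and positions are re-indexed from zero.
    """
    if next_pos >= max_pos:
        kv_cache = kv_cache[-keep_recent:]  # keep only the recent context
        next_pos = keep_recent              # restart position indices
    return kv_cache, next_pos
```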

In e-commerce live-streaming scenarios, the new free-state digital humans can move naturally, perform complex object interactions, and maintain high-quality visuals, extending viewer dwell time. JD's platform offers the digital-human live-streaming capability free of charge, allowing merchants to configure custom avatars or clone real hosts from a single video input. A case study (NewXli) reported a 60% increase in public-domain traffic and an average viewer stay of nearly two minutes after adopting the "live-room cloning" feature.

Beyond the product, JD's AI strategy emphasizes efficiency over raw parameter count. The recently open-sourced JoyAI-LLM Flash model has 48B parameters but activates only 3B via dynamic sparse routing, consuming one-fifth the tokens of competing models while maintaining strong performance.
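
A minimal top-k sparse-routing sketch in PyTorch, not JoyAI-LLM Flash's actual implementation, shows how only a fraction of the parameters runs per token:

```python
import torch
import torch.nn.functional as F

def sparse_route(hidden, router_weights, experts, top_k=2):
    """Minimal top-k routing sketch; not JoyAI-LLM Flash's actual code.

    Only the top_k experts selected by the router run for each token, so a
    model with large total parameters activates only a small slice per pass.
    """
    logits = hidden @ router_weights                 # [tokens, num_experts]
    gates, idx = torch.topk(F.softmax(logits, dim=-1), top_k, dim=-1)
    out = torch.zeros_like(hidden)                   # assumes experts keep width
    for t in range(hidden.shape[0]):                 # naive per-token dispatch
        for g, e in zip(gates[t], idx[t]):
            out[t] += g * experts[int(e)](hidden[t])
    return out
```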

Future directions include enabling avatar outfit changes, richer multi‑anchor interactions, and eliminating hallucinations—challenges that JD’s team claims remain unsolved industry‑wide.

Tags: Real-time Streaming · Digital Human · Generative AI · E-commerce Live Streaming · JoyStreamer · Long Video Generation · Multimodal Control
Written by Machine Heart, a professional AI media and industry service platform.
