From Q&A to Real‑Time Seeing & Speaking: JD’s First Open‑Source JoyAI‑VL‑Interaction

JD’s open‑source JoyAI‑VL‑Interaction transforms large‑model AI from static question‑answering to continuous, on‑scene observation, proactive judgment, and real‑time response, offering agent delegation and achieving up to 87.9% win rate against leading video assistants in live benchmarks.

JD Cloud Developers

Jun 23, 2026

Current multimodal models mainly operate in a question‑answer (Q&A) fashion: users upload an image or video, ask a question, and the model replies. While sufficient for static analysis, this approach falls short when AI must act in the physical world, where timely, proactive responses are critical.

JoyAI‑VL‑Interaction: A Full‑Stack Open‑Source Solution

JoyAI‑VL‑Interaction, released by JD, is the world’s first fully open‑source interaction model and system that supports day‑0 native integration with vLLM‑Omni. It enables large models to move from “Q&A” to “see‑and‑speak” by continuously observing video streams, making autonomous judgments about when to speak or stay silent, and delegating complex tasks to a backend agent.

Three Core Breakthroughs

1. Proactive Judgment, Not Passive Answering – Traditional models wait for a user query before processing the current frame. JoyAI‑VL‑Interaction watches the video continuously and decides on its own when to speak and when to stay silent. For example, if a user sets the rule “alert me when the referee shows a red card,” the model monitors the match and issues a warning the moment the card appears, without waiting for a follow‑up question.

2. Real‑Time Response, Not Post‑Event Summarization – Conventional video understanding processes a completed video, which introduces latency unsuitable for security alerts, live translation, or interactive guidance. JoyAI‑VL‑Interaction operates on live streams, reacting immediately to frame‑level changes.

3. Intelligent Agent Delegation While Maintaining Observation – When encountering tasks such as code generation, tool use, or complex reasoning, the front‑end model hands the request to a backend large model or agent, continues to monitor the scene, and seamlessly resumes the dialogue once the backend returns results. This creates a “front‑end real‑time assistant + back‑end intelligent brain” collaboration.
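The three behaviors above can be sketched as a single event loop. This is only an illustration under stated assumptions, not JoyAI‑VL‑Interaction's actual API: the frame strings, `decide`, `needs_delegation`, and `backend_agent` are hypothetical stand‑ins, and a hard‑coded rule check takes the place of the model's learned judgment.

```python
import asyncio
from typing import Optional

# Hypothetical stand-in for a live video stream: each string is one "frame".
FRAMES = ["kickoff", "pass", "user asks: write a script", "red card", "pass", "pass"]


def decide(frame: str) -> Optional[str]:
    """Return something to say, or None to stay silent (toy rule, not a model)."""
    if frame == "red card":
        return "Alert: the referee showed a red card."
    return None


def needs_delegation(frame: str) -> bool:
    """Toy check for requests the front end should hand to a backend."""
    return frame.startswith("user asks:")


async def backend_agent(request: str) -> str:
    """Hypothetical slow backend (code generation, tool use, reasoning)."""
    await asyncio.sleep(0.03)
    return f"Backend result for '{request}'"


async def run(frames: list[str], frame_interval: float = 0.01) -> list[str]:
    spoken: list[str] = []
    pending: list[asyncio.Task] = []  # backend requests still in flight
    for frame in frames:
        # Resume the dialogue for any backend task that has finished.
        for task in [t for t in pending if t.done()]:
            spoken.append(task.result())
            pending.remove(task)
        if needs_delegation(frame):
            # Hand off without blocking; the loop keeps observing frames.
            pending.append(asyncio.create_task(backend_agent(frame)))
            spoken.append("Working on that; I'll keep watching.")
        else:
            utterance = decide(frame)
            if utterance is not None:  # None means: stay silent
                spoken.append(utterance)
        await asyncio.sleep(frame_interval)  # wait for the next frame
    # Stream ended: deliver whatever the backend still owes.
    for task in pending:
        spoken.append(await task)
    return spoken


transcript = asyncio.run(run(FRAMES))
```

Because the backend call runs as a separate task, the red‑card alert is spoken while the delegated request is still in flight, and the backend result follows once it is ready.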

System Architecture and Extensibility

The open‑source release includes model weights, interaction datasets, training recipes, and a deployable system. It supports various video inputs (camera, live stream, surveillance), audio I/O, visual interfaces, long‑term memory, and interchangeable ASR/TTS, backend model, and tool modules. Developers can replace any component with their own services or APIs.
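One common way to make components replaceable is to code against small interfaces. The sketch below assumes hypothetical `SpeechToText`, `TextToSpeech`, and `Backend` interfaces with toy implementations; the names and method signatures are illustrative only and are not taken from the JoyAI‑VL‑Interaction codebase.

```python
from dataclasses import dataclass
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class Backend(Protocol):
    def answer(self, prompt: str) -> str: ...


@dataclass
class Assistant:
    """Wires the modules together; any object with matching methods fits."""
    asr: SpeechToText
    tts: TextToSpeech
    backend: Backend

    def handle(self, audio: bytes) -> bytes:
        text = self.asr.transcribe(audio)
        reply = self.backend.answer(text)
        return self.tts.synthesize(reply)


# Toy implementations standing in for real services or APIs.
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class UpperBackend:
    def answer(self, prompt: str) -> str:
        return prompt.upper()


class ReverseBackend:
    def answer(self, prompt: str) -> str:
        return prompt[::-1]


assistant = Assistant(asr=EchoASR(), tts=BytesTTS(), backend=UpperBackend())
first = assistant.handle(b"hello")    # b"HELLO"

assistant.backend = ReverseBackend()  # swap one module, leave the rest alone
second = assistant.handle(b"hello")   # b"olleh"
```

Swapping the backend changes the reply without touching the ASR or TTS code, which is the property the modular design is meant to provide.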

Benchmark Results

In evaluations covering 58 real‑world blind‑review cases (monitoring alerts, real‑time counting, translation, time‑aware narration, and live guide), the model achieved a 77.6% overall win rate against the Doubao video‑call assistant and 87.9% against Gemini’s video‑call assistant. In the monitoring‑alert scenario, JoyAI‑VL‑Interaction attained a 100% win rate, which JD presents as evidence of the advantage of intrinsic proactive interaction over external rule‑based triggers.
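As a quick sanity check on the reported percentages, the sketch below finds which whole‑number win counts out of 58 cases round to each figure. It assumes the win rate is simply wins divided by total cases, which the source does not state; the resulting counts are an inference from the arithmetic, not reported data, and `consistent_win_counts` is a hypothetical helper.

```python
def consistent_win_counts(total_cases: int, reported_pct: float,
                          tol: float = 0.05) -> list[int]:
    """Win counts whose percentage matches the reported value to one decimal place."""
    return [
        wins
        for wins in range(total_cases + 1)
        if abs(100 * wins / total_cases - reported_pct) < tol
    ]


vs_doubao = consistent_win_counts(58, 77.6)  # [45], since 45/58 is about 77.59%
vs_gemini = consistent_win_counts(58, 87.9)  # [51], since 51/58 is about 87.93%
```

Under that assumption, both figures correspond to a whole number of cases (45 and 51 of 58), so the two percentages are consistent with the stated sample size.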

Use Cases and Impact

Beyond research, the framework can be adapted for safety monitoring, elderly care, live commentary, e‑commerce guidance, operational assistance, AI glasses, and accessibility tools. By providing a complete stack rather than a single model, it lowers the engineering barrier for deploying real‑time AI assistants in physical environments.

Related Resources

Code repository: https://github.com/jd-opensource/JoyAI-VL-Interaction

Model checkpoint: https://huggingface.co/jdopensource/JoyAI-VL-Interaction-Preview

Dataset: https://huggingface.co/datasets/jdopensource/JoyAI-VL-Interaction
