JD Donates Oxygen xLLM Large‑Model Inference Engine to OpenAtom Foundation to Boost Domestic AI Infra
JD donated its self‑developed Oxygen xLLM large‑model inference engine to the OpenAtom Open Source Foundation under Apache 2.0, highlighting its service‑engine decoupled architecture, heterogeneous‑chip support, proven performance gains in e‑commerce, power and public‑safety use cases, and a roadmap to become the domestic AI‑infra standard.
Background and Vision
On 2026‑06‑25 JD donated its self‑developed large‑model inference engine Oxygen xLLM to the OpenAtom Open Source Foundation under the Apache 2.0 license, transferring copyright, related patents, trademark and associated rights.
Zhang Ke, head of AI Infra & Big Data Computing at JD, described the next stage of AI infrastructure as Engineering Intelligence (EI): scheduling systems autonomously perceive workload characteristics, inference engines automatically generate optimal execution plans from model structure and hardware traits, and the entire AI pipeline becomes self‑aware, self‑deciding and self‑optimising.
Problem Space
Production‑grade large‑model deployment faces three core challenges:
SLO compliance and resource efficiency are hard to achieve at the same time.
Hardware potential is not fully exploited.
Heterogeneous scenarios are hard to coordinate: tidal traffic patterns, evolving MoE models, and multi‑model, multi‑chip environments.
Architecture – Service‑Engine Decoupling
JD presents Oxygen xLLM as the first framework to separate the Service layer (xLLM‑Service) from the Engine layer (xLLM‑Engine).
Service Layer (xLLM‑Service)
Unified elastic scheduling for online and offline tasks, balancing SLO guarantees with cluster utilisation.
Dynamic PD (prefill/decode) separation to absorb traffic spikes.
Global KV cache and rapid fault recovery to ensure large‑scale production availability.
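The dynamic PD idea in the list above can be sketched in a few lines of Python. This is a minimal illustration, not xLLM's actual scheduler: it assumes PD refers to the prefill and decode stages, approximates load as queued requests per instance, and flips one instance between pools when one stage is far busier than the other. `Instance`, `rebalance` and the `threshold` parameter are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    role: str  # "prefill" or "decode" (hypothetical role labels)


def rebalance(instances, prefill_queue, decode_queue, threshold=2.0):
    """Flip one instance's role when one stage is far more loaded than the other.

    Load is approximated as queued requests per instance in each pool.
    """
    prefill = [i for i in instances if i.role == "prefill"]
    decode = [i for i in instances if i.role == "decode"]
    p_load = prefill_queue / max(len(prefill), 1)
    d_load = decode_queue / max(len(decode), 1)

    # Prompt spike: lend a decode instance to prefill, keeping at least one decoder.
    if p_load > threshold * max(d_load, 1e-9) and len(decode) > 1:
        decode[0].role = "prefill"
    # Generation backlog: lend a prefill instance to decode, keeping at least one.
    elif d_load > threshold * max(p_load, 1e-9) and len(prefill) > 1:
        prefill[0].role = "decode"
    return instances


pool = [
    Instance("n0", "prefill"),
    Instance("n1", "decode"),
    Instance("n2", "decode"),
    Instance("n3", "decode"),
]
rebalance(pool, prefill_queue=40, decode_queue=6)
print([i.role for i in pool])
```

A production scheduler would also weigh SLO targets, KV cache migration cost and hysteresis to avoid flapping; the sketch only shows the role-switching decision.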
Engine Layer (xLLM‑Engine)
Multi‑level pipelines that fully overlap computation and communication.
Adaptive graph mode with efficient memory management to handle dynamic inputs and GPU memory allocation.
Specialised optimisations for MoE, speculative decoding, generative recommendation and other scenarios, unlocking hardware potential.
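Of the optimisations listed above, speculative decoding is the easiest to show in isolation. The following is a generic greedy sketch of the technique, not code from xLLM-Engine: a cheap draft model proposes `k` tokens, the target model checks them, and the longest matching prefix is kept. `speculative_step`, `target_next` and `draft_next` are hypothetical names, and the two "models" are stand-in functions.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One greedy speculative-decoding step.

    target_next / draft_next map a token list to the next token (greedy choice).
    Returns the tokens accepted in this step.
    """
    # The draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # The target model verifies each position. A real engine does this in one
    # batched forward pass; the loop here keeps the logic explicit.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)

    # All k draft tokens matched, so the target contributes one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy models: the draft agrees with the target only for short contexts.
def toy_target(ctx):
    return len(ctx) % 5


def toy_draft(ctx):
    return len(ctx) % 5 if len(ctx) < 3 else 9


print(speculative_step(toy_target, toy_draft, [0], k=4))
```

Because every accepted token is either confirmed or supplied by the target model, the output matches what the target would produce on its own under greedy decoding; the gain comes from verifying several positions per target pass.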
The framework natively supports GPU, NPU and MLU. A unified inference abstraction layer masks hardware and model differences, enabling LLM, VLM, DiT, text‑to‑image/video and generative‑recommendation models to run on a mix of domestic chips.
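One common way to build such an abstraction layer is a backend interface plus a registry, so model code never names a specific chip. The sketch below illustrates that pattern under stated assumptions; it is not xLLM's real interface. `Backend`, `CpuReferenceBackend`, `register`, `get_backend` and `linear_layer` are hypothetical names, and a pure-Python matrix multiply stands in for a vendor kernel.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical device backend; one subclass per chip family."""

    @abstractmethod
    def matmul(self, a, b):
        """Multiply two matrices given as lists of rows."""


class CpuReferenceBackend(Backend):
    """Pure-Python stand-in for a vendor kernel library."""

    def matmul(self, a, b):
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


_REGISTRY = {}


def register(device, backend):
    """Associate a device name such as "gpu", "npu" or "mlu" with a backend."""
    _REGISTRY[device] = backend


def get_backend(device):
    try:
        return _REGISTRY[device]
    except KeyError:
        raise ValueError(f"no backend registered for device {device!r}") from None


def linear_layer(x, w, device):
    """Model code calls the abstraction and never touches device specifics."""
    return get_backend(device).matmul(x, w)


register("cpu", CpuReferenceBackend())
print(linear_layer([[1, 2]], [[3], [4]], "cpu"))
```

Supporting another chip then means adding one `Backend` subclass and one `register` call, while the model-side function stays unchanged.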
Technical Highlights
Architectural Innovation – Service‑engine decoupling allows independent evolution of scheduling and compute while delivering synergistic performance gains.
Performance Breakthrough – Multi‑level pipelines, adaptive graph mode and dynamic PD separation improve throughput and resource utilisation under strict SLO constraints; JD reports that the results surpass existing state‑of‑the‑art inference frameworks.
Heterogeneous Unification – The abstraction layer supports LLM, VLM, DiT, multimodal generation and recommendation models across multiple domestic chips, filling the “heterogeneous chip unified inference” gap.
High‑Availability Guarantees – Global KV cache management, distributed fast fault recovery, health monitoring and automatic inspection protect stable large‑scale production.
Domestic‑Chip Adaptation – A single framework covers a variety of domestic chips, lowering the barrier to deployment on domestically produced hardware.
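The global KV cache mentioned under high availability is often paired with cache-aware routing: a request goes to the instance that already holds the longest prefix of its prompt, so less prefill work is repeated. The sketch below shows that idea only; the article does not describe xLLM's routing policy, and `block_keys`, `pick_instance` and the `index` structure are hypothetical.

```python
def block_keys(tokens, block_size=4):
    """Keys for each full block; every key covers the whole prefix up to that block."""
    return [tuple(tokens[:end]) for end in range(block_size, len(tokens) + 1, block_size)]


def pick_instance(tokens, index, block_size=4):
    """Return (instance, reusable_token_count) for the best prefix hit.

    index maps an instance name to the set of prefix keys it has cached.
    Returns (None, 0) when no instance holds any prefix of the prompt.
    """
    best, best_hit = None, 0
    keys = block_keys(tokens, block_size)
    for name, cached in index.items():
        hit = 0
        for key in keys:
            if key not in cached:
                break  # prefixes must be contiguous from the start of the prompt
            hit = len(key)
        if hit > best_hit:
            best, best_hit = name, hit
    return best, best_hit


prompt = list(range(10))
index = {
    "a": {tuple(range(4))},
    "b": {tuple(range(4)), tuple(range(8))},
}
print(pick_instance(prompt, index))
```

A real system would hash blocks rather than store token tuples and would balance cache hits against instance load.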
Industrial‑Scale Validation
Real‑world deployments demonstrate concrete gains:
E‑commerce customer‑service models: cluster utilisation increased by >35 %, P99 latency reduced by 28 % during peak traffic.
Power‑inspection workloads: inspection efficiency grew ~3×, outage rate fell 30 %, emergency‑repair efficiency improved 20 %.
Public‑safety edge inference: inspection speed rose 227 %, concurrent capacity improved 127 %, time‑to‑first‑token (TTFT) cut by 50 %.
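The percentages above mix increases and reductions, which are easy to misread. The short calculation below converts them to multiples of the baseline, assuming each figure is a change relative to the pre-deployment value (the article does not state the baseline explicitly). The helper names are hypothetical.

```python
def multiplier_from_increase(percent):
    """A rise of `percent` % gives a new value of (1 + percent / 100) times the baseline."""
    return 1 + percent / 100


def multiplier_from_reduction(percent):
    """A cut of `percent` % gives a new value of (1 - percent / 100) times the baseline."""
    return 1 - percent / 100


figures = {
    "inspection speed (+227 %)": multiplier_from_increase(227),
    "concurrent capacity (+127 %)": multiplier_from_increase(127),
    "TTFT (-50 %)": multiplier_from_reduction(50),
    "P99 latency (-28 %)": multiplier_from_reduction(28),
}
for label, m in figures.items():
    print(f"{label}: {m:.2f}x baseline")
```

Under that reading, a 227 % rise means roughly 3.27 times the original inspection speed, and a 50 % TTFT cut means the first token arrives in half the time.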
Community Adoption
Since open‑sourcing, the project has attracted >1.4k GitHub stars and 235 forks. Major domestic chip vendors and large‑model providers have become core contributors and sponsors.
Roadmap and Ecosystem Plan
After joining OpenAtom, the project will focus on three areas:
Co‑building an ecosystem with model, chip and cloud partners to create a “chip + framework + solution” stack.
Promoting the proven capabilities to additional industries.
Driving the creation of standards for large‑model inference engines to accelerate domestic adoption.
Planned milestones:
By 2026: full multimodal support (text‑to‑image/video, Omni), comprehensive adaptation to mainstream domestic chips, launch of an enterprise‑grade commercial service, and expansion of the contributor community to ~200 developers.
From 2027 onward: deep industry penetration and establishment of Oxygen xLLM as the de‑facto standard for domestic‑chip model inference.
Repository
GitHub: https://github.com/jd-opensource/xllm
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
