MLOps Implementation in Network Intelligence: Jiutian Platform Overview
This article presents the MLOps implementation of China Mobile's Jiutian Network Intelligence platform, detailing its AI engineering workflow, functional and technical architecture, technology selections, model deployment and monitoring, and operational challenges. It also shares insights on scaling AI services across 31 provinces.
Guide: This talk, titled "MLOps in Network Intelligence Field Implementation", covers the following topics:
Jiutian Network Intelligence Platform
AI Engineering Practice
MLOps Technology Selection
MLOps Reflections
Q&A
Speaker: Liang Xiaoyang, CEO of Jiutian Network Intelligence Platform, China Mobile Communications Group Co., Ltd.
Editor: Xu Jianfeng
Proofreader: Li Yao
Community: DataFun
1. Systematic AI Core Engine: Jiutian
China Mobile has established a dedicated AI team and built a comprehensive Jiutian AI product system, releasing eight platform products such as Jiutian Deep Learning Platform and Jiutian AI Capability Platform. The Deep Learning Platform encapsulates deep learning frameworks for modeling, while the AI Capability Platform provides generic AI services. Other products, like the Jiutian Network Intelligence Platform, are built on these foundations for specific domains (e.g., education, TV recommendation).
Based on these eight platform products, the Jiutian team has released over 300 core capabilities serving billions of users across more than 40 large‑scale applications, delivering value exceeding 4 billion RMB.
2. Network Intelligence
Background for network intelligence research at China Mobile includes:
Being the world’s largest mobile network operator.
Over 100,000 O&M personnel, leading to high operational costs.
5G network slicing driving specialized, personalized services.
“1+31” cloud‑edge requirements due to varying infrastructure and data security needs across provinces.
3. Jiutian Network Intelligence Platform
To address these needs, research began in 2017 with intelligent positioning, expanding to fault handling, optimization, scheduling, and knowledge graphs, resulting in over 20 AI capabilities. A capability store was built to allow AI services developed in one region to be subscribed elsewhere. The platform was piloted in Zhejiang in 2020 and commercialized in 2021, now supporting 31 provinces.
Challenges include fragmented development roles (algorithm, backend, frontend, ops) leading to long development cycles and lack of end‑to‑end responsibility.
4. Platform Functional Architecture
The platform adopts a DevOps‑inspired MLOps approach, consisting of a foundational layer (model training, inference, data labeling) and two logical layers: the platform base (Jiutian foundation) and platform services.
Data Factory: data security, governance, dataset management.
Capability Factory: capability development, inference, management.
AI Integrated Applications: application development, capability orchestration.
Support Services: data operations, user center.
Four main platform services:
Data Service – simplifies data usage across the nation.
Training Service – accelerates capability development via offline training + online refinement.
Inference Service – ensures efficient, stable calls and optimal resource utilization.
Management Service – handles capability deployment, promotion, and operation across regions.
5. Technical Architecture
The platform is built on a cloud‑native + Spring Cloud stack, layered from bottom to top as cloud platform, resource management, infrastructure, platform services, business services, and microservice management/user interface.
Cloud Platform – virtual machines, storage, high‑speed networking, CPU/GPU virtualization.
Resource Management – Kubernetes clusters, Pangu platform, Jiutian capability platform, exposing AI services via SPI.
Infrastructure – MySQL, MongoDB, Redis, Elasticsearch, etc.
Platform Services – authentication, ticketing, Web IDE, terminal, online testing, etc.
Business Services – lifecycle management (capability center, deployment, model, data, drift detection, inference feedback).
Microservice Management/User Interface – Nacos for service registration/discovery and gateway for request routing.
The platform follows DDD, separating capability (static) and instance (runtime) domains.
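The DDD split between the static capability domain and the runtime instance domain can be sketched with two small data classes; the field names (framework, province, endpoint) are hypothetical illustrations, not the platform's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict

@dataclass(frozen=True)
class Capability:
    """Static domain: what a capability *is* (immutable metadata)."""
    name: str
    version: str
    framework: str                     # e.g. "onnx", "tensorflow"
    resource_profile: Dict[str, str]   # declared CPU/GPU/memory needs

@dataclass
class CapabilityInstance:
    """Runtime domain: one deployed copy of a capability in a province."""
    capability: Capability
    province: str
    endpoint: str
    started_at: datetime = field(default_factory=datetime.now)
    healthy: bool = True

cap = Capability("fault-locator", "1.2.0", "onnx", {"cpu": "4", "gpu": "1"})
inst = CapabilityInstance(cap, province="Zhejiang", endpoint="http://edge-zj/infer")
```

Keeping the capability record immutable while instances carry mutable runtime state mirrors the static/runtime boundary the architecture describes.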
6. Deployment Architecture
The platform uses a cloud‑edge collaborative deployment: a central node manages capability registration, while provincial edge clusters host the services. Users obtain access tokens via edge gateways, and operational data is aggregated to the central node for unified monitoring.
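The token step can be sketched as an HMAC-signed token that an edge gateway issues and later verifies; the secret, claim names, and helper functions below are hypothetical, not the platform's actual auth scheme:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"edge-gateway-secret"  # hypothetical shared secret per edge gateway

def issue_token(user: str, ttl: int = 3600) -> str:
    """Issue a signed token: base64(claims) + '.' + HMAC-SHA256 signature."""
    payload = json.dumps({"user": user, "exp": time.time() + ttl}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig

def verify_token(token: str) -> bool:
    """Check the signature and expiry before routing a capability call."""
    body, sig = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(body)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return json.loads(payload)["exp"] > time.time()
```

Because verification needs only the shared secret, any gateway replica can validate a token, which fits the stateless-service requirement mentioned later in the Q&A.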
7. AI Engineering Exploration
Key challenges in AI engineering include long development cycles (>3 months), siloed knowledge among roles, model non‑generality across provinces, delayed model optimization, and inconsistent cloud‑edge model versions.
Solutions involve adopting NVIDIA Triton for one‑click model publishing, internal training for cross‑role knowledge sharing, online model updates, and establishing operational standards for cloud‑edge consistency.
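Triton's one-click publishing relies on its standard model repository layout: a versioned directory per model plus a `config.pbtxt`. A minimal sketch (model name, tensor names, and shapes are illustrative, not the platform's actual models):

```
# Layout of a Triton model repository (names illustrative):
#   model_repository/fault_locator/config.pbtxt
#   model_repository/fault_locator/1/model.onnx
#
# config.pbtxt:
name: "fault_locator"
platform: "onnxruntime_onnx"
max_batch_size: 8
input  [ { name: "INPUT0",  data_type: TYPE_FP32, dims: [ 64 ] } ]
output [ { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ 2 ] } ]
```

Publishing then amounts to dropping a new numbered version directory into the repository, which Triton can pick up without restarting the server.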
Additional explorations:
Front‑end for Triton model conversion, providing a graphical interface to hide framework differences.
Lightweight custom model packaging tool replacing heavy Triton usage.
Resource‑aware scheduling by profiling algorithm resource needs (CPU/GPU, memory) and matching them with node metadata via Nacos.
Separating inference code from model mounts on shared storage to enable province‑wide model reuse without downtime.
Regular cross‑team knowledge exchanges and coding standards to improve collaboration.
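The resource-aware scheduling exploration above can be sketched as a matching step over node metadata, of the kind a registry like Nacos can expose per instance; the metadata keys and greedy tie-break rule are hypothetical:

```python
def fits(required: dict, node_meta: dict) -> bool:
    """A node fits if it satisfies every declared resource need."""
    return all(float(node_meta.get(k, 0)) >= v for k, v in required.items())

def pick_node(required: dict, nodes: list):
    """Choose a fitting node, preferring the most free GPU, then CPU."""
    candidates = [n for n in nodes if fits(required, n["metadata"])]
    if not candidates:
        return None
    return max(candidates, key=lambda n: (float(n["metadata"].get("free_gpu", 0)),
                                          float(n["metadata"].get("free_cpu", 0))))

nodes = [
    {"name": "edge-a", "metadata": {"free_cpu": "8",  "free_gpu": "0", "free_mem_gb": "32"}},
    {"name": "edge-b", "metadata": {"free_cpu": "16", "free_gpu": "2", "free_mem_gb": "64"}},
]
need = {"free_cpu": 4, "free_gpu": 1, "free_mem_gb": 16}
```

A real scheduler would also account for co-location, current load, and reservation, but the core idea is this profile-to-metadata match.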
8. Path to Production with MLOps
MLOps standardizes the end‑to‑end pipeline: demand management, data engineering, model development, delivery, and operation. The platform follows the industry‑wide MLOps workflow.
9. MLOps Technology Selection
Decision points include open‑source vs. proprietary, platform vs. specific tools, monitoring, orchestration, training/debugging, and drift management.
Model packaging options considered:
AWS SageMaker – mature drift detection, training, version control.
Kubeflow – simplifies ML workflows on Kubernetes.
Seldon Core – unified deployment for multiple frameworks but heavyweight.
Chosen solution: custom Jiutian packaging + capability framework + pipeline services + data annotation + whylogs for monitoring.
Model monitoring/drift tools evaluated:
Alibi Detect – powerful but requires Seldon Core.
Evidently – lightweight, framework‑agnostic Python library.
whylogs – lightweight statistical drift detection, selected for AI engineering.
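whylogs works by computing lightweight statistical profiles of inputs and outputs and comparing them over time. As a framework-agnostic illustration of the kind of drift signal involved (this is a generic Population Stability Index, not whylogs' own API):

```python
import math
from collections import Counter

def psi(baseline, current, bins: int = 10) -> float:
    """Population Stability Index between two numeric samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift."""
    lo = min(min(baseline), min(current))
    hi = max(max(baseline), max(current))
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        # Smoothed bin frequencies so log() never sees a zero.
        c = Counter(min(int((x - lo) / width), bins - 1) for x in xs)
        return [(c.get(i, 0) + 1e-6) / len(xs) for i in range(bins)]
    b, c = hist(baseline), hist(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

train = [0.1 * i for i in range(100)]         # training-time distribution
same  = [0.1 * i + 0.01 for i in range(100)]  # near-identical production data
shift = [0.1 * i + 5.0 for i in range(100)]   # shifted production data
```

In production, a check like this would run on the aggregated inference logs and trigger the retraining loop when the score crosses a threshold.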
10. Final Technical Selection Workflow
Develop models in a custom Web IDE, package via the capability management service, expose as services through the platform gateway, collect runtime logs (inputs, outputs, metrics), store logs in a secure data sharing platform, fuse with user feedback, and continuously improve models via annotation and retraining.
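The step that fuses runtime logs with user feedback can be sketched as a join on request IDs, producing labeled records for annotation and retraining; the record fields here are hypothetical:

```python
def fuse(inference_logs: list, feedback: list) -> list:
    """Join inference logs with user feedback on request_id, yielding
    labeled examples that can feed the annotation/retraining loop."""
    fb = {f["request_id"]: f["correct_label"] for f in feedback}
    return [
        {**log, "label": fb[log["request_id"]]}
        for log in inference_logs
        if log["request_id"] in fb
    ]

logs = [
    {"request_id": "r1", "input": [0.2, 0.8], "prediction": "fault"},
    {"request_id": "r2", "input": [0.9, 0.1], "prediction": "normal"},
]
feedback = [{"request_id": "r2", "correct_label": "fault"}]
```

Records where the feedback label disagrees with the prediction (like `r2` above) are the most valuable retraining candidates.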
11. MLOps Reflections
Key considerations: the pipeline must deliver capabilities from test to production without manual steps; pre‑existing manual processes should be streamlined before MLOps adoption.
12. Q&A
Q: How does the platform handle traffic spikes with MLOps‑enabled capabilities?
A: Monitoring includes inference and overall capability metrics; the gateway performs rate‑limiting and alerts, after which engineers trigger elastic scaling. MLOps‑enabled services must be stateless to support scaling.
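Gateway rate-limiting of the kind described is commonly implemented as a token bucket; a minimal sketch (the rate and capacity values are illustrative):

```python
import time

class TokenBucket:
    """Gateway-side rate limiter: refill `rate` tokens/sec up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Admit a request if a token is available, else reject (rate-limit)."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Rejections past the bucket capacity are what would raise the alert that prompts engineers to trigger elastic scaling; because the limiter lives in the gateway, the capability services themselves stay stateless.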
Thank you for attending.