Cost Optimization and Mixed‑Resource Deployment in Tencent's Taiji Machine Learning Platform
The article details how Tencent's Taiji machine‑learning platform reduces training costs and improves efficiency for large‑scale advertising models by leveraging cloud‑native mixed‑resource strategies—including online idle, offline elastic, and compute‑resource sharing—while maintaining high service stability through advanced scheduling, fault‑tolerance, and resource‑prediction techniques.
In recent years, large‑scale models have become the standard paradigm for AI modeling, driving massive parameter counts and resource demands in advertising, search, and recommendation scenarios. Tencent Advertising has built two trillion‑parameter models, the "HunYuan AI" large model and an advertising‑specific model, both of which are deployed on the underlying Taiji machine‑learning platform.
The Taiji platform provides end‑to‑end support for feature processing, model training, and serving, and now incorporates cost‑reduction measures that deliver 500,000 low‑cost mixed‑deployment cores daily, cutting offline training costs by 30% while keeping stability comparable to dedicated resources.
Training on Taiji supports both CPU and GPU modes, using custom operators, mixed‑precision, and 3D parallelism to achieve an order‑of‑magnitude speedup over open‑source systems. Inference is powered by the self‑developed Heterogeneous Computing Framework (HCF), which optimizes performance across hardware, compiler, and software layers.
Cost optimization is achieved through mixed resources supplied by "FengLuan," a cloud‑native big‑data platform that provides three types of mixed resources: online idle resources, elastic offline resources, and low‑priority compute resources. These resources are abstracted as virtual clusters, shielding downstream services from the underlying heterogeneity.
Online idle resources are harvested from under‑utilized machines during off‑peak periods; elastic offline resources are borrowed during low‑load daytime windows and returned before peak hours; compute resources are provided as low‑priority CVMs that can be pre‑empted by higher‑priority workloads.
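The virtual‑cluster abstraction described above can be sketched as a small model in which heterogeneous mixed resources are pooled behind one uniform interface. This is an illustrative sketch, not FengLuan's actual API; the class and field names (`VirtualCluster`, `NodeCapacity`, `MixedResourceType`) are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum

class MixedResourceType(Enum):
    ONLINE_IDLE = "online_idle"                # harvested from under-utilized online machines
    ELASTIC_OFFLINE = "elastic_offline"        # borrowed off-peak, returned before peak hours
    LOW_PRIORITY_COMPUTE = "low_priority_compute"  # preemptible low-priority CVMs

@dataclass
class NodeCapacity:
    cores: int
    resource_type: MixedResourceType

class VirtualCluster:
    """Presents heterogeneous mixed resources as one uniform pool."""

    def __init__(self) -> None:
        self._nodes: list[NodeCapacity] = []

    def add_node(self, node: NodeCapacity) -> None:
        self._nodes.append(node)

    def total_cores(self) -> int:
        # Downstream services see only aggregate capacity,
        # not which kind of mixed resource backs each node.
        return sum(n.cores for n in self._nodes)
```

The point of the abstraction is that training jobs schedule against `total_cores()` without ever branching on `resource_type`.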
The Caelus mixed‑deployment system ensures quality of service for both online and offline jobs by detecting interference, isolating resources, and applying flexible eviction policies. It also supports hot migration of pods to minimize disruption during pre‑emptions.
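A flexible eviction policy of the kind Caelus applies might, as one plausible sketch, prefer to displace offline pods that support hot migration (they can be moved rather than killed) and, among the rest, evict lower‑priority pods first. The `OfflinePod` structure and ordering rule here are assumptions for illustration, not Caelus's real implementation.

```python
from dataclasses import dataclass

@dataclass
class OfflinePod:
    name: str
    priority: int                 # lower value = evicted sooner
    supports_hot_migration: bool  # migratable pods cause the least disruption

def eviction_order(pods: list[OfflinePod]) -> list[OfflinePod]:
    """Order offline pods for displacement when online QoS is at risk.

    Hot-migratable pods come first (they are moved, not killed),
    then the remaining pods in ascending priority.
    """
    return sorted(pods, key=lambda p: (not p.supports_hot_migration, p.priority))
```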
To address the instability of low‑priority compute resources, Taiji employs resource profiling, predictive scheduling, city‑level and single‑node optimizations, hierarchical resource tagging, and dynamic parameter tuning, which together improve job performance by more than twofold.
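Predictive scheduling over unstable low‑priority resources can be illustrated by scoring candidate nodes on predicted stability as well as free capacity, so jobs land where they are least likely to be pre‑empted. The weighting and the function itself are hypothetical, a minimal sketch of the idea rather than Taiji's actual scheduler.

```python
def node_score(predicted_preemption_prob: float,
               free_cores: int,
               total_cores: int) -> float:
    """Score a candidate node for placement; higher is better.

    Weights predicted stability (from resource profiling) more heavily
    than raw free capacity -- the weights here are illustrative.
    """
    stability = 1.0 - predicted_preemption_prob
    capacity = free_cores / total_cores
    return 0.7 * stability + 0.3 * capacity
```

Under this scoring, a half‑empty but stable node beats a fully free node that is likely to be reclaimed soon.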
At the application layer, a three‑tier fault‑tolerance strategy—hot migration, TaskManager restart, and full job recovery—raises the stability of jobs running on mixed resources from below 90% to over 99.5%, matching that of dedicated resources.
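The three‑tier escalation above can be sketched as a dispatcher that always tries the cheapest remedy first: hot migration when the platform pre‑notifies a reclaim, a TaskManager restart when only one worker fails, and full job recovery otherwise. The signature and the `failed_component` values are assumptions for this sketch, not the platform's real interface.

```python
def choose_recovery(pre_notified: bool, failed_component: str) -> str:
    """Pick the least disruptive of the three fault-tolerance tiers."""
    if pre_notified:
        # The node will be reclaimed but is still alive: move pods off it.
        return "hot_migration"
    if failed_component == "task_manager":
        # Only one worker died: restart it instead of the whole job.
        return "task_manager_restart"
    # Last resort: restore the entire job from its latest checkpoint.
    return "full_job_recovery"
```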
In practice, the mixed‑deployment solution supplies 300,000 cores and 200,000 tide‑resource cores daily to Tencent Advertising, reducing resource cost to 70% of ordinary provisioning while maintaining system stability. Future work will expand mixed compute usage, including GPU resources for offline training.
Tencent Advertising Technology
Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.