Tagged articles
100 articles
Alibaba Cloud Infrastructure
Mar 13, 2026 · Cloud Native

Boosting Autonomous Driving Data Pipelines with Koordinator’s ElasticQuota and GPU Sharing

This article details how a leading autonomous‑driving company tackled multi‑tenant resource contention, low GPU utilization, and distributed task deadlocks on a heterogeneous Kubernetes cluster by adopting Koordinator’s ElasticQuota, Reservation, Gang, and Device‑Share features, achieving higher allocation rates, better fairness, and significantly improved GPU efficiency.

ElasticQuota, GPU Sharing, Koordinator
20 min read
dbaplus Community
Feb 9, 2026 · Artificial Intelligence

How EffectiveGPU Cuts GPU Costs with Fine‑Grained Partitioning and Volcano Scheduling

This article details how SF Tech's EffectiveGPU (EGPU) platform redesigns GPU resource management on Kubernetes, introducing fine‑grained memory and compute partitioning, priority‑based scheduling, Volcano integration, and monitoring pipelines to dramatically improve utilization and reduce hardware costs for AI workloads.

AI Platform, GPU, GPU partitioning
23 min read
MaGe Linux Operations
Sep 8, 2025 · Big Data

Build Enterprise‑Grade HDFS HA and Optimize YARN Scheduling from Scratch

This comprehensive guide walks you through constructing a fault‑tolerant HDFS high‑availability architecture, configuring dual NameNodes with ZooKeeper and JournalNode clusters, fine‑tuning YARN resource schedulers, implementing monitoring, automated failover testing, and performance optimization, all backed by real‑world production experiences and code examples.

Big Data Operations, HDFS, YARN
24 min read
Kuaishou Tech
Aug 21, 2025 · Artificial Intelligence

How SeamlessFlow Doubles RL Training Throughput and Cuts Time by 62%

SeamlessFlow, an industrial‑scale reinforcement‑learning training framework released by Kuaipilot, decouples trainer and agents via a novel data plane, introduces a tag‑based resource scheduler, and eliminates pipeline bubbles, achieving up to a 100% boost in token throughput and a 62% reduction in overall training time across large‑model RL workloads.

Distributed Training, pipeline optimization, reinforcement learning
13 min read
Alibaba Cloud Native
Jul 25, 2025 · Cloud Native

How Apache RocketMQ Evolves into an AI‑Optimized Messaging Engine

The article explains how Apache RocketMQ has been re‑engineered for the AI era, addressing long‑running conversational workloads, scarce GPU resources, and multi‑agent workflow bottlenecks by introducing lightweight Lite‑Topic communication, intelligent resource scheduling, and cloud‑native architectural upgrades.

AI Messaging, Apache RocketMQ, Lite-Topic
18 min read
Youzan Coder
Jul 18, 2025 · Cloud Native

How Mixed Workloads Boost Kubernetes CPU Utilization by Over 40%

This article explains how Youzan transformed its Kubernetes clusters from static over‑commit scheduling to load‑balanced mixed workloads using Koordinator and the Anolis OS (Longxi) kernel, achieving higher CPU utilization, lower costs, and better resource management for both online and offline services.

Big Data, Cloud Native, Koordinator
10 min read
High Availability Architecture
Jul 7, 2025 · Artificial Intelligence

How TencentOS Server Is Redefining AI‑Ready Operating Systems

In a detailed interview, Tencent Cloud OS chief architect Du Zhen explains how TencentOS Server has evolved over 15 years from an internal platform to a multi‑industry, AI‑optimized operating system, outlining its OS‑for‑AI and AI‑for‑OS strategies, performance‑focused scheduling innovations, SWAP redesign, migration solutions, ecosystem building, and future vision.

AI, Cloud Native, Ecosystem
21 min read
Old Zhao – Management Systems Only
Jul 3, 2025 · Operations

Why Good Production Planning Beats Simple Scheduling: Mastering Resources, Rhythm, and Risk

This article explains how effective production planning goes beyond task ordering to coordinate resources, align market‑production‑supply rhythms, and manage risks, offering a four‑step framework—forecasting, scheduling, collaboration, and system mechanisms—to achieve stable, value‑driven manufacturing outcomes.

ERP, Supply Chain, manufacturing operations
9 min read
Alimama Tech
Feb 12, 2025 · Artificial Intelligence

HighService: A High‑Performance Pythonic AI Service Framework for Model Inference and Global Resource Scheduling

HighService, Alibaba’s Pythonic AI service framework, accelerates large‑model inference and maximizes GPU utilization by separating CPU‑GPU processes, offering out‑of‑the‑box quantization, parallelism and caching, and dynamically reallocating idle GPUs across clusters through a master‑worker scheduler to keep online latency low while boosting offline throughput for diffusion and LLM workloads.

AI Service, Distributed Systems, Python
16 min read
DataFunSummit
Feb 6, 2025 · Big Data

Migrating Big Data Workloads to Cloud‑Native Kubernetes: Challenges, Solutions, and Lessons from OPPO

This article describes how OPPO's big‑data team transitioned from traditional IDC and EMR environments to a cloud‑native Kubernetes architecture, detailing the motivations, design principles, elastic scaling challenges, custom solutions, and future directions for large‑scale data processing on the cloud.

Cloud Native, Kubernetes, elastic scaling
18 min read
Baidu Geek Talk
Feb 5, 2025 · Artificial Intelligence

How to Unlock Full GPU Efficiency for Enterprise AI Platforms

This article analyzes common GPU efficiency problems in enterprise AI compute platforms—such as low utilization, long fault‑resolution times, and limited performance gains—and presents three practical solutions: dynamic resource allocation, systematic fault‑tolerance, and system‑level tuning, illustrated with real‑world case studies.

AI Platform, GPU utilization, large model training
11 min read
Bilibili Tech
Jan 24, 2025 · Operations

Design and Implementation of a CDN Edge‑Node Scheduling System for Bilibili Live Streaming

The paper presents Bilibili’s multi‑layer CDN edge‑node scheduling system, which groups heterogeneous nodes by quality and price, uses cost‑aware and resource‑aware heuristics—including maximum‑flow regional borrowing and contextual‑bandit utilization prediction—to allocate bandwidth per business, achieving a 43 % bandwidth reuse increase, 33 % coverage boost, and markedly lower stall rates.

Bilibili, CDN, Cost Optimization
10 min read
Xiaohongshu Tech REDtech
Jan 16, 2025 · Cloud Native

Xiaohongshu Large-Scale Cloud-Native Mixed Deployment and Elasticity Practices

Xiaohongshu’s cloud‑native team transformed its over‑90% containerized services by introducing resource‑pooled mixed deployment, fine‑grained unified scheduling, and an elastic container pool with global HPA and cluster autoscaling—driving 35% of resources to mixed use, tens of millions of daily core‑hours, and roughly 30% cost savings while preparing for hybrid‑cloud expansion and FinOps.

Operating System, cloud-native, containerization
7 min read
AntTech
Nov 22, 2024 · Cloud Native

Large-Scale Cloud‑Edge Collaborative Key Technologies and Applications Based on Cloud‑Native Architecture Wins Zhejiang Province 2023 Scientific and Technological Progress Award

The award‑winning cloud‑native large‑scale cloud‑edge collaborative project, developed by Alipay, Zhejiang University, Xieyun Technology and Alibaba Cloud, delivers unified resource scheduling for millions of heterogeneous devices, achieving significant performance gains, extensive patents, papers, standards, and substantial economic benefits across multiple industries.

Alipay, Zhejiang award, cloud-native
4 min read
AsiaInfo Technology: New Tech Exploration
Nov 15, 2024 · Artificial Intelligence

How PaaS for AI Optimizes Large‑Model Workloads on Kubernetes

This article analyzes the three core technologies behind PaaS for AI—GPU resource management, node data optimization, and task scheduling—detailing their concepts, component architecture, critical workflows, technical advantages, and future challenges, while illustrating practical configurations with Kubernetes and Volcano examples.

AI, Big Data, Cloud Native
16 min read
Architects' Tech Alliance
Oct 23, 2024 · Cloud Computing

NVIDIA vGPU vs AMD MxGPU: Architecture, Scheduling, and Virtualization Trade‑offs

This article explains GPU virtualization, comparing NVIDIA's software‑based vGPU and AMD's hardware‑based MxGPU, detailing their architecture, required hardware, licensing, performance indicators, resource scheduling strategies, slicing limits, and the advantages and drawbacks of each approach for virtualized workloads.

AMD MxGPU, GPU virtualization, NVIDIA vGPU
12 min read
Cloud Native Technology Community
Aug 28, 2024 · Cloud Native

Kubernetes 1.31 Introduces the Alpha ‘distribute-cpus-across-cores’ Option in CPUManager Static Policy

Kubernetes 1.31 adds an alpha‑stage ‘distribute-cpus-across-cores’ option to the CPUManager static policy, which spreads allocated CPUs across physical cores instead of packing them onto as few cores as possible, reducing contention between hardware threads that share a core and improving performance for multi‑core, performance‑sensitive workloads.

CPUManager, Cloud Native, Kubernetes
7 min read
Alibaba Cloud Native
Jan 16, 2024 · Cloud Native

What’s New in Koordinator v1.4.0? A Deep Dive into Mixed‑Workload Scheduling and Resource Optimizations

Koordinator v1.4.0 introduces mixed K8s/YARN workloads, NUMA‑aware scheduling, CPU‑normalization, enhanced ElasticQuota with tree structures and non‑preemptible pods, cold‑memory reporting, QoS for non‑containerized applications, and a suite of bug‑fixes and performance improvements for enterprise Kubernetes clusters.

CPU normalization, ElasticQuota, Koordinator
24 min read
Architecture & Thinking
Jan 14, 2024 · Artificial Intelligence

How Baidu Scales Content Understanding to Trillions of Pages with AI Engineering

This article explains how Baidu processes internet‑scale content by applying deep AI‑driven understanding, detailing cost‑optimization, efficiency improvements, model‑service frameworks, resource‑scheduling systems, and batch‑compute platforms that together enable trillion‑level indexing and feature extraction.

AI Engineering, Batch Computing, HTAP storage
16 min read
360 Smart Cloud
Jan 10, 2024 · Cloud Native

Mixed Workload Scheduling (Co-location) in Kubernetes: Challenges, Core Technologies, and Koordinator Enhancements

The article analyzes low CPU utilization in pure online Kubernetes clusters, introduces mixed‑workload (online + offline) scheduling to improve resource efficiency, explains core techniques, kernel QoS features, and details Koordinator‑based implementations such as node resource reservation and scheduling adjustments.

Cloud Native, Koordinator, Kubernetes
13 min read
Xiaohongshu Tech REDtech
Nov 27, 2023 · Cloud Native

Mixed-Workload Scheduling and Resource Utilization Optimization in Xiaohongshu's Cloud-Native Platform

Xiaohongshu’s cloud‑native platform adopted a four‑stage mixed‑workload scheduling strategy—reusing idle nodes, whole‑machine time‑sharing, normal mixed pools, and a unified scheduler (Tusker) that coordinates CPU, GPU and memory across Kubernetes and YARN—boosting average cluster CPU utilization from under 20 % to over 45 % and delivering millions of low‑cost core‑hours while preserving QoS for latency‑sensitive, mid, and batch jobs.

Big Data, Kubernetes, QoS
19 min read
Huolala Tech
Nov 23, 2023 · Artificial Intelligence

How HuoLaLa Built a Custom ASR System to Boost Accuracy and Cut Costs

This article details HuoLaLa's development of an in‑house Automatic Speech Recognition system, covering its architecture, VAD optimization, language‑model and hot‑word enhancements, punctuation restoration, task and resource scheduling, and the resulting improvements in accuracy and cost efficiency.

ASR, Language Model, VAD
18 min read
Baidu Geek Talk
Nov 20, 2023 · Operations

How Baidu Scales Content Understanding to Trillion‑Scale: Architecture, Optimization, and Scheduling Insights

This article details Baidu Search's engineering practice for trillion‑scale content understanding, covering cost and efficiency challenges, model‑service framework, batch‑compute platform, resource‑scheduling system, HTAP storage design, and concrete optimization techniques such as multi‑process Python serving, dynamic batching, and two‑stage scheduling.

Baidu, Big Data, HTAP
18 min read
iQIYI Technical Product Team
Nov 17, 2023 · Big Data

Mixed Workload Co-location of Big Data and Online Services at iQIYI: Design, Implementation, and Results

iQIYI’s mixed‑workload system colocates Spark/Hive big‑data jobs with online video services by running YARN NodeManagers inside Kubernetes, using an Elastic YARN Operator, Koordinator‑driven CPU oversubscription, and remote shuffle, boosting online CPU utilization from ~9 % to over 40 % and saving tens of millions of RMB annually.

Big Data, Cloud Native, Kubernetes
19 min read
Didi Tech
Oct 19, 2023 · Cloud Native

Design and Implementation of a New Tiered Resource Guarantee System for Elastic Cloud Containers

The new tiered resource‑guarantee system for Didi’s elastic cloud containers defines S, A, and B priority levels with explicit over‑commit rules, upgrades OS, Kubernetes, kube‑odin, service‑tree, and CMP components, and thereby cuts CPU contention by up to 80%, reduces latency, improves scaling reliability, and lowers operational costs.

Container Management, Kubernetes, Overcommit
16 min read
Didi Tech
Oct 12, 2023 · Cloud Computing

Elastic Cloud Mixed Deployment: Architecture, Scheduling, Isolation, and Future Directions

Didi's Elastic Cloud uses mixed deployment to co‑locate diverse services, employing tiered guarantees, custom Kubernetes scheduling, profiling, rescheduling, and isolation‑cluster techniques to boost utilization while preserving QoS, with a roadmap for broader automation and interference detection.

Dynamic Scaling, mixed deployment, performance isolation
25 min read
DataFunSummit
Aug 25, 2023 · Big Data

Big Data Meets Cloud Native: Tencent's Cloud‑Native Big Data Architecture, Challenges, and Practices

This article explores how Tencent integrates big data with cloud‑native technologies, detailing the evolution, opportunities, challenges, the peak‑range (FengLuan) architecture, engine and scheduling layers, mixed‑workload strategies, runtime optimizations, and future directions for large‑scale data platforms.

Cloud Native, Tencent, distributed computing
17 min read
High Availability Architecture
May 26, 2023 · Big Data

Amiya: Dynamic Overcommit Component for Bilibili Offline Big Data Cluster Resource Scheduling

This article introduces Amiya, a self‑developed overcommit component that dynamically increases YARN memory and vCore capacity on Bilibili's offline big‑data clusters, details its architecture and the key implementation of its overcommit, eviction, and mixed‑deployment strategies, and evaluates its impact on resource utilization.

Cluster Management, Overcommit, Resource Optimization
22 min read
DataFunTalk
May 25, 2023 · Artificial Intelligence

Optimizing Distributed Cache for Large-Scale Deep Learning Training with Alluxio and SiloD

This article examines the storage bottlenecks in large‑scale AI training, evaluates local‑disk and Alluxio‑based distributed caching strategies, proposes uniform cache eviction and replica‑aware global policies, and introduces the SiloD framework for coordinated compute‑storage scheduling to dramatically improve GPU utilization and overall cluster throughput.

AI training, Alluxio, Cache Eviction
16 min read
Baidu Geek Talk
Mar 29, 2023 · Cloud Native

Punica: A Cloud‑Native Platform for Content Understanding Inference Services

Punica provides a cloud‑native, one‑stop platform that unifies Baidu’s content‑understanding inference services, automates testing, resource provisioning, and monitoring, and enables unattended, self‑healing operations with dynamic scaling and GPU scheduling, cutting onboarding time by half and reclaiming hundreds of GPUs.

AI inference, Inference Platform, Service Orchestration
14 min read
Baidu Geek Talk
Feb 24, 2023 · Cloud Native

Design and Resource Scheduling of Cloud‑Native AI and the PaddleFlow Workflow Engine

The article explains Baidu’s cloud‑native AI resource scheduling across single‑ and multi‑GPU nodes, describes the PaddleFlow Kubernetes‑based workflow engine with its hierarchical queues, advanced scheduling algorithms, unified storage, and how these technologies improve GPU utilization, reduce fragmentation, and simplify AI task orchestration.

AI, Kubernetes, PaddleFlow
23 min read
Tencent Advertising Technology
Feb 17, 2023 · Big Data

Cost Optimization and Mixed‑Resource Deployment in Tencent's Taiji Machine Learning Platform

The article details how Tencent's Taiji machine‑learning platform reduces training costs and improves efficiency for large‑scale advertising models by leveraging cloud‑native mixed‑resource strategies—including online idle, offline elastic, and compute‑resource sharing—while maintaining high service stability through advanced scheduling, fault‑tolerance, and resource‑prediction techniques.

Big Data, Cloud Native, Machine Learning Platform
16 min read
AntTech
Jan 17, 2023 · Cloud Computing

Insights on Green Computing: Challenges, Trends, and Solutions from Ant Group and Academia

The interview explores the rapid rise of green computing, examining energy consumption of data centers, low CPU utilization, software‑centric optimization, cloud‑native scheduling, AI and big‑data workloads, and future technical and educational efforts needed to achieve sustainable, low‑carbon computing at scale.

AI, data center energy, green computing
20 min read
Volcano Engine Developer Services
Dec 15, 2022 · Cloud Native

How ByteDance Scaled Cloud‑Native Infrastructure: Lessons in Multi‑Cluster Scheduling

ByteDance’s cloud‑native transformation details a layered technical system, multi‑year Kubernetes‑based evolution, unified multi‑cluster resource management, and hierarchical scheduling, illustrating how the company achieves high development speed, resource efficiency, and prepares for next‑generation serverless infrastructure.

Cloud Native, DevOps, Kubernetes
21 min read
Baidu Intelligent Cloud Tech Hub
Dec 14, 2022 · Artificial Intelligence

How Cloud‑Native AI Boosts Resource Efficiency with PaddleFlow

This article explains how cloud‑native AI leverages container‑based architectures and advanced scheduling algorithms—such as resource queues, gang scheduling, bin‑packing, GPU topology‑aware and ToR‑aware (top‑of‑rack) dispatch—to improve resource and engineering efficiency, and introduces Baidu’s AI workflow engine PaddleFlow with its design, features, and deployment options.

AI workflow, Cloud Native AI, GPU virtualization
25 min read
DataFunTalk
Dec 14, 2022 · Big Data

Cloud‑Native Big Data Solutions for the Financial Industry: Architecture, Deployment, Scheduling, and Resource Management

This article explains why the financial sector is moving its big‑data workloads to cloud‑native platforms, compares cloud‑native systems with traditional Hadoop, describes deployment options such as Serverless YARN and Arcee Operator, and details the high‑performance GRO scheduler, agent, and ResLake resource‑lake architecture that together improve resource utilization, reduce costs, and ensure reliable, low‑latency processing for finance workloads.

Big Data, Cloud Native, resource scheduling
19 min read
DataFunSummit
Dec 10, 2022 · Big Data

Applying Apache Spark in Guanyuan Self-Service Analytics System: Architecture, Challenges, and Solutions

This presentation details how Guanyuan Data leverages Apache Spark within its self‑service analytics platform, covering product features, flexible deployment, resource isolation, performance challenges, architectural solutions, and future cloud‑native enhancements to support thousands of users and massive query workloads.

Apache Spark, Big Data, Data Platform
14 min read
Volcano Engine Developer Services
Nov 28, 2022 · Cloud Native

How ByteDance Built a Massive Cloud‑Native Big Data Platform to Power TikTok

ByteDance’s cloud‑native computing team, led by Li Yakun, details how they transformed a Hadoop‑centric big‑data stack into a Kubernetes‑driven platform—customizing storage, middleware, and scheduling—to support petabyte‑scale workloads, achieve over 40% resource utilization, and sustain rapid product growth.

Big Data, Cloud Native, Spark
17 min read
Alibaba Cloud Developer
Oct 20, 2022 · Cloud Native

Why Kubernetes Remains Complex and How Serverless Designs Aim to Simplify It

The article examines the inherent and accidental complexities of Kubernetes as a distributed cluster manager, discusses challenges in resource scheduling, infrastructure diversity, and operational overhead, and explores how cloud‑native solutions such as managed services, nodeless and serverless Kubernetes architectures attempt to reduce these complexities while introducing new trade‑offs.

Cloud Native, Kubernetes, Operations
18 min read
vivo Internet Technology
Oct 9, 2022 · Artificial Intelligence

vivo Machine Learning Platform: Architecture Design and Practice

vivo’s machine‑learning platform, built for its massive app‑store and e‑commerce ecosystem, streamlines data processing, model training, and deployment through quota‑based resource management, a custom ultra‑large‑scale TensorFlow‑vlps framework, OpenAPI‑driven training, and Jupyter‑integrated interactive development, boosting efficiency for billions of samples and features.

Distributed Training, MLOps, Machine Learning Platform
12 min read
DataFunSummit
Sep 25, 2022 · Big Data

Practical Optimizations and Resource Management of Hadoop YARN at Xiaomi

This article shares Xiaomi's internal practices of Hadoop YARN, covering scheduling and resource optimization, elastic scheduling, node overcommit handling, federation architecture, metadata warehouse construction, and future plans to improve cluster utilization and cost efficiency.

Big Data, Hadoop, YARN
20 min read
Bilibili Tech
Aug 27, 2022 · Cloud Native

Mixed Workload Co-location Practices in Bilibili's Kubernetes Cloud Platform

Bilibili’s Kubernetes cloud platform boosts server utilization by co‑locating latency‑sensitive online services with batch‑oriented offline jobs on the same nodes, using custom schedulers, extended resources, dynamic CPU/memory isolation, and a management console, achieving average CPU usage around 35 % and significant cost savings.

Cloud Native, Co-location, Kubernetes
17 min read
Meituan Technology Team
Aug 11, 2022 · Cloud Native

LAR: Load Auto-Regulator System for Resource Utilization and Service Quality

The article analyzes Meituan’s self‑designed Load Auto‑Regulator (LAR), detailing its tiered resource‑pool architecture, dynamic load‑to‑static‑resource mapping, and QoS mechanisms that together raise data‑center CPU utilization by 5‑10% while keeping online service quality stable, and discusses its deployment in online and mixed‑workload scenarios.

Cloud Native, Cluster Management, Kubernetes
28 min read
AntTech
Aug 1, 2022 · Cloud Native

Green Computing Strategies and Cloud‑Native Architecture at the 2022 China Computing Power Conference

In his Ant Group presentation at the 2022 China Computing Power Conference, He Zhengyu outlined how cloud‑native upgrades, time‑slice scheduling, AI‑driven capacity prediction, and mixed online‑offline deployment have dramatically improved server utilization, cut carbon emissions, and driven open‑source contributions toward sustainable computing.

AI elasticity, data center efficiency, green computing
10 min read
DataFunTalk
Jul 21, 2022 · Big Data

Large-Scale Offline‑Online Mixed Deployment at Huya: Architecture, Challenges, and Solutions

This article describes Huya's large‑scale offline‑online mixed deployment, detailing the low resource‑utilization problems, the time‑sharing and elastic scheduling solutions, the containerized architecture, multi‑datacenter isolation, heterogeneous resource handling, stability safeguards, and the resulting performance improvements and future directions.

Big Data, Huya, containerization
13 min read
Volcano Engine Developer Services
Jul 12, 2022 · Cloud Native

How ByteDance Scales Cloud‑Native Infrastructure with Hybrid K8s Scheduling

ByteDance’s cloud‑native ecosystem combines a multi‑layered architecture, dynamic resource over‑provisioning control, hybrid online‑offline scheduling, and federated cluster management to boost container utilization from 23% to 63%, reduce costs by 40%, and support massive events like the 2021 Spring Festival Gala.

Cloud Native, hybrid deployment, large scale
16 min read
Huawei Cloud Developer Alliance
May 30, 2022 · Cloud Computing

How a Young Huawei Engineer Saved Millions with Cloud Resource Scheduling Optimization

This interview follows 25‑year‑old Huawei Cloud engineer Tong Hao, who leveraged his competition‑honed algorithm skills to develop a universal resource‑re‑scheduling solver that fills “Tetris‑like” gaps in data‑center capacity, cutting operational costs by tens of millions of yuan while advancing cloud security and intelligent scheduling.

Career Development, algorithm engineering, cloud computing
10 min read
JD Retail Technology
Apr 27, 2022 · Industry Insights

How JD Achieves Seamless Stability During Massive Sales Events

The article reviews the Global Information System Stability Summit and the detailed case study presented by JD technical architect Li Junliang on the engineering practices, observability, chaos engineering, and resource‑scheduling innovations that enable JD’s e‑commerce platform to handle peak sales traffic that spikes to hundreds of times the normal load.

Observability, chaos engineering, e‑commerce
7 min read
ITPUB
Apr 27, 2022 · Artificial Intelligence

How 58’s WPAI Platform Boosted AI Resource Utilization by Over 50%

This article details the design and optimization of 58.com’s WPAI machine learning platform, covering background, training‑task scheduling, elastic inference scaling, offline‑online resource mixing, and model‑inference acceleration, and shows how these techniques collectively raised GPU usage by 51% and CPU usage by 38% while cutting costs.

AI Platform, GPU utilization, Inference Acceleration
26 min read
Zuoyebang Tech Team
Apr 26, 2022 · Cloud Native

How Serverless Kubernetes Virtual Nodes Cut Costs and Boost Scalability

Zhang's team at Zuoyebang details their journey to serverless Kubernetes virtual nodes, explaining how elastic scaling, fine-grained scheduling, and cost‑effective resource utilization transformed high‑peak online services, while addressing challenges in scheduling, observability, performance, and multi‑cloud resilience.

Cost Optimization, Kubernetes, Serverless
11 min read
DataFunTalk
Feb 17, 2022 · Cloud Native

ByteDance's Cloud‑Native Transformation of Its Machine Learning Platform

This article explains how ByteDance redesigned its machine‑learning platform using cloud‑native principles, detailing motivations, the shift from YARN to Kubernetes, the implementation of PS‑Worker and AllReduce frameworks, unified operators, heterogeneous resource scheduling, elastic training, and future directions for large‑scale AI workloads.

cloud-native, elastic-training, heterogeneous-compute
15 min read
Amap Tech
Jan 6, 2022 · Mobile Development

Three‑Year Full‑Chain Performance Optimization at Gaode Map: Strategies, Practices, and Results

Over three years Gaode Map halved overall latency by systematically identifying bottlenecks, applying reverse‑order targeted fixes, establishing forward‑order long‑term controls, and deploying adaptive resource scheduling, engine acceleration, H5 container enhancements, high‑performance components, and CI automation, resulting in sustainable core‑chain performance improvements and a better user experience.

Engineering, Gaode Map, Mobile
18 min read
Baidu Geek Talk
Jan 5, 2022 · Cloud Native

Baidu Cloud‑Native Mixed Workload (Offline Co‑location) Technology Overview

Baidu’s mixed‑workload approach co‑locates offline batch jobs with latency‑sensitive online services on shared nodes, using a dynamic resource view, priority‑based scheduling, cpuset/NUMA isolation, eBPF policies, and predictive profiling, boosting CPU utilization above 40% and saving billions of RMB in total cost of ownership.

Kubernetes, Mixed Workload, cloud-native
17 min read
Baidu Tech Salon
Dec 31, 2021 · Industry Insights

How Baidu Boosted CPU Utilization by Up to 80% with Offline Mixed‑Tenant Scheduling

This article analyzes Baidu's offline mixed‑tenant technology that combines online and offline workloads on the same physical servers, detailing the resource‑usage problems, dynamic resource views, priority schemes, isolation mechanisms, high‑performance scheduling, and future directions for cloud‑native clusters.

Cloud Native, Kubernetes, cpu-utilization
18 min read
Alibaba Terminal Technology
Dec 22, 2021 · Mobile Development

How Gaode Map Halved Core Link Latency: A 3‑Year Mobile Performance Overhaul

Over three years Gaode Map continuously optimized its full‑link performance, cutting core‑link latency by half through a systematic three‑step approach—identifying bottlenecks, reverse‑specialized problem solving, and long‑term forward control—resulting in dramatically improved user experience across diverse device tiers.

Gaode Map, Performance Optimization, engineering process
18 min read
DataFunTalk
Nov 2, 2021 · Artificial Intelligence

Optimizing AI Platform Resource Efficiency: Scheduling Strategies for Deep Learning Inference and Training

The article outlines a technical exchange hosted by 58.com AI Lab and Tianjin University that discusses high‑efficiency AI computing, resource‑aware scheduling for both online inference and offline training, and methods to mitigate GPU under‑utilization and gray‑interference in distributed deep‑learning platforms.

AI, GPU utilization, Inference
4 min read
DataFunTalk
Jun 13, 2021 · Artificial Intelligence

GPU Virtual Sharing for AI Inference Services on Kubernetes

The article presents a GPU virtual‑sharing solution for AI inference workloads that isolates memory and compute resources via CUDA API interception, integrates with Kubernetes using the open‑source aliyun‑gpushare scheduler, and demonstrates doubled GPU utilization and minimal performance loss across multiple tests.

CUDA, GPU virtualization, Kubernetes
16 min read
iQIYI Technical Product Team
May 28, 2021 · Artificial Intelligence

iQIYI GPU Virtual Sharing for AI Inference: Architecture, Isolation, and Scheduling

iQIYI created a custom GPU‑virtual‑sharing system that intercepts CUDA calls to enforce per‑container memory limits, rewrites kernel launches for compute isolation, and integrates with a Kubernetes scheduler extender, allowing multiple AI inference containers to share a single V100 with minimal overhead and more than doubling overall GPU utilization.

AI inference, CUDA, GPU virtualization
16 min read
Alibaba Cloud Developer
May 19, 2021 · Cloud Computing

How to Optimize Cloud Resource Scheduling After Migration

After migrating to the cloud, enterprises must evaluate resource scale, cost pressure, and staffing before deciding whether to build their own scheduling system, and can choose among ECS, Dedicated Host, or private pool solutions, each with specific advantages, drawbacks, and suitable scenarios.

Auto Scaling, capacity planning, dedicated host
15 min read
DataFunTalk
Mar 3, 2021 · Big Data

Kwai Scheduler: Scaling YARN for Ultra‑Large Clusters at Kuaishou

This article presents Kuaishou's large‑scale offline computing challenges and describes how the team customized YARN and built the Kwai scheduler to achieve multi‑threaded, pluggable resource scheduling for clusters of tens of thousands of nodes, supporting diverse workloads such as ETL, ad‑hoc queries, machine‑learning training, and real‑time Flink jobs.

Cluster Optimization, Kwai Scheduler, YARN
15 min read
DataFunSummit
Feb 4, 2021 · Artificial Intelligence

Full-Stack Machine Learning Platform: Architecture, Key Factors, and Implementation Details

This article examines the evolution of user data, computing power, and models, and presents the design principles, key architectural factors, and practical implementation techniques for building a full‑stack machine learning platform that supports large‑scale data processing, distributed training, and low‑latency online serving.

Big Data Integration, Machine Learning Platform, data pipelines
15 min read
Amap Tech
Jan 15, 2021 · Mobile Development

Low-Cost Performance Optimization and Long-Term Control for Super Apps: Gaode Map Case Study

Gaode Map’s low‑cost, long‑term performance strategy for super‑apps combines an adaptive resource‑scheduling framework, full‑dimension monitoring, and closed‑loop control to cut startup time over 70%, shrink memory use 30% and binary size 20%, delivering up to three‑fold speed gains on low‑end devices while preserving development efficiency.

Gaode Map, Super App, mobile app
13 min read
Tongcheng Travel Technology Center
May 8, 2020 · Cloud Native

Design and Practices of Tongcheng‑Elong Container Platform: Cloud‑Native Architecture, Scheduling, and Resource Optimization

This article details Tongcheng‑Elong's journey from bare‑metal to a Kubernetes‑based cloud‑native platform, describing its architecture, the challenges of isolation, scheduling, resource utilization and promotion, and the engineering solutions—including custom scheduling, CPU binding, IP fixation, and over‑commit strategies—implemented to improve efficiency and reliability.

Kubernetes, container platform, resource scheduling
16 min read
Alibaba Cloud Developer
Jan 23, 2020 · Cloud Native

How Alibaba Cloud Tackles Bursty Peak Loads with Container‑Based Hybrid Deployment

Alibaba Cloud’s award‑winning solutions address bursty peak‑load challenges by integrating container‑based hybrid deployment, intelligent scheduling, and resource isolation, enabling massive e‑commerce events, gene‑computing tasks, and national ticketing systems to achieve high performance, low cost, and near‑zero incremental investment.

Alibaba Cloud, burst traffic, hybrid deployment
17 min read
Tencent Cloud Developer
Jan 9, 2020 · Fundamentals

TencentOS Kernel: Tencent Cloud's Open-Source Server OS

Tencent Cloud has open‑sourced its server‑grade operating system kernel, TencentOS Kernel, which offers cloud‑optimized resource scheduling, enhanced container isolation, ARM64 hot‑patching, and performance‑security optimizations that boost CPU utilization and lower operating costs, extending the TencentOS family after the tiny IoT release.

Operating Systems, Performance Optimization, open-source
11 min read
Big Data Technology & Architecture
Oct 20, 2019 · Cloud Native

Meituan-Dianping Kubernetes Cluster Management and Optimization Practices

This article details Meituan-Dianping's evolution from custom Docker‑based scaling to a Kubernetes‑driven, cloud‑native cluster management platform (HULK), describing its architecture, scheduler enhancements, Kubelet modifications, and resource‑optimization strategies for large‑scale operations.

Cloud Native, Cluster Management, Kubernetes
17 min read
Qunar Tech Salon
Aug 22, 2019 · Big Data

Performance Optimization Practices for Meituan's Hadoop YARN Fair Scheduler

This article details Meituan's experience optimizing the Hadoop YARN fair scheduler, covering background challenges, architectural components, resource abstractions, scheduling flow, performance metrics, a series of code‑level optimizations, stability strategies for production rollout, and future directions for large‑scale cluster scheduling.

Big Data, Fair Scheduler, Load Simulation
23 min read
JD Tech
Jan 17, 2019 · Operations

Technical Overview of JD's Archimedes Resource Scheduling System

The article presents a detailed technical analysis of JD's Archimedes project, describing its evolution from JDOS 2.0 to a large‑scale container scheduling platform that dramatically improves resource utilization, deployment speed, and cost efficiency across JD’s data centers.

AI, Big Data, JD
6 min read
Liulishuo Tech Team
Nov 29, 2018 · Cloud Native

Building an Efficient Machine Learning Training Platform on Kubernetes

This article describes how the Liulishuo algorithm team designed and implemented a Kubernetes‑based training platform that addresses the iterative, data‑intensive, and resource‑dynamic characteristics of machine learning workloads by pooling resources, enabling rapid provisioning, and optimizing scheduling and storage.

Cloud Native, Kubernetes, machine learning
9 min read
dbaplus Community
Nov 11, 2018 · Operations

How 360 Built an AI‑Powered Ops System to Cut Costs and Boost Efficiency

360’s AI‑ops team shares a year‑long journey of turning massive operational data into intelligent solutions—covering background, their AIOps philosophy, practical modules like capacity forecasting, host classification, resource reclamation, smart MySQL scheduling, anomaly detection, alarm reduction, and root‑cause analysis—to dramatically improve cost, efficiency, and reliability.

Capacity Forecasting, aiops, anomaly detection
16 min read
Alibaba Cloud Developer
Oct 23, 2018 · Operations

Unlocking Resource Efficiency: Alibaba’s Mixed‑Deployment (Co‑location) Strategy

This article explains how Alibaba’s mixed‑deployment (co‑location) technology combines online transaction services and offline compute workloads on shared physical servers, detailing its architecture, scheduling mechanisms, resource‑concession strategies, achieved performance gains, and future directions for large‑scale e‑commerce infrastructure.

Alibaba, Co-location, Operations
23 min read
360 Tech Engineering
Sep 29, 2018 · Operations

Design and Implementation of a Multi‑Task Scheduling System Based on Apache Mesos

This article describes how we identified underutilized CPU and memory resources in our company's servers, evaluated Kubernetes versus Apache Mesos, and built a non‑intrusive, Mesos‑based multi‑task scheduling system with dynamic resource reservation, monitoring, task isolation, and cluster‑wide observability, while addressing deployment challenges.

Cluster Management, Docker alternative, Mesos
11 min read
Alibaba Cloud Developer
Jun 26, 2018 · Operations

How Scheduling Algorithms Power Efficient Data Center Resource Management

Scheduling algorithms are a crucial component of cluster resource management systems, determining where containerized tasks run to ensure resource needs, high availability, fault tolerance, and cost efficiency across individual containers, applications, and entire data centers, while also supporting Alibaba’s global scheduling challenge.

Cluster Management, Data center, algorithm competition
10 min read
Efficient Ops
Mar 6, 2018 · Cloud Computing

How Alibaba Cuts Costs by 30% with Co‑Location Scheduling (Mix‑Deploy)

This article explains Alibaba's co‑location (混部, "mixed deployment") technology that mixes online services and batch compute on the same physical servers, detailing its background, key characteristics, scheduling architecture, resource isolation mechanisms, cost‑saving formulas, and future roadmap, showing how it boosts utilization and reduces expenses.

Alibaba, Co-location, cloud computing
15 min read
Alibaba Cloud Developer
Mar 5, 2018 · Operations

Boosting Test Environment Stability: Automated Container Replacement & Buffer Pools

This article analyzes the instability of Alibaba's test environment container provisioning, identifies root causes, and presents a comprehensive solution—including automatic container replacement, a buffer pool, and resource‑pool rationalization—that raised the container success rate to 99.9% and stabilized performance.

Operations, buffer pool, container orchestration
9 min read
Alibaba Cloud Developer
Feb 12, 2018 · Operations

How Alibaba’s Co‑Location (Mixed‑Deployment) Cuts Costs and Boosts Utilization

Alibaba’s mixed‑deployment (Co‑location) technology combines online services and batch compute tasks on shared physical resources, using priority‑based scheduling, resource isolation, and dynamic memory management to dramatically improve CPU utilization, cut infrastructure costs, and maintain service level objectives during peak traffic events.

Co-location, Cost Optimization, cloud infrastructure
16 min read
Architecture Digest
Dec 23, 2017 · Cloud Computing

Design and Practices of an Elastic Computing Platform for Efficient Resource Utilization

This article describes the design, challenges, and operational practices of a cloud‑native elastic computing platform that reuses idle resources from production servers to support massive image compression, video transcoding, AI inference, and log processing while ensuring online services maintain performance, latency, and reliability.

OOM handling, Performance Monitoring, cloud infrastructure
13 min read
Alibaba Cloud Developer
Dec 22, 2017 · Cloud Computing

How Alibaba Scaled Double 11: Inside the Cloud Architecture Evolution

Over nine years, Alibaba transformed its Double 11 e‑commerce platform from a centralized system to a highly elastic, cloud‑native architecture, employing distributed design, multi‑active regions, unified scheduling, containerization with Pouch, and hybrid deployment to dramatically cut costs and boost peak throughput.

Alibaba, containerization, resource scheduling
19 min read
Alibaba Cloud Developer
Dec 19, 2017 · Operations

How Alibaba’s TPP Intelligent Scheduler Boosts Resource Utilization and Handles Double‑11 Traffic

The article details Alibaba's Taobao Personalization Platform (TPP) intelligent scheduling system, explaining its architecture, optimization algorithms, convergence logic, and performance results that dramatically improve CPU utilization and automate scaling during both regular operation and high‑traffic events like Double‑11.

Alibaba, Auto Scaling, cloud operations
21 min read
Alibaba Cloud Developer
Sep 6, 2017 · Operations

How Mixed Offline/Online Scheduling Boosted Alibaba’s Data Center Utilization by 30%

The article explains how rapid internet growth has expanded data centers, why traditional operations fall short, presents a simple utilization formula, shows Alibaba’s mixed offline‑online scheduling experiment that raised server usage from 10% to over 40%, and announces an open dataset for academic research.

Alibaba, Cluster Management, data center utilization
7 min read
21CTO
Aug 19, 2017 · Cloud Computing

How Tencent’s Elastic Platform Powers Billions of Daily Image Compressions with 6K Containers

Tencent’s elastic computing platform replaces 24,000 physical servers with just 6,000 containers, delivering sustainable compute for billions of daily image compressions while also supporting video transcoding, Spark jobs, and AI workloads through dynamic resource isolation, named services, and intelligent scheduling.

cloud infrastructure, container orchestration, elastic computing
8 min read
High Availability Architecture
Mar 7, 2017 · Cloud Native

Tencent Games’ 3‑Year Journey of Kubernetes Adoption and Optimization for Large‑Scale Online Gaming

This article details how Tencent Games built, customized, and continuously optimized a Kubernetes‑based container platform over three years to support tens of thousands of game containers, covering deployment modes, scheduler enhancements, network solutions, resource quotas, monitoring, storage, and the transition to micro‑service architectures.

Cloud Native, Kubernetes, Microservices
19 min read
Alibaba Cloud Developer
Aug 25, 2016 · Operations

Comparing Modern Data‑Center Schedulers: Borg, Mesos, Omega, Kubernetes & Zeus

This article examines resource allocation philosophies—auction, budgeting, and preemption—and compares the architectures, data models, and APIs of major schedulers such as Borg, Omega, Mesos, Kubernetes, and Alibaba’s Zeus, while also exploring sharing strategies, task classifications, utilization metrics, and predictive techniques for efficient resource management.

Borg, Cluster Management, Kubernetes
34 min read
Efficient Ops
Jul 26, 2016 · Cloud Computing

How China Mobile Zhejiang Built a Private Cloud with MESOS – Key Lessons

This article details China Mobile Zhejiang's journey from early virtualization to a full private‑cloud platform built on MESOS, covering why MESOS was chosen, the evolution of their cloud stages, DCOS implementation, automatic scaling, service discovery, and the operational benefits achieved.

DCOS, Mesos, private cloud
23 min read
MaGe Linux Operations
Jun 6, 2016 · Cloud Native

Borg, Omega, Mesos, Kubernetes vs Alibaba Zeus: Key Resource Scheduling Strategies

This article compares the resource allocation philosophies, architectural designs, data handling, and API models of Borg, Omega, Mesos, Kubernetes, and Alibaba's Zeus, discussing auction, budgeting, preemption, sharing models, task types, utilization, prediction, and practical implementation details for large‑scale cloud native environments.

Alibaba Zeus, Borg, Kubernetes
34 min read
Alibaba Cloud Infrastructure
May 4, 2016 · Cloud Computing

Alibaba Zeus Resource Scheduling System: Architecture, Virtualization, and Operational Practices

The article examines Alibaba's Zeus resource scheduling platform, detailing its background, problem analysis, container‑based virtualization, distributed architecture, strategies for improving resource utilization such as overselling and hybrid deployment, as well as stability measures and automation for large‑scale operations.

Alibaba, Operations, cloud computing
12 min read
High Availability Architecture
Apr 27, 2016 · Operations

Mesos Architecture and Its Deployment at Qunar: Framework Unification and Operational Strategies

This article explains the Mesos distributed system kernel, its master‑slave architecture, fine‑grained resource scheduling, and how Qunar leverages Mesos and Marathon for log processing, Spark, Alluxio, and multi‑tenant services while addressing framework unification, HA, service discovery, and operational challenges.

Cluster Management, Framework, Marathon
14 min read
Efficient Ops
Jun 17, 2015 · Cloud Native

Project Eru: Scaling a Custom Docker Orchestration Platform to 10k Nodes

Project Eru, a homegrown Docker‑based orchestration system developed at Mango TV, replaces earlier PaaS attempts with a stateless, scalable core and agent architecture, leveraging Redis clusters, MacVLAN networking, and fine‑grained CPU allocation to achieve rapid, automated scaling across thousands of containers.

Macvlan, container orchestration, redis
22 min read