Tagged articles

100 articles

Mar 13, 2026 · Cloud Native

Boosting Autonomous Driving Data Pipelines with Koordinator’s ElasticQuota and GPU Sharing

This article details how a leading autonomous‑driving company tackled multi‑tenant resource contention, low GPU utilization, and distributed‑task deadlocks on a heterogeneous Kubernetes cluster by adopting Koordinator’s ElasticQuota, Reservation, Gang and Device‑Share features, achieving higher allocation rates, better fairness, and significantly improved GPU efficiency.

ElasticQuota · GPU Sharing · Koordinator

0 likes · 20 min read

dbaplus Community

Feb 9, 2026 · Artificial Intelligence

How EffectiveGPU Cuts GPU Costs with Fine‑Grained Partitioning and Volcano Scheduling

This article details how SF Tech's EffectiveGPU (EGPU) platform redesigns GPU resource management on Kubernetes, introducing fine‑grained memory and compute partitioning, priority‑based scheduling, Volcano integration, and monitoring pipelines to dramatically improve utilization and reduce hardware costs for AI workloads.

AI Platform · GPU · GPU partitioning

0 likes · 23 min read

MaGe Linux Operations

Sep 8, 2025 · Big Data

Build Enterprise‑Grade HDFS HA and Optimize YARN Scheduling from Scratch

This comprehensive guide walks you through constructing a fault‑tolerant HDFS high‑availability architecture, configuring dual NameNodes with ZooKeeper and JournalNode clusters, fine‑tuning YARN resource schedulers, implementing monitoring, automated failover testing, and performance optimization, all backed by real‑world production experiences and code examples.

Big Data Operations · HDFS · YARN

0 likes · 24 min read

Kuaishou Tech

Aug 21, 2025 · Artificial Intelligence

How SeamlessFlow Doubles RL Training Throughput and Cuts Time by 62%

SeamlessFlow, an industrial‑scale reinforcement‑learning training framework released by Kuaipilot, decouples trainer and agents via a novel data‑plane, introduces a tag‑based resource scheduler, and eliminates pipeline bubbles, achieving up to 100% token‑throughput boost and 62% reduction in overall training time across large‑model RL workloads.

Distributed Training · pipeline optimization · reinforcement learning

0 likes · 13 min read

Alibaba Cloud Native

Jul 25, 2025 · Cloud Native

How Apache RocketMQ Evolves into an AI‑Optimized Messaging Engine

The article explains how Apache RocketMQ has been re‑engineered for the AI era, addressing long‑running conversational workloads, scarce GPU resources, and multi‑agent workflow bottlenecks by introducing lightweight Lite‑Topic communication, intelligent resource scheduling, and cloud‑native architectural upgrades.

AI Messaging · Apache RocketMQ · Lite-Topic

0 likes · 18 min read

Youzan Coder

Jul 18, 2025 · Cloud Native

How Mixed Workloads Boost Kubernetes CPU Utilization by Over 40%

This article explains how Youzan transformed its Kubernetes clusters from static over‑commit scheduling to load‑balanced mixed workloads using Koordinator and the Anolis (Longxi) kernel, achieving higher CPU utilization, lower costs, and better resource management for both online and offline services.

Big Data · Cloud Native · Koordinator

0 likes · 10 min read

High Availability Architecture

Jul 7, 2025 · Artificial Intelligence

How TencentOS Server Is Redefining AI‑Ready Operating Systems

In a detailed interview, Tencent Cloud OS chief architect Du Zhen explains how TencentOS Server has evolved over 15 years from an internal platform to a multi‑industry, AI‑optimized operating system, outlining its OS‑for‑AI and AI‑for‑OS strategies, performance‑focused scheduling innovations, SWAP redesign, migration solutions, ecosystem building, and future vision.

AI · Cloud Native · Ecosystem

0 likes · 21 min read

Old Zhao – Management Systems Only

Jul 3, 2025 · Operations

Why Good Production Planning Beats Simple Scheduling: Mastering Resources, Rhythm, and Risk

This article explains how effective production planning goes beyond task ordering to coordinate resources, align market‑production‑supply rhythms, and manage risks, offering a four‑step framework—forecasting, scheduling, collaboration, and system mechanisms—to achieve stable, value‑driven manufacturing outcomes.

ERP · Supply Chain · manufacturing operations

0 likes · 9 min read

Alimama Tech

Feb 12, 2025 · Artificial Intelligence

HighService: A High‑Performance Pythonic AI Service Framework for Model Inference and Global Resource Scheduling

HighService, Alibaba’s Pythonic AI service framework, accelerates large‑model inference and maximizes GPU utilization by separating CPU‑GPU processes, offering out‑of‑the‑box quantization, parallelism and caching, and dynamically reallocating idle GPUs across clusters through a master‑worker scheduler to keep online latency low while boosting offline throughput for diffusion and LLM workloads.

AI Service · Distributed Systems · Python

0 likes · 16 min read

DataFunSummit

Feb 6, 2025 · Big Data

Migrating Big Data Workloads to Cloud‑Native Kubernetes: Challenges, Solutions, and Lessons from OPPO

This article describes how OPPO's big‑data team transitioned from traditional IDC and EMR environments to a cloud‑native Kubernetes architecture, detailing the motivations, design principles, elastic scaling challenges, custom solutions, and future directions for large‑scale data processing on the cloud.

Cloud Native · Kubernetes · elastic scaling

0 likes · 18 min read

Baidu Geek Talk

Feb 5, 2025 · Artificial Intelligence

How to Unlock Full GPU Efficiency for Enterprise AI Platforms

This article analyzes common GPU efficiency problems in enterprise AI compute platforms—such as low utilization, long fault‑resolution times, and limited performance gains—and presents three practical solutions: dynamic resource allocation, systematic fault‑tolerance, and system‑level tuning, illustrated with real‑world case studies.

AI Platform · GPU utilization · large model training

0 likes · 11 min read

Bilibili Tech

Jan 24, 2025 · Operations

Design and Implementation of a CDN Edge‑Node Scheduling System for Bilibili Live Streaming

The paper presents Bilibili’s multi‑layer CDN edge‑node scheduling system, which groups heterogeneous nodes by quality and price, uses cost‑aware and resource‑aware heuristics—including maximum‑flow regional borrowing and contextual‑bandit utilization prediction—to allocate bandwidth per business, achieving a 43 % bandwidth reuse increase, 33 % coverage boost, and markedly lower stall rates.

Bilibili · CDN · Cost Optimization

0 likes · 10 min read

Xiaohongshu Tech REDtech

Jan 16, 2025 · Cloud Native

Xiaohongshu Large-Scale Cloud-Native Mixed Deployment and Elasticity Practices

Xiaohongshu’s cloud‑native team, whose services are more than 90% containerized, introduced resource‑pooled mixed deployment, fine‑grained unified scheduling, and an elastic container pool with global HPA and cluster autoscaling, moving 35% of resources into mixed use, supplying tens of millions of core‑hours daily, and saving roughly 30% in cost while preparing for hybrid‑cloud expansion and FinOps.

Operating System · cloud-native · containerization

0 likes · 7 min read

AntTech

Nov 22, 2024 · Cloud Native

Large-Scale Cloud‑Edge Collaborative Key Technologies and Applications Based on Cloud‑Native Architecture Wins Zhejiang Province 2023 Scientific and Technological Progress Award

The award‑winning cloud‑native large‑scale cloud‑edge collaborative project, developed by Alipay, Zhejiang University, Xieyun Technology and Alibaba Cloud, delivers unified resource scheduling for millions of heterogeneous devices, achieving significant performance gains, extensive patents, papers, standards, and substantial economic benefits across multiple industries.

Alipay · Zhejiang award · cloud-native

0 likes · 4 min read

AsiaInfo Technology: New Tech Exploration

Nov 15, 2024 · Artificial Intelligence

How PaaS for AI Optimizes Large‑Model Workloads on Kubernetes

This article analyzes the three core technologies behind PaaS for AI—GPU resource management, node data optimization, and task scheduling—detailing their concepts, component architecture, critical workflows, technical advantages, and future challenges, while illustrating practical configurations with Kubernetes and Volcano examples.

AI · Big Data · Cloud Native

0 likes · 16 min read

Architects' Tech Alliance

Oct 23, 2024 · Cloud Computing

NVIDIA vGPU vs AMD MxGPU: Architecture, Scheduling, and Virtualization Trade‑offs

This article explains GPU virtualization, comparing NVIDIA's software‑based vGPU and AMD's hardware‑based MxGPU, detailing their architecture, required hardware, licensing, performance indicators, resource scheduling strategies, slicing limits, and the advantages and drawbacks of each approach for virtualized workloads.

AMD MxGPU · GPU virtualization · NVIDIA vGPU

0 likes · 12 min read

360 Zhihui Cloud Developer

Oct 15, 2024 · Cloud Computing

How 360’s OpenStack Scheduler Optimizes Multi‑Cloud Resource Allocation

This article explains how 360’s cloud platform uses a three‑layer architecture and Nova‑scheduler to manage thousands of servers and tens of thousands of VMs across multiple OpenStack clusters, detailing scheduling policies, resource‑pool handling, current challenges, and future improvement plans.

OpenStack · multi-cloud · nova-scheduler

0 likes · 10 min read

Cloud Native Technology Community

Aug 28, 2024 · Cloud Native

Kubernetes 1.31 Introduces the Alpha ‘distribute-cpus-across-cores’ Option in CPUManager Static Policy

Kubernetes 1.31 adds an alpha‑stage ‘distribute-cpus-across-cores’ option to the CPUManager static policy, allowing CPUs to be spread across physical cores for better cache locality, reduced contention, and improved performance in multi‑core and performance‑sensitive workloads.

CPUManager · Cloud Native · Kubernetes

0 likes · 7 min read

Alibaba Cloud Native

Jan 16, 2024 · Cloud Native

What’s New in Koordinator v1.4.0? A Deep Dive into Mixed‑Workload Scheduling and Resource Optimizations

Koordinator v1.4.0 introduces mixed K8s/YARN workloads, NUMA‑aware scheduling, CPU‑normalization, enhanced ElasticQuota with tree structures and non‑preemptible pods, cold‑memory reporting, QoS for non‑containerized applications, and a suite of bug‑fixes and performance improvements for enterprise Kubernetes clusters.

CPU normalization · ElasticQuota · Koordinator

0 likes · 24 min read

Architecture & Thinking

Jan 14, 2024 · Artificial Intelligence

How Baidu Scales Content Understanding to Trillions of Pages with AI Engineering

This article explains how Baidu processes internet‑scale content by applying deep AI‑driven understanding, detailing cost‑optimization, efficiency improvements, model‑service frameworks, resource‑scheduling systems, and batch‑compute platforms that together enable trillion‑level indexing and feature extraction.

AI Engineering · Batch Computing · HTAP storage

0 likes · 16 min read

360 Smart Cloud

Jan 10, 2024 · Cloud Native

Mixed Workload Scheduling (Co-location) in Kubernetes: Challenges, Core Technologies, and Koordinator Enhancements

The article analyzes low CPU utilization in pure online Kubernetes clusters, introduces mixed‑workload (online + offline) scheduling to improve resource efficiency, explains core techniques, kernel QoS features, and details Koordinator‑based implementations such as node resource reservation and scheduling adjustments.

Cloud Native · Koordinator · Kubernetes

0 likes · 13 min read

Xiaohongshu Tech REDtech

Nov 27, 2023 · Cloud Native

Mixed-Workload Scheduling and Resource Utilization Optimization in Xiaohongshu's Cloud-Native Platform

Xiaohongshu’s cloud‑native platform adopted a four‑stage mixed‑workload scheduling strategy—reusing idle nodes, whole‑machine time‑sharing, normal mixed pools, and a unified scheduler (Tusker) that coordinates CPU, GPU and memory across Kubernetes and YARN—boosting average cluster CPU utilization from under 20 % to over 45 % and delivering millions of low‑cost core‑hours while preserving QoS for latency‑sensitive, mid, and batch jobs.

Big Data · Kubernetes · QoS

0 likes · 19 min read

Huolala Tech

Nov 23, 2023 · Artificial Intelligence

How HuoLaLa Built a Custom ASR System to Boost Accuracy and Cut Costs

This article details HuoLaLa's development of an in‑house Automatic Speech Recognition system, covering its architecture, VAD optimization, language‑model and hot‑word enhancements, punctuation restoration, task and resource scheduling, and the resulting improvements in accuracy and cost efficiency.

ASR · Language Model · VAD

0 likes · 18 min read

Baidu Geek Talk

Nov 20, 2023 · Operations

How Baidu Scales Content Understanding to Trillion‑Scale: Architecture, Optimization, and Scheduling Insights

This article details Baidu Search's engineering practice for trillion‑scale content understanding, covering cost and efficiency challenges, model‑service framework, batch‑compute platform, resource‑scheduling system, HTAP storage design, and concrete optimization techniques such as multi‑process Python serving, dynamic batching, and two‑stage scheduling.

Baidu · Big Data · HTAP

0 likes · 18 min read

iQIYI Technical Product Team

Nov 17, 2023 · Big Data

Mixed Workload Co-location of Big Data and Online Services at iQIYI: Design, Implementation, and Results

iQIYI’s mixed‑workload system colocates Spark/Hive big‑data jobs with online video services by running YARN NodeManagers inside Kubernetes, using an Elastic YARN Operator, Koordinator‑driven CPU oversubscription, and remote shuffle, boosting online CPU utilization from ~9 % to over 40 % and saving tens of millions of RMB annually.

Big Data · Cloud Native · Kubernetes

0 likes · 19 min read

Didi Tech

Oct 19, 2023 · Cloud Native

Design and Implementation of a New Tiered Resource Guarantee System for Elastic Cloud Containers

The new tiered resource‑guarantee system for Didi’s elastic cloud containers defines S, A, and B priority levels with explicit over‑commit rules, upgrades OS, Kubernetes, kube‑odin, service‑tree, and CMP components, and thereby cuts CPU contention by up to 80%, reduces latency, improves scaling reliability, and lowers operational costs.

Container Management · Kubernetes · Overcommit

0 likes · 16 min read

Didi Tech

Oct 12, 2023 · Cloud Computing

Elastic Cloud Mixed Deployment: Architecture, Scheduling, Isolation, and Future Directions

Didi's Elastic Cloud uses mixed deployment to co‑locate diverse services, employing tiered guarantees, custom Kubernetes scheduling, profiling, rescheduling, and isolation‑cluster techniques to boost utilization while preserving QoS, with a roadmap for broader automation and interference detection.

Dynamic Scaling · mixed deployment · performance isolation

0 likes · 25 min read

DataFunSummit

Aug 25, 2023 · Big Data

Big Data Meets Cloud Native: Tencent's Cloud‑Native Big Data Architecture, Challenges, and Practices

This article explores how Tencent integrates big data with cloud‑native technologies, detailing the evolution, opportunities, challenges, the peak‑range (FengLuan) architecture, engine and scheduling layers, mixed‑workload strategies, runtime optimizations, and future directions for large‑scale data platforms.

Cloud Native · Tencent · distributed computing

0 likes · 17 min read

High Availability Architecture

May 26, 2023 · Big Data

Amiya: Dynamic Overcommit Component for Bilibili Offline Big Data Cluster Resource Scheduling

This article introduces Amiya, a self‑developed overcommit component that dynamically increases YARN memory and vCore capacity on Bilibili's offline big‑data clusters, details its architecture and the key implementation of its overcommit, eviction, and mixed‑deployment strategies, and evaluates its impact on resource utilization.

Cluster Management · Overcommit · Resource Optimization

0 likes · 22 min read

DataFunTalk

May 25, 2023 · Artificial Intelligence

Optimizing Distributed Cache for Large-Scale Deep Learning Training with Alluxio and SiloD

This article examines the storage bottlenecks in large‑scale AI training, evaluates local‑disk and Alluxio‑based distributed caching strategies, proposes uniform cache eviction and replica‑aware global policies, and introduces the SiloD framework for coordinated compute‑storage scheduling to dramatically improve GPU utilization and overall cluster throughput.

AI training · Alluxio · Cache Eviction

0 likes · 16 min read

High Availability Architecture

Apr 3, 2023 · Cloud Native

Design and Implementation of Punica: A One‑Stop, Unattended AI Inference Platform

The article describes Punica, a cloud‑native, function‑as‑a‑service platform that unifies content‑understanding inference services through a one‑stop portal and unattended operations, improving deployment speed, resource utilization, and reducing manual effort for AI model serving.

AI inference · FaaS · Service Orchestration

0 likes · 13 min read

Baidu Geek Talk

Mar 29, 2023 · Cloud Native

Punica: A Cloud‑Native Platform for Content Understanding Inference Services

Punica provides a cloud‑native, one‑stop platform that unifies Baidu’s content‑understanding inference services, automates testing, resource provisioning, and monitoring, and enables unattended, self‑healing operations with dynamic scaling and GPU scheduling, cutting onboarding time by half and reclaiming hundreds of GPUs.

AI inference · Inference Platform · Service Orchestration

0 likes · 14 min read

Baidu Geek Talk

Feb 24, 2023 · Cloud Native

Design and Resource Scheduling of Cloud‑Native AI and the PaddleFlow Workflow Engine

The article explains Baidu’s cloud‑native AI resource scheduling across single‑ and multi‑GPU nodes, describes the PaddleFlow Kubernetes‑based workflow engine with its hierarchical queues, advanced scheduling algorithms, unified storage, and how these technologies improve GPU utilization, reduce fragmentation, and simplify AI task orchestration.

AI · Kubernetes · PaddleFlow

0 likes · 23 min read

Tencent Advertising Technology

Feb 17, 2023 · Big Data

Cost Optimization and Mixed‑Resource Deployment in Tencent's Taiji Machine Learning Platform

The article details how Tencent's Taiji machine‑learning platform reduces training costs and improves efficiency for large‑scale advertising models by leveraging cloud‑native mixed‑resource strategies—including online idle, offline elastic, and compute‑resource sharing—while maintaining high service stability through advanced scheduling, fault‑tolerance, and resource‑prediction techniques.

Big Data · Cloud Native · Machine Learning Platform

0 likes · 16 min read

AntTech

Jan 17, 2023 · Cloud Computing

Insights on Green Computing: Challenges, Trends, and Solutions from Ant Group and Academia

The interview explores the rapid rise of green computing, examining energy consumption of data centers, low CPU utilization, software‑centric optimization, cloud‑native scheduling, AI and big‑data workloads, and future technical and educational efforts needed to achieve sustainable, low‑carbon computing at scale.

AI · data center energy · green computing

0 likes · 20 min read

Volcano Engine Developer Services

Dec 15, 2022 · Cloud Native

How ByteDance Scaled Cloud‑Native Infrastructure: Lessons in Multi‑Cluster Scheduling

ByteDance’s cloud‑native transformation details a layered technical system, multi‑year Kubernetes‑based evolution, unified multi‑cluster resource management, and hierarchical scheduling, illustrating how the company achieves high development speed, resource efficiency, and prepares for next‑generation serverless infrastructure.

Cloud Native · DevOps · Kubernetes

0 likes · 21 min read

Baidu Intelligent Cloud Tech Hub

Dec 14, 2022 · Artificial Intelligence

How Cloud‑Native AI Boosts Resource Efficiency with PaddleFlow

This article explains how cloud‑native AI leverages container‑based architectures and advanced scheduling algorithms (resource queues, gang scheduling, bin‑packing, and GPU topology‑aware and ToR‑aware dispatch) to improve resource and engineering efficiency, and introduces Baidu’s AI workflow engine PaddleFlow with its design, features, and deployment options.

AI workflow · Cloud Native AI · GPU virtualization

0 likes · 25 min read

DataFunTalk

Dec 14, 2022 · Big Data

Cloud‑Native Big Data Solutions for the Financial Industry: Architecture, Deployment, Scheduling, and Resource Management

This article explains why the financial sector is moving its big‑data workloads to cloud‑native platforms, compares cloud‑native systems with traditional Hadoop, describes deployment options such as Serverless YARN and Arcee Operator, and details the high‑performance GRO scheduler, agent, and ResLake resource‑lake architecture that together improve resource utilization, reduce costs, and ensure reliable, low‑latency processing for finance workloads.

Big Data · Cloud Native · resource scheduling

0 likes · 19 min read

Alibaba Cloud Native

Dec 11, 2022 · Cloud Native

How iQIYI Uses Dragonfly and Koordinator to Optimize Offline‑Online Mixed Workloads

This article details iQIYI's multi‑year journey of mixing offline and online workloads using Dragonfly and Koordinator, covering architectural evolution, key factors for successful co‑location, resource‑allocation strategies, the role of Anolis OS, pilot results, and future directions.

Anolis OS · Cloud Native · Koordinator

0 likes · 9 min read

DataFunSummit

Dec 10, 2022 · Big Data

Applying Apache Spark in Guanyuan Self-Service Analytics System: Architecture, Challenges, and Solutions

This presentation details how Guanyuan Data leverages Apache Spark within its self‑service analytics platform, covering product features, flexible deployment, resource isolation, performance challenges, architectural solutions, and future cloud‑native enhancements to support thousands of users and massive query workloads.

Apache Spark · Big Data · Data Platform

0 likes · 14 min read

Volcano Engine Developer Services

Nov 28, 2022 · Cloud Native

How ByteDance Built a Massive Cloud‑Native Big Data Platform to Power TikTok

ByteDance’s cloud‑native computing team, led by Li Yakun, details how they transformed a Hadoop‑centric big‑data stack into a Kubernetes‑driven platform—customizing storage, middleware, and scheduling—to support petabyte‑scale workloads, achieve over 40% resource utilization, and sustain rapid product growth.

Big Data · Cloud Native · Spark

0 likes · 17 min read

Alibaba Cloud Developer

Oct 20, 2022 · Cloud Native

Why Kubernetes Remains Complex and How Serverless Designs Aim to Simplify It

The article examines the inherent and accidental complexities of Kubernetes as a distributed cluster manager, discusses challenges in resource scheduling, infrastructure diversity, and operational overhead, and explores how cloud‑native solutions such as managed services, nodeless and serverless Kubernetes architectures attempt to reduce these complexities while introducing new trade‑offs.

Cloud Native · Kubernetes · Operations

0 likes · 18 min read

vivo Internet Technology

Oct 9, 2022 · Artificial Intelligence

vivo Machine Learning Platform: Architecture Design and Practice

vivo’s machine‑learning platform, built for its massive app‑store and e‑commerce ecosystem, streamlines data processing, model training, and deployment through quota‑based resource management, a custom ultra‑large‑scale TensorFlow‑vlps framework, OpenAPI‑driven training, and Jupyter‑integrated interactive development, boosting efficiency for billions of samples and features.

Distributed Training · MLOps · Machine Learning Platform

0 likes · 12 min read

DataFunSummit

Sep 25, 2022 · Big Data

Practical Optimizations and Resource Management of Hadoop YARN at Xiaomi

This article shares Xiaomi's internal practices of Hadoop YARN, covering scheduling and resource optimization, elastic scheduling, node overcommit handling, federation architecture, metadata warehouse construction, and future plans to improve cluster utilization and cost efficiency.

Big Data · Hadoop · YARN

0 likes · 20 min read

Bilibili Tech

Aug 27, 2022 · Cloud Native

Mixed Workload Co-location Practices in Bilibili's Kubernetes Cloud Platform

Bilibili’s Kubernetes cloud platform boosts server utilization by co‑locating latency‑sensitive online services with batch‑oriented offline jobs on the same nodes, using custom schedulers, extended resources, dynamic CPU/memory isolation, and a management console, achieving average CPU usage around 35 % and significant cost savings.

Cloud Native · Co-location · Kubernetes

0 likes · 17 min read

Meituan Technology Team

Aug 11, 2022 · Cloud Native

LAR: Load Auto-Regulator System for Resource Utilization and Service Quality

The article analyzes Meituan’s self‑designed Load Auto‑Regulator (LAR), detailing its tiered resource‑pool architecture, dynamic load‑to‑static‑resource mapping, and QoS mechanisms that together raise data‑center CPU utilization by 5‑10% while keeping online service quality stable, and discusses its deployment in online and mixed‑workload scenarios.

Cloud Native · Cluster Management · Kubernetes

0 likes · 28 min read

AntTech

Aug 1, 2022 · Cloud Native

Green Computing Strategies and Cloud‑Native Architecture at the 2022 China Computing Power Conference

In his Ant Group presentation at the 2022 China Computing Power Conference, He Zhengyu outlined how cloud‑native upgrades, time‑slice scheduling, AI‑driven capacity prediction, and mixed online‑offline deployment have dramatically improved server utilization, cut carbon emissions, and driven open‑source contributions toward sustainable computing.

AI elasticity · data center efficiency · green computing

0 likes · 10 min read

DataFunTalk

Jul 21, 2022 · Big Data

Large-Scale Offline‑Online Mixed Deployment at Huya: Architecture, Challenges, and Solutions

This article describes Huya's large‑scale offline‑online mixed deployment, detailing the low resource‑utilization problems, the time‑sharing and elastic scheduling solutions, the containerized architecture, multi‑datacenter isolation, heterogeneous resource handling, stability safeguards, and the resulting performance improvements and future directions.

Big Data · Huya · containerization

0 likes · 13 min read

Volcano Engine Developer Services

Jul 12, 2022 · Cloud Native

How ByteDance Scales Cloud‑Native Infrastructure with Hybrid K8s Scheduling

ByteDance’s cloud‑native ecosystem combines a multi‑layered architecture, dynamic resource over‑provisioning control, hybrid online‑offline scheduling, and federated cluster management to boost container utilization from 23% to 63%, reduce costs by 40%, and support massive events like the 2021 Spring Festival Gala.

Cloud Native · hybrid deployment · large scale

0 likes · 16 min read

Huawei Cloud Developer Alliance

May 30, 2022 · Cloud Computing

How a Young Huawei Engineer Saved Millions with Cloud Resource Scheduling Optimization

This interview follows 25‑year‑old Huawei Cloud engineer Tong Hao, who leveraged his competition‑honed algorithm skills to develop a universal resource‑re‑scheduling solver that fills “Tetris‑like” gaps in data‑center capacity, cutting operational costs by tens of millions of yuan while advancing cloud security and intelligent scheduling.

Career Development · algorithm engineering · cloud computing

0 likes · 10 min read

JD Retail Technology

Apr 27, 2022 · Industry Insights

How JD Achieves Seamless Stability During Massive Sales Events

The article reviews the Global Information System Stability Summit and JD's technical architect Li Junliang's detailed case study on the engineering practices, observability, chaos engineering, and resource‑scheduling innovations that enable JD’s e‑commerce platform to handle sales‑peak traffic that spikes to hundreds of times the normal load.

Observability · chaos engineering · e‑commerce

0 likes · 7 min read

ITPUB

Apr 27, 2022 · Artificial Intelligence

How 58’s WPAI Platform Boosted AI Resource Utilization by Over 50%

This article details the design and optimization of 58.com’s WPAI machine learning platform, covering background, training‑task scheduling, elastic inference scaling, offline‑online resource mixing, and model‑inference acceleration, and shows how these techniques collectively raised GPU usage by 51% and CPU usage by 38% while cutting costs.

AI Platform · GPU utilization · Inference Acceleration

0 likes · 26 min read

Zuoyebang Tech Team

Apr 26, 2022 · Cloud Native

How Serverless Kubernetes Virtual Nodes Cut Costs and Boost Scalability

Zhang's team at Zuoyebang details their journey to serverless Kubernetes virtual nodes, explaining how elastic scaling, fine-grained scheduling, and cost‑effective resource utilization transformed high‑peak online services, while addressing challenges in scheduling, observability, performance, and multi‑cloud resilience.

Cost Optimization · Kubernetes · Serverless

0 likes · 11 min read

DataFunTalk

Feb 17, 2022 · Cloud Native

ByteDance's Cloud‑Native Transformation of Its Machine Learning Platform

This article explains how ByteDance redesigned its machine‑learning platform using cloud‑native principles, detailing motivations, the shift from YARN to Kubernetes, the implementation of PS‑Worker and AllReduce frameworks, unified operators, heterogeneous resource scheduling, elastic training, and future directions for large‑scale AI workloads.

cloud-native · elastic-training · heterogeneous-compute

0 likes · 15 min read

Amap Tech

Jan 6, 2022 · Mobile Development

Three‑Year Full‑Chain Performance Optimization at Gaode Map: Strategies, Practices, and Results

Over three years Gaode Map halved overall latency by systematically identifying bottlenecks, applying reverse‑order targeted fixes, establishing forward‑order long‑term controls, and deploying adaptive resource scheduling, engine acceleration, H5 container enhancements, high‑performance components, and CI automation, resulting in sustainable core‑chain performance improvements and a better user experience.

Engineering · Gaode Map · Mobile

0 likes · 18 min read

Baidu Geek Talk

Jan 5, 2022 · Cloud Native

Baidu Cloud‑Native Mixed Workload (Offline Co‑location) Technology Overview

Baidu’s mixed‑workload approach co‑locates offline batch jobs with latency‑sensitive online services on shared nodes, using a dynamic resource view, priority‑based scheduling, cpuset/NUMA isolation, eBPF policies, and predictive profiling, boosting CPU utilization above 40% and saving billions of RMB in total cost of ownership.

Kubernetes · Mixed Workload · cloud-native

0 likes · 17 min read

Baidu Tech Salon

Dec 31, 2021 · Industry Insights

How Baidu Boosted CPU Utilization by Up to 80% with Offline Mixed‑Tenant Scheduling

This article analyzes Baidu's offline mixed‑tenant technology that combines online and offline workloads on the same physical servers, detailing the resource‑usage problems, dynamic resource views, priority schemes, isolation mechanisms, high‑performance scheduling, and future directions for cloud‑native clusters.

Cloud Native · Kubernetes · cpu-utilization

0 likes · 18 min read

Alibaba Terminal Technology

Dec 22, 2021 · Mobile Development

How Gaode Map Halved Core Link Latency: A 3‑Year Mobile Performance Overhaul

Over three years Gaode Map continuously optimized its full‑link performance, cutting core‑link latency by half through a systematic three‑step approach—identifying bottlenecks, reverse‑specialized problem solving, and long‑term forward control—resulting in dramatically improved user experience across diverse device tiers.

Gaode Map · Performance Optimization · engineering process

0 likes · 18 min read

DataFunTalk

Nov 2, 2021 · Artificial Intelligence

Optimizing AI Platform Resource Efficiency: Scheduling Strategies for Deep Learning Inference and Training

The article outlines a technical exchange hosted by 58.com AI Lab and Tianjin University that discusses high‑efficiency AI computing, resource‑aware scheduling for both online inference and offline training, and methods to mitigate GPU under‑utilization and gray‑interference in distributed deep‑learning platforms.

AI · GPU utilization · Inference

0 likes · 4 min read

Tencent Cloud Developer

Jun 21, 2021 · Industry Insights

How Hadoop YARN on Kubernetes Pods Supercharge Resource Utilization and Cut Costs

This article explains how Tencent Cloud EMR integrated Hadoop YARN with Kubernetes Pods to create a hybrid online‑offline deployment, implement elastic autoscaling and multi‑label resource allocation, and achieve several‑hundred‑percent improvements in CPU utilization while preserving cluster stability.

Big Data · Cloud Native · Hadoop

0 likes · 11 min read

DataFunTalk

Jun 13, 2021 · Artificial Intelligence

GPU Virtual Sharing for AI Inference Services on Kubernetes

The article presents a GPU virtual‑sharing solution for AI inference workloads that isolates memory and compute resources via CUDA API interception, integrates with Kubernetes using the open‑source aliyun‑gpushare scheduler, and demonstrates doubled GPU utilization and minimal performance loss across multiple tests.

CUDA · GPU virtualization · Kubernetes

0 likes · 16 min read

iQIYI Technical Product Team

May 28, 2021 · Artificial Intelligence

iQIYI GPU Virtual Sharing for AI Inference: Architecture, Isolation, and Scheduling

iQIYI created a custom GPU‑virtual‑sharing system that intercepts CUDA calls to enforce per‑container memory limits, rewrites kernel launches for compute isolation, and integrates with a Kubernetes scheduler extender, allowing multiple AI inference containers to share a single V100 with minimal overhead and more than doubling overall GPU utilization.

AI inference · CUDA · GPU virtualization

0 likes · 16 min read

Alibaba Cloud Developer

May 19, 2021 · Cloud Computing

How to Optimize Cloud Resource Scheduling After Migration

After migrating to the cloud, enterprises must evaluate resource scale, cost pressure, and staffing before deciding whether to build their own scheduling system, and can choose among ECS, Dedicated Host, or private pool solutions, each with specific advantages, drawbacks, and suitable scenarios.

Auto Scaling · capacity planning · dedicated host

0 likes · 15 min read

DataFunTalk

Mar 3, 2021 · Big Data

Kwai Scheduler: Scaling YARN for Ultra‑Large Clusters at Kuaishou

This article presents Kuaishou's large‑scale offline computing challenges and describes how the team customized YARN and built the Kwai scheduler to achieve multi‑threaded, pluggable resource scheduling for clusters of tens of thousands of nodes, supporting diverse workloads such as ETL, ad‑hoc queries, machine‑learning training, and real‑time Flink jobs.

Cluster Optimization · Kwai Scheduler · YARN

0 likes · 15 min read

Cloud Native Technology Community

Mar 3, 2021 · Operations

How Facebook Scales Millions of Servers with Twine: Inside Its Cluster Management Engine

This article explains how Facebook’s Twine system orchestrates containers across millions of servers, detailing its architecture, support for stateful services, cross‑data‑center control, elastic capacity handling, and the lessons learned from eight years of large‑scale operations.

Cluster Management · Facebook · Operations

0 likes · 15 min read

DataFunSummit

Feb 4, 2021 · Artificial Intelligence

Full-Stack Machine Learning Platform: Architecture, Key Factors, and Implementation Details

This article examines the evolution of user data, computing power, and models, and presents the design principles, key architectural factors, and practical implementation techniques for building a full‑stack machine learning platform that supports large‑scale data processing, distributed training, and low‑latency online serving.

Big Data Integration · Machine Learning Platform · data pipelines

0 likes · 15 min read

Amap Tech

Jan 15, 2021 · Mobile Development

Low-Cost Performance Optimization and Long-Term Control for Super Apps: Gaode Map Case Study

Gaode Map’s low‑cost, long‑term performance strategy for super‑apps combines an adaptive resource‑scheduling framework, full‑dimension monitoring, and closed‑loop control to cut startup time by over 70%, shrink memory use by 30%, and reduce binary size by 20%, delivering up to three‑fold speed gains on low‑end devices while preserving development efficiency.

Gaode Map · Super App · mobile app

0 likes · 13 min read

Suning Technology

Aug 20, 2020 · Cloud Computing

How Suning Scaled Cloud Resources for the 818 Mega‑Sale: 10% Server Utilization Boost

Suning Cloud leveraged micro‑scheduling, dynamic resource allocation, and AI‑driven security to handle the massive traffic of the 818 promotion, achieving a 10% increase in physical‑machine utilization while maintaining stability, cost efficiency, and robust protection against cyber attacks.

AI security · Dynamic Scaling · cloud computing

0 likes · 8 min read

Tongcheng Travel Technology Center

May 8, 2020 · Cloud Native

Design and Practices of Tongcheng‑Elong Container Platform: Cloud‑Native Architecture, Scheduling, and Resource Optimization

This article details Tongcheng‑Elong's journey from bare‑metal to a Kubernetes‑based cloud‑native platform, describing its architecture, the challenges of isolation, scheduling, resource utilization and promotion, and the engineering solutions—including custom scheduling, CPU binding, IP fixation, and over‑commit strategies—implemented to improve efficiency and reliability.

Kubernetes · container platform · resource scheduling

0 likes · 16 min read

Alibaba Cloud Developer

Jan 23, 2020 · Cloud Native

How Alibaba Cloud Tackles Bursty Peak Loads with Container‑Based Hybrid Deployment

Alibaba Cloud’s award‑winning solutions address bursty peak‑load challenges by integrating container‑based hybrid deployment, intelligent scheduling, and resource isolation, enabling massive e‑commerce events, gene‑computing tasks, and national ticketing systems to achieve high performance, low cost, and near‑zero incremental investment.

Alibaba Cloud · burst traffic · hybrid deployment

0 likes · 17 min read

Tencent Cloud Developer

Jan 9, 2020 · Fundamentals

TencentOS Kernel: Tencent Cloud's Open-Source Server OS

Tencent Cloud has open‑sourced its server‑grade operating system kernel, TencentOS Kernel, which offers cloud‑optimized resource scheduling, enhanced container isolation, ARM64 hot‑patching, and performance‑security optimizations that boost CPU utilization and lower operating costs, extending the TencentOS family after the tiny IoT release.

Operating Systems · Performance Optimization · open-source

0 likes · 11 min read

Big Data Technology & Architecture

Oct 20, 2019 · Cloud Native

Meituan-Dianping Kubernetes Cluster Management and Optimization Practices

This article details Meituan-Dianping's evolution from custom Docker‑based scaling to a Kubernetes‑driven, cloud‑native cluster management platform (HULK), describing its architecture, scheduler enhancements, Kubelet modifications, and resource‑optimization strategies for large‑scale operations.

Cloud Native · Cluster Management · Kubernetes

0 likes · 17 min read

Qunar Tech Salon

Aug 22, 2019 · Big Data

Performance Optimization Practices for Meituan's Hadoop YARN Fair Scheduler

This article details Meituan's experience optimizing the Hadoop YARN fair scheduler, covering background challenges, architectural components, resource abstractions, scheduling flow, performance metrics, a series of code‑level optimizations, stability strategies for production rollout, and future directions for large‑scale cluster scheduling.

Big Data · Fair Scheduler · Load Simulation

0 likes · 23 min read

Big Data Technology & Architecture

Apr 12, 2019 · Big Data

Weekly Knowledge Summary: Yarn Resource Scheduler, Hadoop Rack Awareness, HDFS Data Flow, and Small File Solutions

This weekly note shares personal updates and a concise technical overview covering YARN's resource scheduling, Hadoop's rack‑aware architecture, HDFS data flow, and practical solutions to the HDFS small‑file problem, along with links to further reading and upcoming work plans.

Big Data · HDFS · Hadoop

0 likes · 5 min read

JD Tech

Jan 17, 2019 · Operations

Technical Overview of JD's Archimedes Resource Scheduling System

The article presents a detailed technical analysis of JD's Archimedes project, describing its evolution from JDOS 2.0 to a large‑scale container scheduling platform that dramatically improves resource utilization, deployment speed, and cost efficiency across JD’s data centers.

AI · Big Data · JD

0 likes · 6 min read

Liulishuo Tech Team

Nov 29, 2018 · Cloud Native

Building an Efficient Machine Learning Training Platform on Kubernetes

This article describes how the Liulishuo algorithm team designed and implemented a Kubernetes‑based training platform that addresses the iterative, data‑intensive, and resource‑dynamic characteristics of machine learning workloads by pooling resources, enabling rapid provisioning, and optimizing scheduling and storage.

Cloud Native · Kubernetes · machine learning

0 likes · 9 min read

dbaplus Community

Nov 11, 2018 · Operations

How 360 Built an AI‑Powered Ops System to Cut Costs and Boost Efficiency

360’s AI‑ops team shares a year‑long journey of turning massive operational data into intelligent solutions—covering background, their AIOps philosophy, practical modules like capacity forecasting, host classification, resource reclamation, smart MySQL scheduling, anomaly detection, alarm reduction, and root‑cause analysis—to dramatically improve cost, efficiency, and reliability.

Capacity Forecasting · aiops · anomaly detection

0 likes · 16 min read

Alibaba Cloud Developer

Oct 23, 2018 · Operations

Unlocking Resource Efficiency: Alibaba’s Mixed‑Deployment (Co‑location) Strategy

This article explains how Alibaba’s mixed‑deployment (co‑location) technology combines online transaction services and offline compute workloads on shared physical servers, detailing its architecture, scheduling mechanisms, resource‑concession strategies, achieved performance gains, and future directions for large‑scale e‑commerce infrastructure.

Alibaba · Co-location · Operations

0 likes · 23 min read

360 Tech Engineering

Sep 29, 2018 · Operations

Design and Implementation of a Multi‑Task Scheduling System Based on Apache Mesos

This article describes how we identified underutilized CPU and memory resources in our company's servers, evaluated Kubernetes versus Apache Mesos, and built a non‑intrusive, Mesos‑based multi‑task scheduling system with dynamic resource reservation, monitoring, task isolation, and cluster‑wide observability, while addressing deployment challenges.

Cluster Management · Docker alternative · Mesos

0 likes · 11 min read

360 Zhihui Cloud Developer

Sep 26, 2018 · Cloud Computing

How We Built a Multi‑Task Scheduler with Mesos on Legacy Servers

This article explains how we leveraged Apache Mesos to create a multi‑task scheduling system that maximizes idle CPU and memory on legacy CentOS machines without kernel upgrades, detailing architecture, deployment, monitoring, resource isolation, and remaining challenges.

Cluster Management · Mesos · containerization

0 likes · 12 min read

Alibaba Cloud Developer

Jun 26, 2018 · Operations

How Scheduling Algorithms Power Efficient Data Center Resource Management

Scheduling algorithms are a crucial component of cluster resource management systems, determining where containerized tasks run to ensure resource needs, high availability, fault tolerance, and cost efficiency across individual containers, applications, and entire data centers, while also supporting Alibaba’s global scheduling challenge.

Cluster Management · Data center · algorithm competition

0 likes · 10 min read

360 Tech Engineering

May 2, 2018 · Operations

Applying Mesos and Docker Containerization in 360 Commercial Advertising System

This article details how 360's commercial advertising platform leverages Mesos and Docker containerization to solve data‑center migration, fault recovery, OS inconsistencies, and resource‑utilization challenges, describing the architecture, standardization, networking, storage, service discovery, and future plans.

Cloud Native · Docker · Mesos

0 likes · 22 min read

Efficient Ops

Mar 6, 2018 · Cloud Computing

How Alibaba Cuts Costs by 30% with Co‑Location Scheduling (Mix‑Deploy)

This article explains Alibaba's co‑location (混部, literally "mixed deployment") technology, which mixes online services and batch compute on the same physical servers, detailing its background, key characteristics, scheduling architecture, resource isolation mechanisms, cost‑saving formulas, and future roadmap, and showing how it boosts utilization and reduces expenses.

Alibaba · Co-location · cloud computing

0 likes · 15 min read

Alibaba Cloud Developer

Mar 5, 2018 · Operations

Boosting Test Environment Stability: Automated Container Replacement & Buffer Pools

This article analyzes the instability of Alibaba's test environment container provisioning, identifies root causes, and presents a comprehensive solution—including automatic container replacement, a buffer pool, and resource‑pool rationalization—that raised the container success rate to 99.9% and stabilized performance.

Operations · buffer pool · container orchestration

0 likes · 9 min read

JD Retail Technology

Feb 27, 2018 · Cloud Computing

JD.com's Archimedes: A Comprehensive Data Center Operating System

JD.com's Archimedes is a data center operating system that provides one-click scaling, intelligent scheduling, mixed deployment, and multi-site active-active capabilities, significantly improving resource utilization and reducing IT costs.

Archimedes · JD.com · container orchestration

0 likes · 5 min read

Alibaba Cloud Developer

Feb 12, 2018 · Operations

How Alibaba’s Co‑Location (Mixed‑Deployment) Cuts Costs and Boosts Utilization

Alibaba’s mixed‑deployment (Co‑location) technology combines online services and batch compute tasks on shared physical resources, using priority‑based scheduling, resource isolation, and dynamic memory management to dramatically improve CPU utilization, cut infrastructure costs, and maintain service level objectives during peak traffic events.

Co-location · Cost Optimization · cloud infrastructure

0 likes · 16 min read

Architecture Digest

Dec 23, 2017 · Cloud Computing

Design and Practices of an Elastic Computing Platform for Efficient Resource Utilization

This article describes the design, challenges, and operational practices of a cloud‑native elastic computing platform that reuses idle resources from production servers to support massive image compression, video transcoding, AI inference, and log processing while ensuring online services maintain performance, latency, and reliability.

OOM handling · Performance Monitoring · cloud infrastructure

0 likes · 13 min read

Alibaba Cloud Developer

Dec 22, 2017 · Cloud Computing

How Alibaba Scaled Double 11: Inside the Cloud Architecture Evolution

Over nine years, Alibaba transformed its Double 11 e‑commerce platform from a centralized system to a highly elastic, cloud‑native architecture, employing distributed design, multi‑active regions, unified scheduling, containerization with Pouch, and hybrid deployment to dramatically cut costs and boost peak throughput.

Alibaba · containerization · resource scheduling

0 likes · 19 min read

Alibaba Cloud Developer

Dec 19, 2017 · Operations

How Alibaba’s TPP Intelligent Scheduler Boosts Resource Utilization and Handles Double‑11 Traffic

The article details Alibaba's Taobao Personalization Platform (TPP) intelligent scheduling system, explaining its architecture, optimization algorithms, convergence logic, and performance results that dramatically improve CPU utilization and automate scaling during both regular operation and high‑traffic events like Double‑11.

Alibaba · Auto Scaling · cloud operations

0 likes · 21 min read

Alibaba Cloud Developer

Sep 6, 2017 · Operations

How Mixed Offline/Online Scheduling Boosted Alibaba’s Data Center Utilization by 30%

The article explains how rapid internet growth has expanded data centers and why traditional operations fall short, presents a simple utilization formula, describes Alibaba’s mixed offline‑online scheduling experiment that raised server usage from 10% to over 40%, and announces an open dataset for academic research.

Alibaba · Cluster Management · data center utilization

0 likes · 7 min read

21CTO

Aug 19, 2017 · Cloud Computing

How Tencent’s Elastic Platform Powers Billions of Daily Image Compressions with 6K Containers

Tencent’s elastic computing platform replaces 24,000 physical servers with just 6,000 containers, delivering sustainable compute for billions of daily image compressions while also supporting video transcoding, Spark jobs, and AI workloads through dynamic resource isolation, named services, and intelligent scheduling.

cloud infrastructure · container orchestration · elastic computing

0 likes · 8 min read

Architects' Tech Alliance

Mar 15, 2017 · Cloud Native

Docker Swarm on Apache Mesos: Architecture, Integration Guide, and Practical Considerations

This article explains the architecture of Docker Swarm, the reasons for running it on Apache Mesos, the integration process—including resource offers, task scheduling, and container creation—and discusses current limitations and future improvement directions.

Apache Mesos · Cloud Native · Cluster Management

0 likes · 12 min read

High Availability Architecture

Mar 7, 2017 · Cloud Native

Tencent Games’ 3‑Year Journey of Kubernetes Adoption and Optimization for Large‑Scale Online Gaming

This article details how Tencent Games built, customized, and continuously optimized a Kubernetes‑based container platform over three years to support tens of thousands of game containers, covering deployment modes, scheduler enhancements, network solutions, resource quotas, monitoring, storage, and the transition to micro‑service architectures.

Cloud Native · Kubernetes · Microservices

0 likes · 19 min read

Alibaba Cloud Developer

Aug 25, 2016 · Operations

Comparing Modern Data‑Center Schedulers: Borg, Mesos, Omega, Kubernetes & Zeus

This article examines resource allocation philosophies—auction, budgeting, and preemption—and compares the architectures, data models, and APIs of major schedulers such as Borg, Omega, Mesos, Kubernetes, and Alibaba’s Zeus, while also exploring sharing strategies, task classifications, utilization metrics, and predictive techniques for efficient resource management.

Borg · Cluster Management · Kubernetes

0 likes · 34 min read

Efficient Ops

Jul 26, 2016 · Cloud Computing

How China Mobile Zhejiang Built a Private Cloud with Mesos – Key Lessons

This article details China Mobile Zhejiang's journey from early virtualization to a full private‑cloud platform built on Mesos, covering why Mesos was chosen, the evolution of their cloud stages, DCOS implementation, automatic scaling, service discovery, and the operational benefits achieved.

DCOS · Mesos · private cloud

0 likes · 23 min read

MaGe Linux Operations

Jun 6, 2016 · Cloud Native

Borg, Omega, Mesos, Kubernetes vs Alibaba Zeus: Key Resource Scheduling Strategies

This article compares the resource allocation philosophies, architectural designs, data handling, and API models of Borg, Omega, Mesos, Kubernetes, and Alibaba's Zeus, discussing auction, budgeting, preemption, sharing models, task types, utilization, prediction, and practical implementation details for large‑scale cloud native environments.

Alibaba Zeus · Borg · Kubernetes

0 likes · 34 min read

Alibaba Cloud Infrastructure

May 4, 2016 · Cloud Computing

Alibaba Zeus Resource Scheduling System: Architecture, Virtualization, and Operational Practices

The article examines Alibaba's Zeus resource scheduling platform, detailing its background, problem analysis, container‑based virtualization, distributed architecture, strategies for improving resource utilization such as overselling and hybrid deployment, as well as stability measures and automation for large‑scale operations.

Alibaba · Operations · cloud computing

0 likes · 12 min read

High Availability Architecture

Apr 27, 2016 · Operations

Mesos Architecture and Its Deployment at Qunar: Framework Unification and Operational Strategies

This article explains the Mesos distributed system kernel, its master‑slave architecture, fine‑grained resource scheduling, and how Qunar leverages Mesos and Marathon for log processing, Spark, Alluxio, and multi‑tenant services while addressing framework unification, HA, service discovery, and operational challenges.

Cluster Management · Framework · Marathon

0 likes · 14 min read

Efficient Ops

Jun 17, 2015 · Cloud Native

Project Eru: Scaling a Custom Docker Orchestration Platform to 10k Nodes

Project Eru, a homegrown Docker‑based orchestration system developed at Mango TV, replaces earlier PaaS attempts with a stateless, scalable core and agent architecture, leveraging Redis clusters, MacVLAN networking, and fine‑grained CPU allocation to achieve rapid, automated scaling across thousands of containers.

Macvlan · container orchestration · redis

0 likes · 22 min read

High Availability Architecture

May 24, 2015 · Cloud Native

Design and Implementation of Project Eru: A Docker‑Based Cloud Native Scheduling Platform at Mango TV

The article recounts the evolution from Douban's App Engine to Mango TV's Nebulium Engine and finally Project Eru, describing how Docker, Redis Cluster, MacVLAN networking, and custom resource scheduling were combined to build a scalable, cloud‑native platform for heterogeneous workloads.

Cloud Native · Docker · Macvlan

0 likes · 24 min read
