Tag: Resource Scheduling


360 Zhihui Cloud Developer
Feb 17, 2025 · Cloud Native

Optimizing Offline Pod Scheduling with Koordinator and Yarn-Operator

To reduce resource contention and improve offline task reliability, this article examines the challenges of using Koordinator with Hadoop YARN pods on Kubernetes, proposes real‑time resource reporting and task‑level eviction strategies, details community and custom solutions, and outlines future enhancements with Volcano.

Big Data · Koordinator · Kubernetes
0 likes · 9 min read
Alimama Tech
Feb 12, 2025 · Artificial Intelligence

HighService: A High‑Performance Pythonic AI Service Framework for Model Inference and Global Resource Scheduling

HighService, Alibaba's Pythonic AI service framework, accelerates large‑model inference and maximizes GPU utilization by separating CPU and GPU processes and offering out‑of‑the‑box quantization, parallelism, and caching; its master‑worker scheduler dynamically reallocates idle GPUs across clusters to keep online latency low while boosting offline throughput for diffusion and LLM workloads.

AI Service · Model Inference · Python
0 likes · 16 min read
DataFunSummit
Feb 6, 2025 · Big Data

Migrating Big Data Workloads to Cloud‑Native Kubernetes: Challenges, Solutions, and Lessons from OPPO

This article describes how OPPO's big‑data team transitioned from traditional IDC and EMR environments to a cloud‑native Kubernetes architecture, detailing the motivations, design principles, elastic scaling challenges, custom solutions, and future directions for large‑scale data processing on the cloud.

Big Data · Kubernetes · Multi-Cloud
0 likes · 18 min read
Bilibili Tech
Jan 24, 2025 · Operations

Design and Implementation of a CDN Edge‑Node Scheduling System for Bilibili Live Streaming

This article presents Bilibili's multi‑layer CDN edge‑node scheduling system, which groups heterogeneous nodes by quality and price and applies cost‑aware and resource‑aware heuristics, including maximum‑flow regional borrowing and contextual‑bandit utilization prediction, to allocate bandwidth per business line, achieving a 43% increase in bandwidth reuse, a 33% coverage boost, and markedly lower stall rates.

Bilibili · Cost Optimization · Live Streaming
0 likes · 10 min read
Xiaohongshu Tech REDtech
Jan 16, 2025 · Cloud Native

Xiaohongshu Large-Scale Cloud-Native Mixed Deployment and Elasticity Practices

Xiaohongshu's cloud‑native team, with over 90% of its services containerized, introduced resource‑pooled mixed deployment, fine‑grained unified scheduling, and an elastic container pool with global HPA and cluster autoscaling, driving 35% of resources into mixed use, reclaiming tens of millions of core‑hours daily, and saving roughly 30% in costs while preparing for hybrid‑cloud expansion and FinOps.

Containerization · Operating System · Resource Scheduling
0 likes · 7 min read
AntTech
Nov 22, 2024 · Cloud Native

Large-Scale Cloud‑Edge Collaborative Key Technologies and Applications Based on Cloud‑Native Architecture Wins Zhejiang Province 2023 Scientific and Technological Progress Award

The award‑winning cloud‑native large‑scale cloud‑edge collaboration project, developed by Alipay, Zhejiang University, Xieyun Technology, and Alibaba Cloud, delivers unified resource scheduling for millions of heterogeneous devices; it has achieved significant performance gains, produced numerous patents, papers, and standards, and generated substantial economic benefits across multiple industries.

Alipay · Edge Computing · Resource Scheduling
0 likes · 4 min read
360 Zhihui Cloud Developer
Nov 18, 2024 · Cloud Computing

How Dynamic Resource Scheduling Boosts OpenStack Efficiency and Cuts Costs

Virtualization resource scheduling algorithms, especially in OpenStack, address fragmented CPU allocation and uneven node utilization by dynamically consolidating VMs, employing NUMA-aware placement, and using resource scoring to trigger migrations, ultimately improving utilization, reducing costs, and enhancing performance in cloud environments.

NUMA · OpenStack · Resource Scheduling
0 likes · 12 min read
360 Zhihui Cloud Developer
Oct 15, 2024 · Cloud Computing

How 360’s OpenStack Scheduler Optimizes Multi‑Cloud Resource Allocation

This article explains how 360’s cloud platform uses a three‑layer architecture and Nova‑scheduler to manage thousands of servers and tens of thousands of VMs across multiple OpenStack clusters, detailing scheduling policies, resource‑pool handling, current challenges, and future improvement plans.

Multi-Cloud · OpenStack · Resource Scheduling
0 likes · 10 min read
Cloud Native Technology Community
Aug 28, 2024 · Cloud Native

Kubernetes 1.31 Introduces the Alpha ‘distribute-cpus-across-cores’ Option in CPUManager Static Policy

Kubernetes 1.31 adds an alpha‑stage ‘distribute-cpus-across-cores’ option to the CPUManager static policy, allowing CPUs to be spread across physical cores for better cache locality, reduced contention, and improved performance in multi‑core and performance‑sensitive workloads.
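For readers who want to try it, the option is enabled through the kubelet configuration. The fragment below is a minimal sketch based on the upstream KubeletConfiguration API; cluster‑specific fields are omitted, and because the option is alpha it must be unlocked with the CPUManagerPolicyAlphaOptions feature gate. Note that exclusive CPU assignment under the static policy still applies only to Guaranteed‑QoS pods requesting integer CPUs.

```yaml
# KubeletConfiguration fragment (sketch): enable the alpha
# distribute-cpus-across-cores option of the static CPU Manager policy.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CPUManagerPolicyAlphaOptions: true   # required while the option is alpha
cpuManagerPolicy: static               # policy options only apply to "static"
cpuManagerPolicyOptions:
  distribute-cpus-across-cores: "true" # spread exclusive CPUs across physical cores
```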

CPUManager · Kubernetes · Resource Scheduling
0 likes · 7 min read
Architecture & Thinking
Jan 14, 2024 · Artificial Intelligence

How Baidu Scales Content Understanding to Trillions of Pages with AI Engineering

This article explains how Baidu processes internet‑scale content by applying deep AI‑driven understanding, detailing cost‑optimization, efficiency improvements, model‑service frameworks, resource‑scheduling systems, and batch‑compute platforms that together enable trillion‑level indexing and feature extraction.

AI Engineering · HTAP storage · Resource Scheduling
0 likes · 16 min read
360 Smart Cloud
Jan 10, 2024 · Cloud Native

Mixed Workload Scheduling (混部) in Kubernetes: Challenges, Core Technologies, and Koordinator Enhancements

The article analyzes low CPU utilization in pure online Kubernetes clusters, introduces mixed‑workload (online + offline) scheduling to improve resource efficiency, explains core techniques, kernel QoS features, and details Koordinator‑based implementations such as node resource reservation and scheduling adjustments.
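As a concrete illustration of how an offline pod opts into this mixed‑workload model, the sketch below shows a Koordinator best‑effort (BE) batch pod that consumes the dynamically oversold batch resources koordlet reports from each node's idle capacity. Names follow the Koordinator documentation; the image and resource amounts are illustrative assumptions.

```yaml
# Sketch of a Koordinator best-effort (BE) batch pod: scheduled by
# koord-scheduler against the oversold kubernetes.io/batch-* resources.
apiVersion: v1
kind: Pod
metadata:
  name: offline-batch-task
  labels:
    koordinator.sh/qosClass: BE        # colocation QoS: best-effort offline
spec:
  schedulerName: koord-scheduler       # Koordinator's scheduler
  priorityClassName: koord-batch       # batch priority tier
  containers:
  - name: worker
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        kubernetes.io/batch-cpu: "1000"   # milli-cores of reclaimed CPU
        kubernetes.io/batch-memory: 1Gi   # reclaimed memory
      limits:
        kubernetes.io/batch-cpu: "1000"
        kubernetes.io/batch-memory: 1Gi
```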

Koordinator · Kubernetes · Mixed Workload
0 likes · 13 min read
Xiaohongshu Tech REDtech
Nov 27, 2023 · Cloud Native

Mixed-Workload Scheduling and Resource Utilization Optimization in Xiaohongshu's Cloud-Native Platform

Xiaohongshu's cloud‑native platform adopted a four‑stage mixed‑workload scheduling strategy (reusing idle nodes, whole‑machine time‑sharing, normal mixed pools, and a unified scheduler, Tusker, that coordinates CPU, GPU, and memory across Kubernetes and YARN), boosting average cluster CPU utilization from under 20% to over 45% and delivering millions of low‑cost core‑hours while preserving QoS for latency‑sensitive, mid‑tier, and batch jobs.

Big Data · CPU utilization · Kubernetes
0 likes · 19 min read
iQIYI Technical Product Team
Nov 17, 2023 · Big Data

Mixed Workload Co-location of Big Data and Online Services at iQIYI: Design, Implementation, and Results

iQIYI's mixed‑workload system colocates Spark/Hive big‑data jobs with online video services by running YARN NodeManagers inside Kubernetes, using an Elastic YARN Operator, Koordinator‑driven CPU oversubscription, and remote shuffle, boosting online‑cluster CPU utilization from roughly 9% to over 40% and saving tens of millions of RMB annually.

Big Data · Kubernetes · Mixed Workload
0 likes · 19 min read
Didi Tech
Oct 19, 2023 · Cloud Native

Design and Implementation of a New Tiered Resource Guarantee System for Elastic Cloud Containers

The new tiered resource‑guarantee system for Didi’s elastic cloud containers defines S, A, and B priority levels with explicit over‑commit rules, upgrades OS, Kubernetes, kube‑odin, service‑tree, and CMP components, and thereby cuts CPU contention by up to 80%, reduces latency, improves scaling reliability, and lowers operational costs.

Container Management · Kubernetes · Overcommit
0 likes · 16 min read
Didi Tech
Oct 12, 2023 · Cloud Computing

Elastic Cloud Mixed Deployment: Architecture, Scheduling, Isolation, and Future Directions

Didi's Elastic Cloud uses mixed deployment to co‑locate diverse services, employing tiered guarantees, custom Kubernetes scheduling, profiling, rescheduling, and isolation‑cluster techniques to boost utilization while preserving QoS, with a roadmap for broader automation and interference detection.

Dynamic Scaling · Kubernetes · Resource Scheduling
0 likes · 25 min read
DataFunSummit
Aug 25, 2023 · Big Data

Big Data Meets Cloud Native: Tencent's Cloud‑Native Big Data Architecture, Challenges, and Practices

This article explores how Tencent integrates big data with cloud‑native technologies, detailing the evolution, opportunities, challenges, the peak‑range (FengLuan) architecture, engine and scheduling layers, mixed‑workload strategies, runtime optimizations, and future directions for large‑scale data platforms.

Big Data · Kubernetes · Resource Scheduling
0 likes · 17 min read
High Availability Architecture
May 26, 2023 · Big Data

Amiya: Dynamic Overcommit Component for Bilibili Offline Big Data Cluster Resource Scheduling

This article introduces Amiya, a self‑developed overcommit component that dynamically increases YARN memory and vCore capacity on Bilibili's offline big‑data clusters, details its architecture and the key implementation of its overcommit, eviction, and mixed‑deployment strategies, and evaluates its impact on resource utilization.

Big Data · Cluster Management · Overcommit
0 likes · 22 min read
DataFunTalk
May 25, 2023 · Artificial Intelligence

Optimizing Distributed Cache for Large-Scale Deep Learning Training with Alluxio and SiloD

This article examines the storage bottlenecks in large‑scale AI training, evaluates local‑disk and Alluxio‑based distributed caching strategies, proposes uniform cache eviction and replica‑aware global policies, and introduces the SiloD framework for coordinated compute‑storage scheduling to dramatically improve GPU utilization and overall cluster throughput.

AI training · Alluxio · Cache Eviction
0 likes · 16 min read
High Availability Architecture
Apr 3, 2023 · Cloud Native

Design and Implementation of Punica: A One‑Stop, Unattended AI Inference Platform

The article describes Punica, a cloud‑native, function‑as‑a‑service platform that unifies content‑understanding inference services through a one‑stop portal and unattended operations, improving deployment speed, resource utilization, and reducing manual effort for AI model serving.

AI inference · FaaS · Resource Scheduling
0 likes · 13 min read
Baidu Geek Talk
Mar 29, 2023 · Cloud Native

Punica: A Cloud‑Native Platform for Content Understanding Inference Services

Punica provides a cloud‑native, one‑stop platform that unifies Baidu’s content‑understanding inference services, automates testing, resource provisioning, and monitoring, and enables unattended, self‑healing operations with dynamic scaling and GPU scheduling, cutting onboarding time by half and reclaiming hundreds of GPUs.

AI inference · Inference Platform · Resource Scheduling
0 likes · 14 min read