Alibaba Cloud Infrastructure

For uninterrupted computing services

353 Articles · 0 Likes · 936 Views · 0 Comments
Recent Articles

Latest from Alibaba Cloud Infrastructure

Alibaba Cloud Infrastructure
Dec 17, 2025 · Cloud Native

AI Training Revives Gang Scheduling in Kubernetes for Elastic Resource Orchestration

The article examines how the rise of large‑model AI training has revived the need for gang scheduling in Kubernetes, contrasting the rigid resource requirements of HPC‑style workloads with cloud‑native elasticity. It traces the historical evolution of gang scheduling, surveys current implementations, and outlines future directions for more flexible, high‑throughput compute orchestration.

AI training · Gang Scheduling · Kubernetes
0 likes · 22 min read
Alibaba Cloud Infrastructure
Dec 9, 2025 · Cloud Native

How to Detect and Resolve Kernel Memory & CPU Latency in Kubernetes Clusters

In cloud‑native Kubernetes environments, resource over‑commit and mixed deployments can trigger kernel‑level memory reclaim and CPU scheduling delays that surface as application jitter. This article explains how to visualize, diagnose, and remediate these delays using the SysOM exporter and related metrics.

CPU scheduling · Kernel latency · Kubernetes
0 likes · 13 min read
Alibaba Cloud Infrastructure
Nov 25, 2025 · Operations

How to Uncover Hidden Java Memory Leaks in Kubernetes Pods

This article explains why Java applications running in cloud containers are often OOMKilled, details the hidden memory consumed by JNI, libc, and Transparent Huge Pages, and demonstrates step by step how to use the memory panorama analysis in Alibaba Cloud OS Console to identify and mitigate the root causes.

Kubernetes · Memory Leak · Pod OOM
0 likes · 11 min read
Alibaba Cloud Infrastructure
Nov 12, 2025 · Operations

How Alibaba Cloud’s One‑Click IO Diagnosis Solves Multi‑Tenant Performance Bottlenecks

The article explains how Alibaba Cloud’s OS Console implements one‑click IO diagnosis, which automatically detects, classifies, and resolves high‑latency, burst, and iowait IO issues in multi‑tenant cloud environments using dynamic thresholds, periodic metric collection, and targeted root‑cause analysis.

Alibaba Cloud · IO diagnostics · Performance Monitoring
0 likes · 11 min read
Alibaba Cloud Infrastructure
Nov 10, 2025 · Cloud Native

Koordinator v1.7.0 Brings Network‑Aware Scheduling and Job‑Level Preemption for AI Workloads

Koordinator v1.7.0, the open‑source Kubernetes scheduler, adds network‑topology‑aware scheduling, job‑level preemption, and support for Ascend NPU and Cambricon MLU. The release also delivers unified heterogeneous device management, enhanced GPU sharing, comprehensive API documentation, and best‑practice guides aimed at improving large‑scale AI training efficiency and cluster operations.

AI training · Heterogeneous Devices · Job Preemption
0 likes · 17 min read
Alibaba Cloud Infrastructure
Nov 5, 2025 · Cloud Computing

How Alibaba Cloud Accelerated IPv6 Adoption to Power AI-Driven Networks

At the 4th China IPv6 Innovation Development Conference, Alibaba Cloud showcased its rapid IPv6 deployment: raising IPv6's share of IDC traffic from 12% to 40%, extending IPv6 support to 97% of its cloud products, and using IPv6/SRv6 to meet the network demands of AI large models. It also shared best practices with industry and government partners.

AI · Alibaba Cloud · IPv6
0 likes · 5 min read