Alibaba Cloud Infrastructure
Author

For uninterrupted computing services

353 articles · 0 likes · 936 views · 0 comments
Recent Articles

Latest from Alibaba Cloud Infrastructure

100 recent articles max
Alibaba Cloud Infrastructure
Apr 28, 2025 · Cloud Native

Improving OSS Small‑File Access Performance with StrmVol Storage Volumes in Kubernetes

StrmVol storage volumes replace the FUSE‑based OSS mount with a virtual block device and a kernel‑mode file system, reducing latency when Kubernetes workloads such as AI training jobs read massive numbers of small files. The article demonstrates setup, configuration, and performance testing using Argo Workflows.

Argo Workflows · CSI · Kubernetes
0 likes · 13 min read
Alibaba Cloud Infrastructure
Apr 25, 2025 · Fundamentals

Alibaba Network Proposal OSFP MSA Passes Unanimously, Introducing the First Liquid‑Cooled OSFP Cage Standard

A split‑type OSFP cage proposal from Alibaba Cloud’s infrastructure network team was unanimously approved by the OSFP MSA committee, becoming the first standard to support liquid‑cooled OSFP cold plates. It offers a low‑cost, easy‑to‑assemble solution to the growing power‑consumption challenges of high‑density AI switches.

AI Switches · Hardware Standard · Liquid cooling
0 likes · 5 min read
Alibaba Cloud Infrastructure
Apr 18, 2025 · Artificial Intelligence

Alibaba Cloud Showcases Optical Interconnect Innovations at OFC 2025 50th Anniversary

At the OFC 2025 50th anniversary in San Francisco, Alibaba Cloud presented cutting‑edge optical interconnect research and solutions for AI computing and modern data‑center networks, highlighted by invited talks, breakthrough demos, and two data‑driven QoT estimation papers co‑authored with Hong Kong Polytechnic University.

AI computing · Data Center · Photonic Integration
0 likes · 6 min read
Alibaba Cloud Infrastructure
Apr 17, 2025 · Cloud Native

OpenKruise 1.8 Release Highlights: In‑Place VPA, StatefulSet Volume Expansion, AI WorkloadSpread, Serverless Probe, SidecarSet Gray‑Release, and Helm Pre‑Delete Hook

OpenKruise 1.8, the latest release of the CNCF‑incubated cloud‑native automation suite, introduces in‑place vertical pod autoscaling, native StatefulSet volume expansion, AI‑aware WorkloadSpread, serverless probe support, sidecar gray‑release capabilities, and a Helm pre‑delete safety hook. The article backs these features with detailed YAML examples and outlines the future roadmap.

InPlaceVPA · Kubernetes · OpenKruise
0 likes · 13 min read
Alibaba Cloud Infrastructure
Apr 16, 2025 · Artificial Intelligence

Optimizing Multi‑Node Distributed LLM Inference with ACK Gateway and vLLM

This article presents a step‑by‑step guide for deploying and optimizing large‑language‑model inference across multiple GPU‑enabled nodes using ACK Gateway with Inference Extension, vLLM’s tensor‑ and pipeline‑parallel techniques, and Kubernetes resources such as LeaderWorkerSet, PVCs, and custom routing policies, followed by performance benchmarking and analysis.

ACK Gateway · Distributed inference · Kubernetes
0 likes · 19 min read
Alibaba Cloud Infrastructure
Apr 14, 2025 · Operations

Process Hotspot Tracing and Performance Analysis with SysOM

This article explains the concept of process hotspot tracing, analyzes common performance pain points in cloud‑native environments, and details SysOM's solution, including stack unwinding, symbol resolution, flame‑graph generation, and real‑world case studies, to help developers and operators quickly locate and resolve system bottlenecks.

Performance Analysis · SysOM · eBPF
0 likes · 17 min read
Alibaba Cloud Infrastructure
Mar 18, 2025 · Cloud Native

Gray Release of LoRA and Base Models Using ACK Gateway with AI Extension on Kubernetes

This guide explains how to deploy large language model inference services on a GPU-enabled Kubernetes cluster, configure ACK Gateway with AI Extension for intelligent routing and load balancing, and perform gray releases for both LoRA fine‑tuned models and base models such as QwQ‑32B and DeepSeek‑R1, including step‑by‑step commands and validation procedures.

ACK Gateway · AI inference · Kubernetes
0 likes · 25 min read