Tag

GPU scheduling

Articles collected around this technical thread.

360 Zhihui Cloud Developer
May 15, 2025 · Cloud Native

How 360’s AI Platform Boosted GPU Utilization with Volcano Scheduler

360’s AI platform migrated its GPU clusters to a cloud‑native architecture and adopted the Volcano scheduler, achieving over 45% GPU utilization, less than 7% fragmentation, and more than 1,000,000 scheduled Pods, while leveraging flexible plugins, hierarchical queues, and resource pooling to optimize AI and big‑data workloads.
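The hierarchical queues and resource pooling in the summary map onto Volcano's Queue and PodGroup CRDs. A minimal sketch of the two objects (queue name, weight, and capability values are illustrative, not 360's actual configuration):

```yaml
# A weighted queue that caps the GPUs its workloads may consume.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ai-training          # illustrative name
spec:
  weight: 4                  # share of the pool relative to sibling queues
  capability:
    nvidia.com/gpu: 8        # hard cap on GPUs for this queue
---
# A PodGroup gang-schedules a job: no pod starts until minMember can be placed.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: resnet-job           # illustrative name
spec:
  minMember: 4
  queue: ai-training
```

Gang scheduling through `minMember` is what keeps partially placed distributed jobs from stranding GPUs, which is one way fragmentation stays low.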

AI Platform · Cloud Native · GPU scheduling
13 min read
Alibaba Cloud Infrastructure
Mar 4, 2025 · Cloud Native

Koordinator v1.6 Release: Advanced Heterogeneous Device Scheduling and GPU Management Features

The Koordinator v1.6 release introduces a suite of innovations—including GPU topology‑aware scheduling, end‑to‑end GPU & RDMA joint allocation, strong GPU isolation, differentiated GPU scoring, fine‑grained resource reservation, mixed‑workload QoS, and extensive scheduler and rescheduler optimizations—to efficiently manage heterogeneous resources in Kubernetes clusters for AI and high‑performance computing workloads.

Cloud Native · GPU scheduling · Heterogeneous Resources
24 min read
Java Tech Enthusiast
Jan 9, 2025 · Cloud Native

Configuring NVIDIA Docker Plugin and GPU Access in Kubernetes

This guide walks through installing the NVIDIA container toolkit, configuring Docker to use the NVIDIA runtime, verifying GPU access, deploying the NVIDIA device plugin in Kubernetes, labeling GPU nodes, and running a GPU‑accelerated FFmpeg pod to confirm successful GPU integration.
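The verification step the guide describes boils down to a small manifest. A sketch of a GPU test pod, assuming the NVIDIA device plugin is deployed and GPU nodes carry a label like the guide applies (the `gpu: "true"` label key and the image tag are assumptions):

```yaml
# Minimal pod to confirm GPU access through the device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  nodeSelector:
    gpu: "true"                                # assumed node label; match your cluster's
  containers:
  - name: smi
    image: nvidia/cuda:12.2.0-base-ubuntu22.04 # any CUDA base image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                      # advertised by the NVIDIA device plugin
```

If `kubectl logs gpu-test` prints the `nvidia-smi` table, the toolkit, runtime, and device plugin are wired correctly; swapping in an FFmpeg image as the article does exercises the NVENC/NVDEC path as well.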

Container Toolkit · Docker · GPU
12 min read
ByteDance Cloud Native
Aug 9, 2023 · Cloud Native

How Volcano Engine’s New GPU Sharing Scheduler Boosts AI Workloads by 500%

This article explains Volcano Engine’s next‑generation GPU sharing scheduling technology, detailing the two‑layer scheduler, card‑level bin‑pack/spread strategies, system architecture, API definitions, and optimization algorithms that together increase GPU deployment density by over 500% and improve utilization by more than 50% for AI workloads.
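The card‑level bin‑pack and spread strategies can be illustrated with two scoring functions. This is a simplified sketch of the idea only, not Volcano Engine's implementation; the scoring formulas and card sizes are assumptions:

```python
def binpack_score(free, request, capacity):
    """Prefer the card that is fullest after placement (packs work tightly)."""
    if request > free:
        return -1  # card cannot host this request
    return (capacity - free + request) / capacity

def spread_score(free, request, capacity):
    """Prefer the card that stays emptiest after placement (spreads work out)."""
    if request > free:
        return -1
    return (free - request) / capacity

def pick_card(cards, request, strategy):
    """cards: list of (card_id, free, capacity); returns best card id or None."""
    score = binpack_score if strategy == "binpack" else spread_score
    scored = [(score(free, request, cap), cid) for cid, free, cap in cards]
    best = max(scored)
    return best[1] if best[0] >= 0 else None
```

Bin‑pack drives up density (and hence utilization) by filling one card before touching the next; spread trades density for lower interference between co‑located workloads.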

Cloud Native · GPU scheduling · Kubernetes
13 min read
Baidu Tech Salon
Mar 29, 2023 · Artificial Intelligence

Punica System: Enhancing AI Inference Service Efficiency Through FaaS Architecture

The Punica system unifies AI inference development, testing, deployment, and maintenance on a FaaS‑based one‑stop platform that automates resource scheduling, self‑healing, and monitoring, supporting multiple frameworks and GPUs, thereby doubling onboarding speed, quintupling scaling efficiency, and reclaiming hundreds of GPU cards.

AI inference · Cloud Native · Container Framework
13 min read
DataFunSummit
Apr 26, 2022 · Artificial Intelligence

Elastic Distributed Training at Huya: Design, Implementation, and Results

This talk describes Huya’s elastic distributed training system, covering the motivation behind elasticity, its design using Kubernetes and etcd for dynamic node registration and scaling, implementation details of the EFDL framework, performance evaluations on ResNet‑50, and the resulting benefits and future directions.

AI Platform · GPU scheduling · Huya
11 min read
DataFunTalk
Apr 23, 2022 · Artificial Intelligence

Elastic Distributed Training at Huya: Design, Implementation, and Results

This article describes Huya's elastic distributed training system, explaining why elasticity is needed, the architectural design using Kubernetes and etcd, the dynamic scaling process, performance evaluations on ResNet‑50, and future improvements for more efficient and reliable AI model training.

AI Platform · GPU scheduling · Kubernetes
10 min read
58 Tech
Nov 20, 2020 · Artificial Intelligence

Evolution and Practice of the 58.com AI Algorithm Platform (WPAI)

The article details the development, architecture, and optimization of 58.com’s AI algorithm platform (WPAI), covering its background, overall design, large‑scale distributed machine learning, deep‑learning platform features, inference performance enhancements, GPU resource scheduling improvements, and future directions.

AI Platform · GPU scheduling · Kubernetes
15 min read
360 Tech Engineering
Nov 30, 2018 · Operations

Deploying nvidia-docker2 for GPU Workloads on Large‑Scale Kubernetes Clusters

This article details the practical steps to install nvidia-docker2, configure Docker’s runtime, enable GPU support via Kubernetes device plugins, and verify GPU scheduling on a large Kubernetes cluster, providing code snippets and best‑practice recommendations for production environments.
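The Docker runtime step centers on a single file. The standard `/etc/docker/daemon.json` that nvidia-docker2 installs looks like this (restart the Docker daemon after editing it):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

Setting `default-runtime` routes every container through `nvidia-container-runtime`, which the Kubernetes device plugin relies on; omit that key to opt in per container with `--runtime=nvidia` instead.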

Docker · GPU · GPU scheduling
8 min read
360 Zhihui Cloud Developer
Sep 14, 2017 · Artificial Intelligence

Running TensorFlow on Kubernetes: A Practical Guide to Scalable AI Workloads

This article explains how to deploy TensorFlow on Kubernetes, addressing resource isolation, GPU scheduling, and distributed training challenges by introducing a custom TensorFlow‑on‑K8s system with client, task, and autospec modules, plus container design for reliable job execution.

AI deployment · GPU scheduling · Kubernetes
9 min read