How ByteDance Scales AI Workloads with Ray, KubeRay, and Kueue
This article explains why Ray is popular among AI researchers, how ByteDance hosts Ray applications with KubeRay, and how Kueue manages and schedules RayJob workloads. It covers Ray's architecture, KubeRay's components, real-world use cases, and job-scheduling strategies.
What is Ray
Ray originated from UC Berkeley's RISELab as a general-purpose distributed programming framework that helps users quickly parallelize their programs. Ray Core provides low-level distributed primitives, remote functions (tasks) and remote classes (actors), while the higher-level Ray AIR offers AI-specific libraries.
Ray's GitHub repository now has over 27K stars, and its creators founded Anyscale to steward the open-source community and build commercial products. At Ray Summit 2023, companies including OpenAI, Uber, Amazon, ByteDance, and Ant Financial reported using Ray. Anyscale also offers commercial LLM products built on Ray, emphasizing cost efficiency and ease of use.
Ray’s ecosystem breaks the traditional AI pipeline silos (Spark for data, Torch DDP/MPI for training, deployment services for inference) by allowing data processing, model training, and serving to be expressed within a single framework.
ByteDance KubeRay + Ray Application Practice
KubeRay Introduction
KubeRay is an open-source toolkit for deploying and integrating Ray on Kubernetes, led by ByteDance's engineering team with contributions from Anyscale, Ant Financial, Microsoft, and others. It has become the de-facto standard for running Ray on Kubernetes.
Without it, deploying Ray directly on physical machines requires manual IP/port configuration and complex scaling, and forgoes Kubernetes-native capabilities such as monitoring, alerting, Ingress, and HPA/VPA.
RayCluster
RayCluster is a custom resource definition (CRD) that builds and manages a Ray cluster. It provides pod recovery, cluster‑level hot updates, and integrates with the Ray autoscaler for dynamic scaling based on load, reducing cost while maintaining high availability.
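A minimal RayCluster manifest might look like the following. The field names follow the KubeRay CRD; the image tag, group name, and resource sizes are illustrative.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: demo-cluster
spec:
  enableInTreeAutoscaling: true      # let the Ray autoscaler add/remove workers
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
  workerGroupSpecs:
    - groupName: workers
      replicas: 2
      minReplicas: 0
      maxReplicas: 10                # autoscaler scales within these bounds
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
```

The `minReplicas`/`maxReplicas` bounds are what the autoscaler integration works against; pod recovery and hot updates are handled by the operator reconciling this spec.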
RayJob
RayJob is a CRD for submitting and tracking jobs on a companion Ray cluster. It supports batch scheduling, creates or reuses clusters, updates job status, and cleans up clusters after completion. ByteDance added timeout handling and node‑count waiting features.
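A RayJob sketch, again with illustrative values (ByteDance's internal extensions such as node-count waiting are not part of the open-source CRD shown here):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: demo-job
spec:
  entrypoint: python train.py        # command submitted to the Ray cluster
  shutdownAfterJobFinishes: true     # tear the cluster down on completion
  activeDeadlineSeconds: 3600        # job timeout, available in recent KubeRay releases
  rayClusterSpec:                    # cluster created for (or reused by) this job
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
```

Setting `shutdownAfterJobFinishes: true` gives the "create, run, clean up" lifecycle described above; omitting it keeps the cluster alive for debugging.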
RayService
RayService deploys Ray Serve applications to a cloud‑native environment, exposing the serve agent via a Service for seamless traffic routing and supporting hot updates through rolling cluster updates.
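For completeness, a RayService sketch: the CRD wraps a Serve config (`serveConfigV2`) plus a cluster spec, and the operator performs rolling cluster updates for zero-downtime upgrades. The application name and import path below are hypothetical.

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: demo-service
spec:
  serveConfigV2: |
    applications:
      - name: app
        import_path: my_app:deployment   # hypothetical "module:variable" Serve app
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
```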
Ray Hosting at ByteDance
All internal Ray clusters are managed by KubeRay. ByteDance extends the open‑source version to support large‑scale job submission, persistent clusters for debugging, and single‑job RayJob hosting. The platform also provides authentication, history server, notebook integration, and other surrounding capabilities.
ByteDance workloads span graph computing, offline inference, large‑model training, and parallel computation across both offline and online scenarios.
Scenario Cases
Graph Computing
Ray Core is used to refactor ByteDance's internal graph engine: each graph operator runs as a Ray actor and communicates with its peers by rank, MPI-style. Ray's distributed capabilities and KubeRay's orchestration provide end-to-end fault tolerance, automatically restarting failed workers and restoring checkpoints from persistent storage.
Large‑Scale Offline Inference
Ray Data's streaming execution is employed for massive offline inference jobs that require high throughput and resource utilization but tolerate higher latency. Compared with Spark, Ray offers more flexible programming, enabling pipeline parallelism and model parallelism, along with actor-pool scaling and end-to-end fault tolerance.
Kueue Managing / Scheduling RayJob
Kueue is a Kubernetes-native job management and scheduling framework that provides queue-based scheduling with priority, preemption, and quota support. It natively handles the Kubernetes batch Job, RayJob, and TFJob workload types.
Kueue's architecture includes ResourceFlavor (an abstraction over node types), ClusterQueue (a cluster-wide resource pool), LocalQueue (a namespaced queue bound to a ClusterQueue), and Cohort (a group of ClusterQueues that can borrow unused quota from one another). Administrators define these resources, and users submit jobs to a specific LocalQueue. Jobs wait in a pending state until quota and priority conditions are met, optionally triggering cluster autoscaling.
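A minimal admin setup might wire these objects together as follows; names and quota numbers are illustrative.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  cohort: research                  # ClusterQueues in one cohort can borrow quota
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 40
            - name: memory
              nominalQuota: 160Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: team-a-cq
```

A user then enqueues a RayJob by labeling it `kueue.x-k8s.io/queue-name: team-a-queue`; Kueue holds it pending until the ClusterQueue has quota, and higher-priority jobs can preempt lower-priority ones.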
A KubeCon demo showed RayJob preemption and recovery across queues with different priorities.
Volcano Engine Developer Services
The Volcano Engine Developer Community (Volcano Engine's TOD community) connects the platform with developers, offering cutting-edge technical content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.