
How ByteDance’s Katalyst Enables Cloud‑Native Mixed Workloads and Cuts Costs

This article explains ByteDance's cloud-native mixed-workload strategy: how the open-source Katalyst system leverages Kubernetes and YARN to dynamically share idle resources between online and offline services, raising average utilization from 23% to 60% and reducing operational costs.


ByteDance Cloud‑Native Mixed‑Workload Practice

Internet applications experience daily resource usage fluctuations, creating a tidal pattern where peak‑time resources are often idle during off‑peak periods. By temporarily allocating these idle resources to lower‑priority services and returning them when needed, ByteDance achieves peak‑shaving and cost savings.

Challenges of Traditional Resource Pools

ByteDance’s massive, diverse services (microservices, search, machine learning, big data, storage) each have distinct infrastructure demands. Traditional siloed resource pools lead to resource islands, low utilization, higher operational burden, and hindered cost optimization.

Elastic Scaling and Mixed Deployment

Since 2016, ByteDance has built a unified Kubernetes‑based infrastructure. Their practice evolved into two complementary mechanisms:

Elastic scaling: enables machine-level and NUMA-level time-sharing of resources, guided by business and system metrics to dynamically adjust horizontal and vertical scaling, allowing offline services to acquire cheap idle resources and online services to secure premium peak resources.

Mixed deployment: provides resource oversubscription, leveraging unsold but unused cluster capacity for low-priority workloads while maintaining multi-dimensional isolation (CPU, memory, disk, network) and predictive load awareness to ensure stability.
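The oversubscription mechanism can be sketched as follows. This is a minimal model, not Katalyst's actual algorithm: the resellable ("reclaimed") capacity on a node is the unsold capacity plus the slack inside online pods' requests (requested but not actually used), minus a safety buffer for load spikes.

```python
# Sketch of resource oversubscription (illustrative, not Katalyst's real
# estimator): a node advertises "reclaimed" capacity equal to its
# allocated-but-unused resources, minus a buffer for sudden online spikes.

def reclaimed_capacity(allocatable: float, online_requests: float,
                       online_predicted_usage: float, buffer: float = 2.0) -> float:
    """Capacity (cores) that can be resold to low-priority workloads."""
    unsold = allocatable - online_requests              # never sold to anyone
    slack = online_requests - online_predicted_usage    # sold but predicted idle
    return max(0.0, unsold + slack - buffer)

# A node with 64 allocatable cores, 48 requested by online pods that are
# predicted to actually use only 20, keeping a 2-core buffer:
print(reclaimed_capacity(64, 48, 20))  # 64 - 20 - 2 = 42.0
```

The buffer and the prediction quality are what keep this safe: the multi-dimensional isolation described above is the backstop when the prediction is wrong.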

The solution integrates the Kubernetes and YARN control planes, running both on each node and using a central coordinator to decide how many resources each system is allowed to see. Real-time resource estimation based on service profiles ensures SLA compliance while improving flexibility.
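The coordinator's job can be sketched as a simple split policy. The policy below is invented for illustration (the real coordinator is driven by the profile-based estimation described above): when the node is oversubscribed, online (Kubernetes) demand wins and YARN sees only the remainder.

```python
# Hedged sketch of the central-coordinator idea: Kubernetes and YARN both run
# on every node, and a coordinator decides how much of the node each control
# plane may see. This split policy is an assumption for illustration only.

def split_visible_resources(node_cores: int, k8s_demand: int, yarn_demand: int):
    """Return (cores visible to Kubernetes, cores visible to YARN)."""
    if k8s_demand + yarn_demand <= node_cores:
        # Enough for both: each system sees what it asked for.
        return k8s_demand, yarn_demand
    # Oversubscribed: online (Kubernetes) demand takes priority.
    k8s_share = min(k8s_demand, node_cores)
    return k8s_share, node_cores - k8s_share

print(split_visible_resources(96, 40, 30))  # (40, 30) -- no contention
print(split_visible_resources(96, 80, 30))  # (80, 16) -- YARN gets the rest
```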

In practice, this approach raised whole-machine daily utilization from 23% to 60%.

Katalyst: From Internal Validation to Open Source

After extensive validation in large‑scale ByteDance services (Douyin, Toutiao), the team open‑sourced the resource‑control system as Katalyst, a Kubernetes‑native solution that acts as a catalyst for automated resource management.

What is Katalyst? It originates from ByteDance’s mixed‑workload practice and extends resource control, scheduling, and management capabilities. Key features include:

Born from ultra‑large‑scale mixed‑workload practice, fully integrated into ByteDance’s cloud‑native transformation.

Built on ByteDance’s Enhanced Kubernetes distribution, ensuring compatibility and access to internal core functions.

Plugin‑based architecture allowing custom scheduling, control, and policy modules.

One‑click deployment templates and comprehensive operation manuals to lower adoption barriers.

Resource Abstraction in Katalyst

Katalyst refines Kubernetes QoS by introducing multiple CPU core classes (system_core, dedicated_core, shared_core, reclaimed_core) and enhancement mechanisms (NUMA binding, network affinity, bandwidth limits) to enable differentiated resource allocation.
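To make the four core classes concrete, here is a toy mapping from workload characteristics to a class. The class names come from the article; the workload categories and the selection rules are assumptions for illustration, not Katalyst's actual policy.

```python
# Illustrative mapping from workload traits to Katalyst's CPU core classes.
# Class names are from the article; the mapping rules are invented here.

CORE_CLASSES = ("system_core", "dedicated_core", "shared_core", "reclaimed_core")

def pick_core_class(latency_sensitive: bool, needs_exclusive_cores: bool,
                    is_system_daemon: bool = False) -> str:
    if is_system_daemon:
        return "system_core"       # node-level agents and daemons
    if latency_sensitive and needs_exclusive_cores:
        return "dedicated_core"    # e.g. NUMA-bound latency-critical service
    if latency_sensitive:
        return "shared_core"       # ordinary online service sharing a core pool
    return "reclaimed_core"        # offline/batch jobs on reclaimed capacity

print(pick_core_class(True, True))    # dedicated_core
print(pick_core_class(False, False))  # reclaimed_core
```

Enhancements such as NUMA binding or bandwidth limits would then be layered on top of the chosen class rather than replacing it.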

Through this abstraction, users map services to appropriate QoS and selling models, obtaining resources from a unified pool without dealing with underlying pool details.

Architecture Design

The original mixed-workload architecture combined Kubernetes and YARN, incurring high maintenance cost and resource overhead. Katalyst consolidates control into a single Kubernetes-based system while preserving separate API entry points for Kubernetes and YARN.

In the scheduling layer, Katalyst implements both centralized and node-level coordination. Node-side scheduling uses an extended QoS Resource Manager (QRM) with plugin-driven micro-topology awareness and CRD reporting. Central scheduling extends the native Scheduler Framework to consider cross-QoS resource collaboration and dynamic rebalancing across the cluster.
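A score-style extension in the spirit of the Scheduler Framework can be sketched like this. This is a pure-Python model, not the real Go plugin API, and the scoring rule is an assumption: prefer nodes with the most free capacity in the QoS pool the pod targets.

```python
# Minimal model of a Scheduler Framework "Score"-style plugin (illustrative
# Python, not the real Go API): rank nodes by free capacity in the QoS pool
# the pod is assigned to, e.g. "reclaimed_core" for an offline job.

def score_node(node: dict, pod_qos: str, max_score: int = 100) -> int:
    """Score 0..max_score by the free-capacity ratio of the pod's QoS pool."""
    pool = node["pools"].get(pod_qos, {"capacity": 0, "used": 0})
    if pool["capacity"] == 0:
        return 0  # node exposes no capacity in this QoS class
    free_ratio = (pool["capacity"] - pool["used"]) / pool["capacity"]
    return int(free_ratio * max_score)

nodes = [
    {"name": "n1", "pools": {"reclaimed_core": {"capacity": 40, "used": 10}}},
    {"name": "n2", "pools": {"reclaimed_core": {"capacity": 40, "used": 30}}},
]
best = max(nodes, key=lambda n: score_node(n, "reclaimed_core"))
print(best["name"])  # n1 -- more free reclaimed capacity
```

In the real system this per-pool view would be populated from the CRDs that the node-side QRM reports, which is what lets the central scheduler reason across QoS classes.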

This unified control reduces resource waste, eliminates asynchronous race conditions, and enables seamless API-driven extensions, keeping the internal deployment and the open-source release built from the same source ("internal-external source parity").

Roadmap

Katalyst's core scenario is colocating online and offline workloads, with planned QoS capabilities such as fine-grained resource-leasing strategies, multi-dimensional isolation (cgroup, RDT, iocost, tc), and hierarchical load-eviction policies. Additional planned enhancements include elastic HPA/VPA, micro-topology-aware scheduling, and more.
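Hierarchical load eviction can be sketched as a priority ladder. The thresholds and the ladder below are assumptions for illustration, not Katalyst's planned policy: as node load rises, lower QoS classes are evicted first, and latency-sensitive classes are touched only under severe pressure.

```python
# Sketch of hierarchical load eviction (thresholds and ordering are invented
# for illustration): evict reclaimed_core pods first; shared_core pods only
# under severe load; system/dedicated cores are never evicted here.

EVICTION_ORDER = ["reclaimed_core", "shared_core"]

def pods_to_evict(pods, load, soft_threshold=0.8, hard_threshold=0.95):
    """Return names of pods to evict given a normalized node load (0..1)."""
    if load < soft_threshold:
        return []  # node healthy: evict nothing
    evictable = {"reclaimed_core"} if load < hard_threshold else set(EVICTION_ORDER)
    return [p["name"] for p in pods if p["qos"] in evictable]

pods = [{"name": "batch-1", "qos": "reclaimed_core"},
        {"name": "web-1", "qos": "shared_core"}]
print(pods_to_evict(pods, 0.50))  # []
print(pods_to_evict(pods, 0.85))  # ['batch-1']
print(pods_to_evict(pods, 0.97))  # ['batch-1', 'web-1']
```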

For detailed plans, see the roadmap documentation.

ByteDance invites the community to join the Katalyst open‑source project.

Project repository: github.com/kubewharf/katalyst-core

Tags: cloud-native, Kubernetes, resource management, mixed workloads, Katalyst
Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
