
How Katalyst Memory Advisor Optimizes Kubernetes Memory Management in Mixed Workloads

This article explains the challenges of memory management in mixed Kubernetes workloads, introduces ByteDance's open‑source Katalyst Memory Advisor, details native allocation and reclamation mechanisms, outlines its architecture and plugins, and describes interference detection and multi‑level mitigation strategies to improve memory utilization and service quality.

ByteDance Cloud Native

Katalyst is ByteDance's open‑source cost‑optimization system that addresses inefficient resource usage in cloud‑native environments by providing solutions for resource management and cost reduction.

In mixed workloads, memory management is critical: memory pressure can cause latency jitter or OOM kills, while unused memory that is never released shrinks the pool available to offline jobs, limiting effective overselling.

ByteDance distilled extensive experience from large‑scale offline mixed workloads into a user‑space Kubernetes memory management solution called Memory Advisor, which is open‑sourced in Katalyst.

Limitations of Native Solutions

Native kernel memory allocation and reclamation consist of a fast path and a slow path. The fast path checks whether the allocation would push free memory below the low watermark and, if so, may trigger a quick reclamation before allocating; otherwise allocation falls back to the slow path, which wakes kswapd for asynchronous reclamation, attempts memory compaction, performs global direct reclamation, and finally may trigger OOM.

Memory reclamation can target Memcg or Zone levels, including Memcg direct reclamation, global quick reclamation (only the pages needed for the current allocation), global asynchronous reclamation (triggered when free memory reaches the low watermark), and global direct reclamation (triggered at the min watermark, which is synchronous and impacts performance).
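The watermark levels that drive these paths can be inspected directly on a Linux node; a minimal sketch (values are in pages):

```shell
# Print each zone's free pages and min/low/high watermarks (in pages):
# kswapd wakes at "low"; synchronous global direct reclamation starts at "min".
awk '/^Node/               { zone = $0 }
     /pages free/          { print zone; print "  free", $3 }
     /^ +(min|low|high) /  { print " ", $1, $2 }' /proc/zoneinfo
```
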

Kubernetes Native Memory Management

Memory limits are set via the memory.limit_in_bytes cgroup interface, and eviction occurs when a node comes under memory pressure, marked by the node.kubernetes.io/memory-pressure taint. Eviction triggers when the node's working set exceeds a threshold, and victim pods are sorted by whether usage exceeds requests, by pod priority, and by the usage-over-request delta.
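The eviction signal itself can be sketched as follows; the capacity and working-set values are illustrative assumptions, while 100Mi is the kubelet's default hard threshold for memory.available:

```shell
# Sketch of the kubelet's memory eviction signal (sizes assumed;
# 100Mi is the default evictionHard threshold for memory.available).
capacity=$((8 * 1024 * 1024 * 1024))             # 8 GiB node capacity
working_set=$(( capacity - 50 * 1024 * 1024 ))   # only 50 MiB not in use
hard_threshold=$((100 * 1024 * 1024))            # memory.available < 100Mi
available=$(( capacity - working_set ))          # kubelet's memory.available
if [ "$available" -lt "$hard_threshold" ]; then
    echo "memory.available=$available: below threshold, evicting"
fi
```
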

OOM is triggered when direct reclamation cannot satisfy memory demand. The kubelet configures /proc/<pid>/oom_score_adj based on QoS: critical and Guaranteed pods get -997, BestEffort pods get 1000, and Burstable pods are scored as

min{max[1000 - (1000 * memoryRequest) / memoryCapacity, 1000 + guaranteedOOMScoreAdj], 999}

where guaranteedOOMScoreAdj is -997, so the score is clamped to the range [3, 999].
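For concreteness, here is the Burstable formula evaluated in shell for an assumed 2 GiB request on a 16 GiB node:

```shell
# Burstable oom_score_adj per the formula above (inputs assumed).
# guaranteedOOMScoreAdj is -997, so the lower clamp is 1000 - 997 = 3.
memory_request=$((2 * 1024 * 1024 * 1024))    # pod requests 2 GiB
memory_capacity=$((16 * 1024 * 1024 * 1024))  # node has 16 GiB
score=$(( 1000 - (1000 * memory_request) / memory_capacity ))
[ "$score" -lt 3 ]   && score=3     # never below the Guaranteed band
[ "$score" -gt 999 ] && score=999   # keep below BestEffort's 1000
echo "$score"    # 1000 - 125 = 875
```

A larger request relative to node capacity lowers the score, so better-provisioned pods are killed later.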

Memory QoS (available since v1.22 with cgroup v2) guarantees memory requests and ensures fair reclamation across pods. Configuration includes memory.min (based on requests.memory), memory.high (based on limits.memory or node allocatable memory multiplied by a throttling factor), and memory.max (based on limits.memory or node allocatable memory).

Improvements in v1.27 address issues where memory.high may be ineffective or overly aggressive, and adjust the default throttling factor from 0.8 to 0.9.
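A sketch of how these three values could be derived, following the article's simplified description; the pod and node sizes are illustrative assumptions:

```shell
# Memory QoS values per the article's simplified description
# (pod and node sizes are illustrative assumptions).
requests_memory=$((1 * 1024 * 1024 * 1024))    # requests.memory: 1 GiB
allocatable=$((16 * 1024 * 1024 * 1024))       # node allocatable; no limit set
throttle_pct=90                                # v1.27 default factor 0.9
memory_min=$requests_memory                          # guaranteed floor
memory_high=$(( allocatable * throttle_pct / 100 ))  # reclaim-throttling point
memory_max=$allocatable                              # hard cap
echo "min=$memory_min high=$memory_high max=$memory_max"
```
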

Limitations of Native Mechanisms

Global reclamation lacks fairness across pods.

Global reclamation lacks priority awareness, harming high‑priority online containers.

Native eviction may trigger too late, only after global reclamation has already degraded performance.

Memcg direct reclamation can cause latency spikes.

Katalyst Memory Advisor Architecture

The architecture follows a plug‑in design with a framework plus plugins, making it easy to extend functionality.

Katalyst Agent runs on each node and includes modules such as:

Eviction Manager – extends kubelet eviction with periodic plugin calls.

Memory Eviction Plugins – implement eviction strategies.

Memory QRM Plugin – manages Memcg configuration and cache dropping.

SysAdvisor – algorithm module supporting extensible strategies.

Reporter – reports memory‑pressure taints.

MetaServer – provides pod/container metadata and dynamic config.

Malachite collects metrics at node, NUMA, and container levels.

Katalyst Scheduler is the central scheduler with its own plugins.

Interference Detection

Memory Advisor periodically detects interference across multiple dimensions:

Machine and NUMA memory watermarks vs. low watermark.

Kswapd reclamation rate.

Pod‑level RSS over‑use.

QoS‑level memory satisfaction using reclaimed_cores.
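The kswapd dimension can be approximated from /proc/vmstat; a minimal sketch that samples the scan counters over a one-second window:

```shell
# Approximate kswapd's reclamation rate by sampling the pgscan_kswapd*
# counters in /proc/vmstat over a one-second window (Linux only).
s1=$(awk '/^pgscan_kswapd/ {sum += $2} END {print sum + 0}' /proc/vmstat)
sleep 1
s2=$(awk '/^pgscan_kswapd/ {sum += $2} END {print sum + 0}' /proc/vmstat)
echo "kswapd scanned $(( s2 - s1 )) pages/s"
```

A sustained nonzero rate means the node is continuously under background reclamation pressure.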

Multi‑Level Mitigation Measures

Based on the severity of detected anomalies, Memory Advisor applies mitigation actions:

Disable Scheduling – adds a node taint to prevent new pods.

Tune Memcg – raises Memcg reclamation thresholds for selected victim pods.

Drop Cache – forces cache release via memory.force_empty (cgroup v1) or memory.reclaim (cgroup v2), e.g. echo 0 > memory.force_empty.

Eviction – evicts pods based on QoS, priority, and memory usage.

Eviction sorting defaults to QoS level, then priority, then memory usage.
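Both cache-drop interfaces can be sketched as below; the cgroup paths are hypothetical and writing to them requires root, so each write is guarded:

```shell
# Cache release via both memcg interfaces; the cgroup paths here are
# hypothetical and the writes require root, so each one is guarded.
V1_CG=/sys/fs/cgroup/memory/offline-pod   # hypothetical cgroup v1 memcg
V2_CG=/sys/fs/cgroup/offline-pod          # hypothetical cgroup v2 group
if [ -w "$V1_CG/memory.force_empty" ]; then
    echo 0 > "$V1_CG/memory.force_empty"      # v1: reclaim all reclaimable pages
elif [ -w "$V2_CG/memory.reclaim" ]; then
    echo "512M" > "$V2_CG/memory.reclaim"     # v2: try to reclaim 512 MiB
else
    echo "no writable memcg interface at the assumed paths"
fi
```

Note the asymmetry: memory.force_empty reclaims as much as possible, while memory.reclaim accepts a target amount to reclaim.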

Offline Large‑Frame Management

Memory Advisor limits the total memory usage of offline pods (the “large frame”) by adjusting memory.limit_in_bytes via the Memory Guard plugin.
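One way such a cap could be computed; the formula and sizes below are illustrative assumptions, not Katalyst's actual policy:

```shell
# Illustrative "large frame" computation (formula and sizes assumed):
# cap total offline memory at capacity minus online usage minus a buffer.
capacity=$((32 * 1024 * 1024 * 1024))       # 32 GiB node
online_usage=$((20 * 1024 * 1024 * 1024))   # observed online-pod usage
buffer=$((2 * 1024 * 1024 * 1024))          # safety headroom
offline_limit=$(( capacity - online_usage - buffer ))
echo "$offline_limit"    # value for the offline memcg's memory.limit_in_bytes
```

As online usage grows, the offline limit shrinks, squeezing offline jobs before online services feel pressure.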

Memory Migration

For NUMA‑aware workloads (e.g., Flink), Memory Advisor monitors NUMA watermarks and dynamically migrates containers to balance memory pressure and avoid hotspots.

Memcg Differential Reclamation

Using veLinux's asynchronous Memcg reclamation, pods can specify custom reclamation watermarks via annotations to favor either page‑cache usage or aggressive reclamation.

OOM Priority Enhancement

Memory Advisor configures oom_score_adj for containers across QoS levels to ensure offline pods are killed before online pods under memory pressure.

Cold Memory Offloading (Future Work)

Inspired by Meta's Transparent Memory Offloading, Memory Advisor will use PSI and DAMON to detect cold memory and offload it to storage or compress it with zRAM, improving overall memory utilization.

Summary

Deployed on over 900,000 nodes at ByteDance, Katalyst manages tens of millions of cores, boosting daily resource utilization from 20% to 60% while maintaining stability across microservices, search, storage, big data, and AI workloads. Future iterations will further enhance cold memory offloading, memory migration, and other advanced techniques.

Tags: cloud-native, memory management, Kubernetes, Resource Optimization, Katalyst
Written by

ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.
