
Boost Kubernetes API Server Performance: Tuning max-mutating-requests-inflight & watch-cache-size

This guide explains how to optimize Kubernetes API Server performance by configuring the max-mutating-requests-inflight limit and watch-cache-size, offering recommended values for different cluster sizes, monitoring metrics, and step‑by‑step adjustment strategies for stable, high‑throughput clusters.

Limiting Mutating Requests: max-mutating-requests-inflight

Purpose

The max-mutating-requests-inflight flag limits the number of concurrent mutating (create, update, delete) requests the API Server will process. Mutating requests consume significant CPU and memory, so capping their concurrency helps prevent API Server overload and improves cluster stability.

Recommended Values

General recommendation: 500‑1000 concurrent requests per API Server instance.

Three‑node HA API Server cluster: roughly 3000 total (≈1000 per node).

Large‑scale cluster (≈5000 nodes) with five‑node HA API Server: about 5000 total (≈1000 per node).
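As a concrete illustration, the limit is set on the kube-apiserver command line (on kubeadm clusters this lives in the static pod manifest under /etc/kubernetes/manifests/kube-apiserver.yaml). A minimal sketch; the flag name is the real kube-apiserver flag, while the value is the per-instance recommendation above, not the default:

```shell
# --max-mutating-requests-inflight is a stock kube-apiserver flag (default 200).
# 1000 here reflects the per-node recommendation for an HA control plane.
kube-apiserver \
  --max-mutating-requests-inflight=1000
  # ...remaining flags unchanged...
```

Each API Server instance enforces its limit independently, which is why the per-node value is multiplied by the number of control-plane nodes to get the cluster-wide total.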

Monitoring and Adjustment

Key metrics

API Server response latency.

Success rate of mutating requests.
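If the cluster exposes API Server metrics to Prometheus, the two signals above map onto standard kube-apiserver metrics. PromQL sketches (the metric names are stock; the query windows and quantiles are illustrative choices):

```shell
# Current in-flight mutating requests, to compare against the configured limit:
#   sum(apiserver_current_inflight_requests{request_kind="mutating"})
#
# p99 latency of mutating requests over the last 5 minutes:
#   histogram_quantile(0.99, sum by (le) (
#     rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE"}[5m])))
#
# Success rate of mutating requests (non-5xx responses / total):
#   sum(rate(apiserver_request_total{verb=~"POST|PUT|PATCH|DELETE",code!~"5.."}[5m]))
#     /
#   sum(rate(apiserver_request_total{verb=~"POST|PUT|PATCH|DELETE"}[5m]))
```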

Adjustment strategy

Start from the official default (200 for max-mutating-requests-inflight), then increase or decrease gradually while observing the metrics.

Re‑evaluate whenever the cluster size or workload characteristics change.

Improving Real‑time Watch: watch-cache-size

Watch Mechanism Overview

Traditional polling repeatedly queries the API Server, causing high latency and unnecessary load. The Watch mechanism maintains a long‑lived connection that pushes resource change events to the client in real time.
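The difference is easy to see from the command line. A quick sketch (kubectl and the `?watch=true` query parameter are standard; the `default` namespace is just an example):

```shell
# Polling: every invocation is a separate, full LIST against the API Server.
kubectl get pods -n default

# Watch: one long-lived connection that streams change events as they happen.
kubectl get pods -n default --watch

# The same mechanism at the HTTP level, via the watch query parameter:
kubectl get --raw '/api/v1/namespaces/default/pods?watch=true'
```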

Watch Cache Role

Faster responses: Objects and events are cached in memory, avoiding frequent reads from etcd.

Higher cache‑hit rate: A larger cache can serve more watch requests directly from memory.

Parameter Definition

The watch-cache-size flag sets the maximum number of cached objects per resource type.

Small cache: Increases latency because more requests fall back to storage reads.

Large cache: Improves performance but consumes more RAM.
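On the command line this corresponds to two real kube-apiserver flags: --default-watch-cache-size for the per-resource default, and --watch-cache-sizes for per-resource overrides in resource[.group]#size format. A sketch with illustrative values:

```shell
# --default-watch-cache-size applies to resources without an explicit override;
# --watch-cache-sizes overrides individual resource types. Values are examples.
kube-apiserver \
  --default-watch-cache-size=200 \
  --watch-cache-sizes=pods#1000,nodes#500
  # ...remaining flags unchanged...
```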

Recommended Configuration

Formula: watch-cache-size = node_count × 2

Small cluster (≈100 nodes): cache ~200 objects.

Large cluster (≈5000 nodes): cache ~10,000 objects.
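The rule of thumb is simple enough to compute inline; a shell sketch (the node count is the large-cluster example above, not a measured value):

```shell
# Article's rule of thumb: watch-cache-size = node_count × 2
node_count=5000
watch_cache_size=$((node_count * 2))
echo "$watch_cache_size"   # prints 10000
```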

Monitoring and Gradual Tuning

Key metrics

Watch response latency.

Cache‑hit rate.

Adjustment steps

Start with the formula‑derived value.

Observe the metrics under real load and tune incrementally to match actual demand.
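For the latency side, the WATCH verb is labeled on the standard request-duration histogram; cache effectiveness shows up indirectly as apiserver-side etcd read traffic. PromQL sketches (metric names are stock kube-apiserver metrics; treating etcd read volume as a cache-miss proxy is a heuristic, not an exact hit-rate measurement):

```shell
# p99 watch latency over the last 5 minutes:
#   histogram_quantile(0.99, sum by (le) (
#     rate(apiserver_request_duration_seconds_bucket{verb="WATCH"}[5m])))
#
# Heuristic for cache misses: an undersized watch cache pushes more reads
# through to etcd, visible as rising apiserver-side etcd read volume:
#   sum(rate(etcd_request_duration_seconds_count{operation="list"}[5m]))
```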

Best Practices

Coordinate both flags: limit mutating concurrency while sizing the watch cache to balance CPU, memory, and I/O pressure.

Continuously monitor the listed metrics and adjust the settings dynamically as workloads evolve.

Seek an optimal trade‑off between performance gains and resource consumption for stable, high‑throughput clusters.

Tags: Kubernetes, Performance Tuning, API Server, Cluster operations, max-mutating-requests-inflight, watch-cache-size
Written by

Full-Stack DevOps & Kubernetes

Focused on sharing DevOps, Kubernetes, Linux, Docker, Istio, microservices, Spring Cloud, Python, Go, databases, Nginx, Tomcat, cloud computing, and related technologies.
