Boost Kubernetes API Server Performance: Tuning max-mutating-requests-inflight & watch-cache-size
This guide explains how to optimize Kubernetes API Server performance by configuring the max-mutating-requests-inflight limit and watch-cache-size, offering recommended values for different cluster sizes, monitoring metrics, and step‑by‑step adjustment strategies for stable, high‑throughput clusters.
Limiting Mutating Requests: max-mutating-requests-inflight
Purpose
The max-mutating-requests-inflight flag limits the number of mutating requests (create, update, patch, delete) the API Server will process concurrently; the compiled-in default is 200. Mutating requests consume significant CPU and memory, so capping their concurrency helps prevent API Server overload and improves cluster stability.
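On a kubeadm-managed control plane, the flag can be set in the API Server's static pod manifest. A minimal sketch (the manifest path and surrounding fields assume the standard kubeadm layout; the value 1000 is only an example):

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm layout)
spec:
  containers:
  - command:
    - kube-apiserver
    # Default is 200; raise only while watching latency and error rates.
    - --max-mutating-requests-inflight=1000
```

The kubelet restarts the static pod automatically after the manifest is saved.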
Recommended Values
General recommendation: 500‑1000 concurrent requests per API Server instance.
Three‑node HA API Server cluster: roughly 3000 total (≈1000 per node).
Large‑scale cluster (≈5000 nodes) with five‑node HA API Server: about 5000 total (≈1000 per node).
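The aggregate figures above are simply the per-instance limit multiplied by the number of API Server instances; a trivial sketch using the numbers from the list:

```shell
# Aggregate mutating-request capacity of an HA API Server group
per_instance=1000
instances=3
total=$((per_instance * instances))
echo "$total"   # prints 3000
```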
Monitoring and Adjustment
Key metrics
API Server response latency.
Success rate of mutating requests.
Adjustment strategy
Start from the official defaults, then increase or decrease gradually while observing the metrics.
Re‑evaluate whenever the cluster size or workload characteristics change.
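With Prometheus scraping the API Server, the two metrics above can be tracked with queries along these lines (a sketch; the metric names are the standard kube-apiserver ones, but label names and values can vary by Kubernetes version):

```promql
# p99 latency of mutating requests
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE"}[5m])) by (le))

# Success rate of mutating requests
sum(rate(apiserver_request_total{verb=~"POST|PUT|PATCH|DELETE",code=~"2.."}[5m]))
  / sum(rate(apiserver_request_total{verb=~"POST|PUT|PATCH|DELETE"}[5m]))

# How close current concurrency is to the configured limit
apiserver_current_inflight_requests{requestKind="mutating"}
```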
Improving Real‑time Watch: watch-cache-size
Watch Mechanism Overview
Traditional polling repeatedly queries the API Server, causing high latency and unnecessary load. The Watch mechanism maintains a long‑lived connection that pushes resource change events to the client in real time.
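Concretely, a watch is just a long-lived HTTP request with `watch=true` on a list endpoint; the server keeps the connection open and streams one JSON event per change. A generic sketch of the wire shape:

```
GET /api/v1/namespaces/default/pods?watch=true HTTP/1.1

{"type":"ADDED","object":{ ...Pod... }}
{"type":"MODIFIED","object":{ ...Pod... }}
{"type":"DELETED","object":{ ...Pod... }}
```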
Watch Cache Role
Faster responses: Objects and events are cached in memory, avoiding frequent reads from etcd.
Higher cache-hit rate: A larger cache can serve more watch requests directly from memory.
Parameter Definition
The watch-cache-size flag sets the maximum number of cached objects per resource type. (In current kube-apiserver releases this is exposed as default-watch-cache-size for the global default and watch-cache-sizes for per-resource overrides.)
Small cache: Increases latency because more requests fall back to storage (etcd) reads.
Large cache: Improves performance but consumes more RAM.
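A hedged configuration sketch using the current flag names (the per-resource list takes `resource[.group]#size` entries; the resources and sizes below are only examples):

```yaml
    - kube-apiserver
    # Global default cache size per resource type
    - --default-watch-cache-size=200
    # Larger caches for the chattiest resources (resource[.group]#size)
    - --watch-cache-sizes=pods#1000,nodes#500
```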
Recommended Configuration
Formula: watch-cache-size = node_count × 2
Small cluster (≈100 nodes): cache ~200 objects.
Large cluster (≈5000 nodes): cache ~10,000 objects.
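The formula-derived starting points can be sketched as:

```shell
# Starting point per the formula: watch-cache-size = node_count * 2
watch_cache_size() { echo $(( $1 * 2 )); }

watch_cache_size 100    # small cluster: prints 200
watch_cache_size 5000   # large cluster: prints 10000
```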
Monitoring and Gradual Tuning
Key metrics
Watch response latency.
Cache‑hit rate.
Adjustment steps
Start with the formula‑derived value.
Observe the metrics under real load and tune incrementally to match actual demand.
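Watch-cache observability differs across Kubernetes versions, so treat the queries below as a sketch: read latency is the standard request-duration histogram, and the open-watch gauge is named apiserver_longrunning_requests in recent releases (apiserver_longrunning_gauge in older ones).

```promql
# Read latency (LIST/GET are answered from the watch cache when possible)
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET"}[5m])) by (le, resource))

# Currently open watches, per resource
sum(apiserver_longrunning_requests{verb="WATCH"}) by (resource)
```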
Best Practices
Coordinate both flags: limit mutating concurrency while sizing the watch cache to balance CPU, memory, and I/O pressure.
Continuously monitor the listed metrics and adjust the settings dynamically as workloads evolve.
Seek an optimal trade‑off between performance gains and resource consumption for stable, high‑throughput clusters.
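Putting both knobs together, a combined API Server command line might look like this (a sketch using the example values from this guide; max-requests-inflight is the companion limit for read-only requests, default 400, and is often adjusted alongside the mutating limit):

```yaml
    - kube-apiserver
    - --max-mutating-requests-inflight=1000
    - --max-requests-inflight=2000
    - --default-watch-cache-size=200
    - --watch-cache-sizes=pods#1000
```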
Full-Stack DevOps & Kubernetes