
Investigation and Resolution of Kubernetes API Server and Etcd Performance Bottlenecks in the 58 Cloud Platform

The article analyzes a slowdown issue in the 58 Cloud Platform caused by an overloaded API Server and uneven Etcd load, explains the root causes—including load‑balancing failure and missing namespace segmentation—and presents concrete remediation steps such as DNS round‑robin, namespace partitioning, Etcd client upgrade, and cache‑enabled queries.

58 Tech

The 58 Cloud Platform is an internal service‑instance management system built on Kubernetes and Docker, designed for lightweight, efficient resource utilization and standardized deployment. In production, users reported slow container deployments and overall platform latency.

Problem Overview

Investigation revealed that one API Server instance (Server A) handled the majority of requests, leading to high CPU usage (up to 2000%) and network saturation, while the other two API Servers remained underutilized. Correspondingly, a single Etcd node (E1) exhibited excessive load and frequent time‑outs.

Kubernetes Basics

Kubernetes orchestrates containers via components such as Etcd (the backing store), Pods, ReplicationController, the Scheduler, the Kubelet, and the API Server. In a typical create‑Pod flow, the user submits a request to the API Server, which persists the object in Etcd; the Scheduler then assigns the Pod to a node, and the Kubelet on that node starts the containers.

Root‑Cause Analysis

1. Load‑Balancing Failure: The Tencent Gateway (TGW) load balancer received successful health‑check heartbeats only from Server A, so it directed all traffic to that one API Server. Consequently, that server's Etcd client (v2) consistently contacted a single Etcd endpoint, concentrating load on E1.

2. Etcd Access Inefficiency: The API Server queried Etcd directly for every Get/List request because the resourceVersion parameter was not set, bypassing its watch cache. Additionally, all Pods were stored under the default namespace, forcing global scans of /registry/pods/default, which degraded performance as the cluster grew.

Key Code Snippets

// Initialize Etcd client
func New(cfg Config) (Client, error) {
    c := &httpClusterClient{
        clientFactory: newHTTPClientFactory(cfg.transport(), cfg.checkRedirect(), cfg.HeaderTimeoutPerRequest),
        rand:          rand.New(rand.NewSource(int64(time.Now().Nanosecond()))),
        selectionMode: cfg.SelectionMode,
    }
    if err := c.SetEndpoints(cfg.Endpoints); err != nil {
        return nil, err
    }
    return c, nil
}

// Shuffle Etcd endpoints
func (c *httpClusterClient) SetEndpoints(eps []string) error {
    neps, err := c.parseEndpoints(eps)
    if err != nil {
        return err
    }
    c.Lock()
    defer c.Unlock()
    c.endpoints = shuffleEndpoints(c.rand, neps) // randomize order
    c.pinned = 0                                 // pin the first shuffled endpoint
    return nil
}

func shuffleEndpoints(r *rand.Rand, eps []url.URL) []url.URL {
    p := r.Perm(len(eps))
    neps := make([]url.URL, len(eps))
    for i, k := range p {
        neps[i] = eps[k]
    }
    return neps
}

The client attempts requests starting from the "pinned" endpoint; on failure it moves to the next shuffled endpoint, meaning a healthy API Server will keep using the same Etcd node unless an error occurs.

Solution

Switch the load‑balancing strategy to DNS round‑robin to evenly distribute traffic across all API Server instances.

Partition Pods into multiple namespaces instead of the default namespace, reducing the size of each Etcd directory scan.

Upgrade the API Server to use the Etcd v3 client, which periodically reshuffles endpoints, preventing a single Etcd node from becoming a hotspot.

Enable caching by setting resourceVersion (e.g., to 0) in Get/List calls, allowing the API Server to serve data from its watch cache rather than hitting Etcd each time.

Conclusion

The case demonstrates that prolonged unnoticed configuration issues—such as an ineffective load‑balancer and lack of namespace segmentation—can accumulate and eventually cause severe performance degradation. Continuous monitoring of Kubernetes component metrics, deep familiarity with default policies, and source‑level understanding are essential for proactive stability in production environments.

Tags: performance, cloud native, Kubernetes, load balancing, caching, namespace, Etcd
Written by 58 Tech, the official tech channel of 58, a platform for tech innovation, sharing, and communication.
