
Investigation and Resolution of Kubernetes API Server and Etcd Performance Bottlenecks in the 58 Cloud Platform

The article analyzes a slowdown issue in the 58 Cloud Platform caused by an overloaded API Server and uneven Etcd load, explains the root causes—including load‑balancing failure and missing namespace segmentation—and presents concrete remediation steps such as DNS round‑robin, namespace partitioning, Etcd client upgrade, and cache‑enabled queries.

58 Tech

The 58 Cloud Platform is an internal service‑instance management system built on Kubernetes and Docker, designed for lightweight, efficient resource utilization and standardized deployment. In production, users reported slow container deployments and overall platform latency.

Problem Overview

Investigation revealed that one API Server instance (Server A) handled the majority of requests, leading to high CPU usage (up to 2000%) and network saturation, while the other two API Servers remained underutilized. Correspondingly, a single Etcd node (E1) exhibited excessive load and frequent time‑outs.

Kubernetes Basics

Kubernetes orchestrates containers via components such as Etcd (the backing store), Pods, ReplicationController, the Scheduler, the Kubelet, and the API Server. In a typical create‑Pod flow, the user submits a request to the API Server, which persists the object in Etcd; the Scheduler then assigns the Pod to a node, and the Kubelet on that node starts the containers.

Root‑Cause Analysis

1. Load‑Balancing Failure: The Tencent Gateway (TGW) load balancer received successful health‑check heartbeats only from Server A, so it directed all traffic to that one API Server. Consequently, that server's Etcd client (v2) consistently contacted a single Etcd endpoint, concentrating load on E1.

2. Etcd Access Inefficiency: The API Server queried Etcd directly for every Get/List request because the resourceVersion parameter was not set, bypassing its watch cache. Additionally, all Pods were stored under the default namespace, forcing global scans of /registry/pods/default, which degraded performance as the cluster grew.

Key Code Snippets

// Initialize Etcd client
func New(cfg Config) (Client, error) {
    c := &httpClusterClient{
        clientFactory: newHTTPClientFactory(cfg.transport(), cfg.checkRedirect(), cfg.HeaderTimeoutPerRequest),
        rand:          rand.New(rand.NewSource(int64(time.Now().Nanosecond()))),
        selectionMode: cfg.SelectionMode,
    }
    if err := c.SetEndpoints(cfg.Endpoints); err != nil {
        return nil, err
    }
    return c, nil
}

// Shuffle Etcd endpoints
func (c *httpClusterClient) SetEndpoints(eps []string) error {
    neps, err := c.parseEndpoints(eps)
    if err != nil {
        return err
    }
    c.Lock()
    defer c.Unlock()
    c.endpoints = shuffleEndpoints(c.rand, neps) // randomize order
    c.pinned = 0                                 // pin the first shuffled endpoint
    return nil
}

func shuffleEndpoints(r *rand.Rand, eps []url.URL) []url.URL {
    p := r.Perm(len(eps))
    neps := make([]url.URL, len(eps))
    for i, k := range p {
        neps[i] = eps[k]
    }
    return neps
}

The client attempts requests starting from the "pinned" endpoint; on failure it moves to the next shuffled endpoint, meaning a healthy API Server will keep using the same Etcd node unless an error occurs.

Solution

Switch the load‑balancing strategy to DNS round‑robin to evenly distribute traffic across all API Server instances.

Partition Pods into multiple namespaces instead of the default namespace, reducing the size of each Etcd directory scan.

Upgrade the API Server to use the Etcd v3 client, which periodically reshuffles endpoints, preventing a single Etcd node from becoming a hotspot.

Enable caching by setting resourceVersion (e.g., to 0) in Get/List calls, allowing the API Server to serve data from its watch cache rather than hitting Etcd each time.

Conclusion

The case demonstrates that prolonged unnoticed configuration issues—such as an ineffective load‑balancer and lack of namespace segmentation—can accumulate and eventually cause severe performance degradation. Continuous monitoring of Kubernetes component metrics, deep familiarity with default policies, and source‑level understanding are essential for proactive stability in production environments.

Tags: performance, cloud native, Kubernetes, load balancing, caching, namespace, Etcd
Written by 58 Tech, the official tech channel of 58, a platform for tech innovation, sharing, and communication.
