
Capacity Management: Goals, Practices, and Optimization at Zhuanzhuan

This article outlines Zhuanzhuan’s capacity management approach: its objectives, development stages, water‑level metrics, resource optimization techniques, cluster capacity assessment, stress‑test indicators and standards, and scaling strategies. It shows how systematic capacity management reduces costs while keeping services stable.

Zhuanzhuan Tech

1 Background

As Zhuanzhuan’s business and user base continue to grow, the company has been investing heavily in hardware and infrastructure to meet demand. Resource utilization, however, has been declining, because early efforts focused on rapid feature delivery rather than efficiency. Now that the business is mature, stability, performance, redundancy, and disaster recovery have become priorities, which pushes resource costs up further. Capacity management is essential to maintain service quality and performance while reducing operational costs and improving resource utilization.

2 Goals of Capacity Management

Capacity management is defined by Baidu Baike as providing the required processing and storage capacity in an economical way at the right time. In essence, it balances risk and cost, ensuring stable service with minimal expense. The two main goals are:

Cost control: Provide the needed capacity and performance in the most cost‑effective manner and use resources efficiently.

Business support: Align with service level agreements (SLA) to guarantee continuous service, and use capacity planning to guide business and cost planning.

3 Development Stages

Zhuanzhuan’s capacity management has evolved through three stages:

Stage 1 – No capacity management: Services were mixed on physical machines and KVM VMs, causing resource contention.

Stage 2 – Data analysis and consolidation: The team analyzed availability and performance data, reduced mixed deployment, decommissioned KVM VMs, and adjusted configurations, which cut the server count and saved about 50% of IT resource costs.

Stage 3 – Cloud era: With mature stability and performance data, plus defined stress‑test and utilization standards, capacity management further balanced cost and quality, achieving another ~50% cost reduction.

4 Capacity Management

4.1 Capacity Water Level

The capacity water level is the ratio of actual resource consumption (physical servers, cloud instances, SaaS services) to total available resources. For example, if Service B has four cloud instances but only two are active, its utilization, and therefore its water level, is 50%. Collecting metrics for both layers (CPU, memory, disk, and NIC for cloud hosts; JVM memory, threads, GC frequency, QPS, and response time for application services) enables multi‑dimensional analysis and optimization.

Peak traffic occurs between 20:00 and 23:00, so capacity planning must be based on the water levels observed during this window.
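A minimal sketch of the peak‑window calculation, assuming water level is measured as used over total capacity for a single resource dimension. The sample data, the CPU‑core unit, and the function names are hypothetical and not taken from Zhuanzhuan's tooling; only the 20:00 to 23:00 window comes from the text.

```python
from datetime import datetime, time

# Hypothetical samples: (timestamp, used_cores, total_cores) for one service.
SAMPLES = [
    (datetime(2024, 1, 1, 3, 0), 4.0, 32.0),
    (datetime(2024, 1, 1, 14, 0), 9.6, 32.0),
    (datetime(2024, 1, 1, 20, 30), 19.2, 32.0),
    (datetime(2024, 1, 1, 22, 0), 22.4, 32.0),
]

# Daily peak window stated in the article.
PEAK_START = time(20, 0)
PEAK_END = time(23, 0)


def water_level(used, total):
    """Ratio of consumed to available resources, in the range 0..1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total


def peak_water_level(samples, start=PEAK_START, end=PEAK_END):
    """Highest water level observed inside the daily peak window."""
    levels = [
        water_level(used, total)
        for ts, used, total in samples
        if start <= ts.time() <= end
    ]
    if not levels:
        raise ValueError("no samples fall inside the peak window")
    return max(levels)


print(f"peak-window water level: {peak_water_level(SAMPLES):.0%}")
```

Taking the maximum inside the window is a conservative choice; a high percentile of the in‑window samples would be a reasonable alternative if single spikes should not drive planning.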

4.2 Resource Capacity Optimization

After identifying low water levels, resource waste can be reduced:

Service configuration reduction – Service A originally had 4 CPU cores and 8 GB of memory, with a peak CPU usage of 8% and memory usage of 72%. Keeping a 30% redundancy margin, the CPU allocation was reduced to 2 cores.

Memory formula – Service B’s container has 8 GB of memory; applying the JVM memory formula (JVM total = heap + thread‑stack size × thread count + constant overhead) suggests that a 7 GB container is sufficient.
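The arithmetic behind both adjustments can be sketched as follows. Two assumptions are made here. First, the 30% redundancy margin is read as "peak demand may use at most 70% of the allocation", which the article does not spell out. Second, the JVM inputs for Service B (heap, stack size, thread count, overhead) are illustrative values only, since the article gives the formula and the 7 GB result but not the measurements.

```python
import math

MB_PER_GB = 1024


def required_capacity(allocated, peak_utilization, margin):
    """Capacity needed so that peak demand leaves `margin` of it unused.

    Assumption: margin=0.30 means peak demand may fill at most 70%.
    """
    if not 0 <= margin < 1:
        raise ValueError("margin must be in [0, 1)")
    peak_demand = allocated * peak_utilization
    return peak_demand / (1 - margin)


def jvm_memory_mb(heap_mb, thread_stack_mb, thread_count, overhead_mb):
    """JVM total = heap + thread-stack size * thread count + constant overhead."""
    return heap_mb + thread_stack_mb * thread_count + overhead_mb


def container_size_gb(total_mb):
    """Round the estimate up to a whole number of GB."""
    return math.ceil(total_mb / MB_PER_GB)


# Service A, using the figures from the article.
cpu_needed = required_capacity(4, 0.08, 0.30)  # about 0.46 cores
mem_needed = required_capacity(8, 0.72, 0.30)  # about 8.2 GB
cpu_level_after = (4 * 0.08) / 2               # peak CPU level on 2 cores

# Service B, with HYPOTHETICAL inputs (not measurements from the article).
service_b_mb = jvm_memory_mb(
    heap_mb=5120, thread_stack_mb=1, thread_count=600, overhead_mb=1024
)
service_b_gb = container_size_gb(service_b_mb)

print(f"Service A CPU needed: {cpu_needed:.2f} cores")
print(f"Service A memory needed: {mem_needed:.2f} GB")
print(f"Service A peak CPU on 2 cores: {cpu_level_after:.0%}")
print(f"Service B container size: {service_b_gb} GB")
```

Under this reading, Service A's CPU demand fits comfortably in 2 cores (peak usage rises to about 16%), while its memory demand plus margin slightly exceeds 8 GB, which is consistent with only the CPU being reduced.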

Mixed deployment strategies further improve utilization, e.g., deploying low‑load services together with high‑load services on the same host to achieve better CPU usage.
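One simple way to reason about co‑locating services is first‑fit decreasing bin packing on peak CPU demand. This is a textbook heuristic swapped in for illustration, not the placement method described by the article; the service names, host size, and the 75% per‑host ceiling are all hypothetical.

```python
def first_fit_decreasing(demands, host_cores, max_level):
    """Place services on hosts so each host's peak CPU stays under a ceiling.

    demands:    mapping of service name -> peak CPU demand in cores
    host_cores: cores per host (all hosts assumed identical)
    max_level:  highest acceptable per-host water level, e.g. 0.75
    """
    budget = host_cores * max_level
    hosts = []
    # Largest services first, so small ones fill the remaining gaps.
    for name, cores in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        if cores > budget:
            raise ValueError(f"{name} does not fit on a single host")
        for host in hosts:
            if host["used"] + cores <= budget:
                host["services"].append(name)
                host["used"] += cores
                break
        else:
            # No existing host has room: open a new one.
            hosts.append({"services": [name], "used": cores})
    return hosts


# Hypothetical peak demands, in cores.
DEMANDS = {"search": 10, "order": 6, "report": 2, "cron": 1}

placement = first_fit_decreasing(DEMANDS, host_cores=16, max_level=0.75)
for i, host in enumerate(placement):
    print(i, host["services"], host["used"])
```

With these inputs the heavy `search` service shares a host with the light `report` service, and `order` shares one with `cron`. A real placement would also need to account for memory, I/O, and whether the services peak at the same time.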

4.3 Cluster Capacity

Water levels alone give only a rough picture; accurate capacity figures come from combining stress tests with water‑level data. Two methods are available: replaying logs or using TCPCopy to simulate real traffic against a single instance, and running full‑cluster stress tests to derive per‑instance capacity. Zhuanzhuan adopts the cluster‑level stress test for higher accuracy.
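A rough sketch of how a cluster‑level result might be turned into per‑instance capacity and an instance count. It assumes load is spread evenly across instances and that capacity is expressed in QPS; the figures and the 50% target water level are hypothetical.

```python
import math


def per_instance_capacity(cluster_max_qps, instance_count):
    """Average QPS one instance sustained at the cluster's critical point."""
    if instance_count <= 0:
        raise ValueError("instance_count must be positive")
    return cluster_max_qps / instance_count


def instances_needed(expected_peak_qps, capacity_per_instance, target_water_level):
    """Instances required to serve the expected peak at the target water level."""
    if not 0 < target_water_level <= 1:
        raise ValueError("target_water_level must be in (0, 1]")
    usable_qps = capacity_per_instance * target_water_level
    return math.ceil(expected_peak_qps / usable_qps)


# Hypothetical stress-test result: 8 instances sustained 12,000 QPS in total.
capacity = per_instance_capacity(12_000, 8)
needed = instances_needed(9_000, capacity, target_water_level=0.5)

print(f"per-instance capacity: {capacity:.0f} QPS")
print(f"instances needed for 9,000 QPS at 50% water level: {needed}")
```

Dividing the cluster total by the instance count captures shared dependencies (load balancer, database, cache) that a single‑instance replay would miss, which is one plausible reason the cluster‑level method is the more accurate of the two.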

4.4 Stress‑Test Indicators

Stress‑test metrics are divided into system indicators (CPU, memory, disk I/O, network bandwidth) and service indicators (response time, latency percentiles, error rate, slow‑request ratio).

4.5 Stress‑Test Standards

Resource bottlenecks manifest as increased response time and error rate. The testing standard defines a critical point where performance degrades sharply. Specific thresholds include:

Error rate: ≤1 % for A‑level services, ≤3 % for B‑level, ≤5 % for C/D/E‑level.

Response time: Median ≤ 2× average, 90th percentile ≤ 5× average, 99th percentile must be at least twice the 90th percentile gap.

These standards help identify services that exceed capacity and require scaling.
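The error‑rate and latency thresholds above could be checked against a stress‑test run roughly as follows. The function name and the input shape are assumptions. The 99th‑percentile rule is left out because its exact form is not clear from the text.

```python
from statistics import mean, median, quantiles

# Maximum error rate per service level, as listed in the article.
MAX_ERROR_RATE = {"A": 0.01, "B": 0.03, "C": 0.05, "D": 0.05, "E": 0.05}


def check_stress_result(level, requests, errors, latencies_ms):
    """Return the names of the standards that a stress-test run violates."""
    violations = []

    if errors / requests > MAX_ERROR_RATE[level]:
        violations.append("error_rate")

    avg = mean(latencies_ms)
    if median(latencies_ms) > 2 * avg:
        violations.append("median")

    # quantiles(n=10) returns nine cut points; the last is the 90th percentile.
    p90 = quantiles(latencies_ms, n=10)[-1]
    if p90 > 5 * avg:
        violations.append("p90")

    return violations


# Hypothetical run: flat latency but a 2% error rate.
flat = [10] * 100
print(check_stress_result("A", 1000, 20, flat))
print(check_stress_result("C", 1000, 20, flat))

# Hypothetical run: no errors but a heavy latency tail.
tail = [1] * 89 + [100] * 11
print(check_stress_result("B", 1000, 0, tail))
```

The same 2% error rate fails an A‑level service but passes a C‑level one, which is how tiered thresholds let critical services trigger scaling earlier.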

5 Scaling (Expansion and Shrinkage)

Based on capacity data, Zhuanzhuan implements scheduled auto‑scaling during promotional events and elastic scaling for daily operations, ensuring service stability while optimizing resource usage.
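One way the two modes might combine is sketched below: elastic scaling reacts to the current water level, while a schedule sets a replica floor for promotional windows. The thresholds, the target level, the minimum replica count, and the event dates are all hypothetical; the article does not publish its actual scaling rules.

```python
import math
from datetime import datetime

# Hypothetical thresholds, not taken from the article.
SCALE_OUT_ABOVE = 0.75
SCALE_IN_BELOW = 0.25
TARGET_LEVEL = 0.5


def elastic_replicas(current, water_level, min_replicas=2):
    """Resize toward the target level only when outside the comfort band."""
    if SCALE_IN_BELOW <= water_level <= SCALE_OUT_ABOVE:
        return current
    desired = math.ceil(current * water_level / TARGET_LEVEL)
    return max(desired, min_replicas)


def scheduled_floor(now, events):
    """Largest replica floor among scheduled events active at `now`.

    events: list of (start, end, replicas) tuples; end is exclusive.
    """
    return max((r for start, end, r in events if start <= now < end), default=0)


def desired_replicas(current, water_level, now, events):
    """Elastic decision, never dropping below an active scheduled floor."""
    return max(elastic_replicas(current, water_level), scheduled_floor(now, events))


# Hypothetical one-day promotion that pins the service to 30 replicas.
EVENTS = [(datetime(2024, 11, 11), datetime(2024, 11, 12), 30)]

print(desired_replicas(10, 0.875, datetime(2024, 11, 10, 21, 0), EVENTS))
print(desired_replicas(10, 0.5, datetime(2024, 11, 11, 20, 0), EVENTS))
```

Treating the schedule as a floor rather than a fixed size means elastic scaling can still add capacity if a promotion exceeds its forecast.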

6 Summary

Capacity management is a complex engineering discipline that requires clear strategies, defined processes, and continuous refinement to achieve cost reduction and efficiency. It underpins service stability and resource cost control, and with the maturation of intelligent operations, it drives the organization toward lower cost and higher quality goals.
