
Stability Quality Assurance: Definitions, Metrics, and Implementation Guide

This article explains the origins and meaning of software stability and stability testing, outlines key standards such as GB/T 16260 and industry definitions, and presents a comprehensive framework for stability quality assurance covering system elements, external disturbances, baseline setting, robust design, monitoring, and rapid incident response.

NetEase Smart Enterprise Tech+

Mar 1, 2023


What Is Stability?

According to Baidu Baike, stability means “steady and fixed; without change.” In software, stability is relative rather than absolute, much as two cars travelling at the same speed in the same direction are stable with respect to each other.

If software is compared to a car, functionality is the ability to drive, performance is the speed it reaches and the fuel it consumes, and stability is the ability to keep driving smoothly and continuously at that speed and consumption. Stability is not an inherent quality attribute but a manifestation of quality over time.

What Is Software Stability?

GB/T 16260 defines stability as “the ability of software to avoid unexpected results caused by software modifications.” The senior software exam describes it in terms of the error rate and the performance degradation trend under a given load over a run period, and it also takes into account the stability of the application and database servers.
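The exam-style definition points at two quantities that can be measured during a long-running load test: the error rate and the degradation trend. The following is a minimal sketch, assuming a hypothetical `Sample` record per measurement interval and taking the least-squares slope of p95 latency as the trend; neither the record shape nor the choice of p95 comes from the definitions above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One measurement interval from a load test (hypothetical shape)."""
    minute: float            # minutes since the test started
    requests: int            # requests served in the interval
    errors: int              # failed requests in the interval
    p95_latency_ms: float    # 95th percentile latency in the interval


def error_rate(samples: Sequence[Sample]) -> float:
    """Failed requests divided by all requests over the whole run."""
    total = sum(s.requests for s in samples)
    return sum(s.errors for s in samples) / total if total else 0.0


def latency_slope(samples: Sequence[Sample]) -> float:
    """Least-squares slope of p95 latency over time, in ms per minute.

    A slope near zero suggests steady behaviour; a persistently positive
    slope suggests degradation under sustained load.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = sum(s.minute for s in samples) / n
    mean_y = sum(s.p95_latency_ms for s in samples) / n
    sxx = sum((s.minute - mean_x) ** 2 for s in samples)
    sxy = sum((s.minute - mean_x) * (s.p95_latency_ms - mean_y) for s in samples)
    return sxy / sxx if sxx else 0.0
```

What counts as an acceptable slope or error rate depends on the baseline agreed for the system; the sketch only produces the numbers to compare against it.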

The China Academy of Information and Communications Technology’s 2022 guide defines system stability as the system’s ability to return to its original equilibrium after external disturbances disappear.

In summary, system stability focuses on the system’s ability to remain unchanged despite software changes and external disturbances.

Key Concepts for Building Stability Quality Assurance

System Elements

System elements are the basic components or units that constitute a system. Risks within the software development process, such as human error, process gaps, technical defects, component interactions, and changes, can affect stability.

In 2022, an Atlassian outage was caused by communication gaps and insufficient system alerts.

Also in 2022, a Google Search and Maps outage was caused by software update errors.

External Disturbances

External disturbances are changes not caused by the system itself, such as earthquakes, cable cuts, or DDoS attacks. Although each one is unlikely, such events do occur around the world, so their impact must be mitigated.

In 2022, a Microsoft outage was caused by unexpected electrical transients in redundant power components.

In 2022, Oracle suffered increased latency when record summer heat affected its cooling systems.

In 2022, an AWS EC2 outage was caused by a power failure in an availability zone.

Stable State

Stability is a relative state, so it is evaluated against baselines and expectations. The national standard GB/T 25000.10‑2016 breaks the influences on quality in use into several paths, each of which can be tracked with measurable indicators (for example, the eight product quality characteristics, SLA/SLO targets, and MTTR).
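One common way to turn a baseline into something checkable is to compare measured availability with an SLO target and track how much of the error budget remains. The sketch below assumes a request-based SLO; the function names and the 99.9% target in the usage note are illustrative and not taken from the standard.

```python
def availability(good_events: int, total_events: int) -> float:
    """Fraction of events that met the service expectation."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int,
                           slo_target: float) -> float:
    """Share of the error budget still unspent for the measured window.

    1.0 means no budget used, 0.0 means exactly used up, and a negative
    value means the SLO was breached.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target leaves no budget: any failure is a breach.
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad
```

For example, with 1,000,000 requests, 500 failures, and a 99.9% target, about half of the budget remains; with 2,000 failures the result is negative and the baseline has been violated.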

Why Perform Quality Assurance?

Quality control and testing verify product outcomes, but addressing issues only after release is costly. Quality assurance (QA) aims to prevent problems early, reducing time and effort needed for later fixes.

Quality Assurance vs. Quality Control vs. Testing

Quality Assurance (QA): Planned and systematic activities to ensure products/services meet quality requirements.

Quality Control (QC): Technical and managerial activities to achieve quality requirements.

Testing: Executing programs under defined conditions to find defects and assess compliance.

The hierarchy is QA > QC > Testing: testing is one part of QC, and QC is one part of QA. QA focuses on prevention through “shift‑left” and “shift‑right” testing.

Stability Quality Assurance Framework

1. Build Robustness – Design for Failure

Address all lifecycle factors:

People: Reduce negative impacts from human errors.

Process: Avoid gaps or decay in procedures.

Technology: Mitigate risks from technical defects.

Components/Systems: Manage internal component relationships and architectural risks.

Change: Control risks introduced by changes.

2. Defend Against External Disturbances

Identify common external impacts (infrastructure issues, third‑party dependencies, abnormal user actions) and devise strategies for each impact scope and phase.
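For third‑party dependencies, one widely used strategy is a circuit breaker, which stops calling a failing dependency for a while so that its outage does not drag the caller down. The article does not prescribe this pattern; the sketch below is an illustration under that assumption, and the class name, thresholds, and injectable `clock` parameter are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures."""

    def __init__(self, failure_threshold: int = 3,
                 reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        """True while calls are being rejected."""
        if self._opened_at is None:
            return False
        return self._clock() - self._opened_at < self.reset_timeout_s

    def call(self, fn: Callable[[], T]) -> T:
        """Run fn unless the breaker is open; track consecutive failures."""
        if self.is_open:
            raise CircuitOpenError("dependency calls are suspended")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Open (or re-open after a failed trial call).
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Once the timeout elapses, the next call is let through as a trial: success closes the breaker, failure opens it again. Callers would typically catch `CircuitOpenError` and fall back to a cached or degraded response.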

3. Timely Detection – Multi‑Dimensional Monitoring

Monitor along two axes: the dimensions that affect quality (process quality, product quality, quality in use) and the layers of the user access chain (experience, application service, component, host, infrastructure).
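Monitoring every layer of the access chain also helps narrow down where a fault originates. The sketch below assumes each layer reports a single healthy or unhealthy signal and that faults tend to propagate upward from lower layers; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
from typing import Mapping, Optional

# Layers of the user access chain, ordered from the user down to the infrastructure.
LAYERS = (
    "experience",
    "application service",
    "component",
    "host",
    "infrastructure",
)


def deepest_unhealthy_layer(healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the lowest layer that is not healthy, or None if all are healthy.

    Assumption: a fault in a lower layer usually surfaces in the layers above
    it, so the lowest failing layer is the first place to investigate.
    A layer with no data is treated as unhealthy, since a silent probe is
    itself worth investigating.
    """
    for layer in reversed(LAYERS):
        if not healthy.get(layer, False):
            return layer
    return None
```

If the experience, application service, and host layers all report problems while component and infrastructure are healthy, the function returns `"host"`, suggesting the investigation start there rather than at the user-facing symptoms.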

4. Rapid Resolution – Emergency Response

Since 100% reliability is impossible, focus on fast fault handling:

Fast Identification: Precisely locate the fault.

Fast Decision: Choose the appropriate solution.

Fast Execution: Implement the fix efficiently.

Taken together, the four parts of the framework reduce both the probability that faults occur and the impact when they do, which matches the twin goals of increasing MTBF and decreasing MTTR.
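The link between these two goals and availability can be made concrete with the standard steady-state relation availability = MTBF / (MTBF + MTTR). The article does not state this formula; the sketch below applies it to hypothetical figures, and the function names are illustrative.

```python
def mtbf_hours(observation_hours: float, downtime_hours: float, failures: int) -> float:
    """Mean time between failures: operating time divided by failure count."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return (observation_hours - downtime_hours) / failures


def mttr_hours(downtime_hours: float, failures: int) -> float:
    """Mean time to recovery: total downtime divided by failure count."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return downtime_hours / failures


def steady_state_availability(mtbf: float, mttr: float) -> float:
    """Availability implied by MTBF and MTTR (both in the same unit)."""
    return mtbf / (mtbf + mttr)
```

For a 720-hour month with 4 failures and 2 hours of total downtime, MTBF is 179.5 hours and MTTR is 0.5 hours, giving an availability of about 99.72%. Halving MTTR or doubling MTBF each move that figure in the same direction, which is why the framework pursues both.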

Key Metrics for Incident Management

MTTI (Mean Time to Identify): Average time to detect a service/component issue.

MTTK (Mean Time to Know): Average time to determine the root cause.

MTTF (Mean Time to Fix): Average time to resolve the issue. (Elsewhere MTTF often stands for Mean Time To Failure; in this article it means time to fix.)

MTTV (Mean Time to Verify): Average time to confirm the fix.

In practice, MTTK and MTTF consume most of MTTR.
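These metrics can be computed from incident timelines. The sketch below assumes each incident records five timestamps and that MTTR is the sum of the four phases, which is the reading the sentence above implies; the `Incident` fields are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Dict, Sequence


@dataclass(frozen=True)
class Incident:
    """Timeline of one incident (hypothetical field names)."""
    started: datetime           # fault begins
    identified: datetime        # issue detected
    root_cause_known: datetime  # cause determined
    fixed: datetime             # fix applied
    verified: datetime          # fix confirmed


def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60.0


def phase_means_minutes(incidents: Sequence[Incident]) -> Dict[str, float]:
    """Average duration of each phase in minutes, plus their sum as MTTR."""
    result = {
        "MTTI": mean(_minutes(i.started, i.identified) for i in incidents),
        "MTTK": mean(_minutes(i.identified, i.root_cause_known) for i in incidents),
        "MTTF": mean(_minutes(i.root_cause_known, i.fixed) for i in incidents),
        "MTTV": mean(_minutes(i.fixed, i.verified) for i in incidents),
    }
    # Assumption: MTTR is treated as the sum of the four phases.
    result["MTTR"] = sum(result.values())
    return result
```

Breaking MTTR down this way shows which phase dominates for a given team, and therefore whether better tooling for diagnosis or faster fix execution would pay off more.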

Enabling Tools and Personnel

Tools such as observability platforms (metrics, logs, distributed tracing) help narrow down where a fault lies, while skill development (drills, rehearsals) strengthens the team's ability to investigate manually.

Emergency Preparedness

Develop and regularly rehearse emergency plans, build fast‑recovery platforms, and standardize response processes to ensure coordinated, orderly fault recovery.

Stability Quality Assurance Evolution in Cloud Commerce

The journey includes four stages:

Quality Control: Focus on testing to discover issues.

Quality Built‑In: Improve MTBF through process checkpoints and architectural changes.

Risk Prevention: Treat risk‑to‑fault as a continuum, emphasizing multi‑dimensional inspection and alerts.

Fault Management: Close the loop on incidents, emphasizing rapid identification, resolution, and verification to reduce MTTR.

After years of iteration, the stability quality assurance system has largely taken shape.

Conclusion

This article has covered the origins of stability quality assurance and the directions for building it at a conceptual level, without going into specific implementation details. Readers who want to go deeper are welcome to continue the discussion.
