How KuJiaLe Built a Scalable Stability System: Real‑World SRE Lessons
This article shares KuJiaLe's experience tackling stability challenges caused by rapid user growth and system complexity, detailing their organizational, process, cultural, and technical approaches—including goal setting, a stability committee, monitoring, incident response, change control, and regular drills—to achieve measurable improvements in reliability and performance.
Background
This article covers the stability problems and challenges KuJiaLe faced, its overall approach to stability work, the practices and safeguard systems it built, and the lessons learned along the way. Stability work is complex, and the goal here is to explore best practices together.
1. Problems and Challenges
As user volume and system complexity grew, keeping KuJiaLe's systems stable became harder. Analysis of historical incidents shows that functional defects, design defects, and process issues account for nearly 80% of failures, spanning both long-standing technical debt and newly emerging fault types. Careful retrospectives traced the main problems to three areas:
Capability issues
Awareness issues
Process and mechanism issues
Compared with large companies that have built mature stability systems—cloud‑native observability, rapid recovery, platform construction, intelligent and digital operations, and comprehensive processes—KuJiaLe lags in monitoring speed, fault detection rate, recovery methods, platform integration, resource allocation, and clear metrics.
2. KuJiaLe's Stability Work Approach
The approach tackles these problems from four angles: organizational management, process construction, data operations and culture, and systems and capabilities, advancing step by step through daily work.
1) Set annual stability goals (e.g., number of high‑priority faults, mean time to recovery, volume of high‑priority alerts), cascading from the CTO down to business‑line technical directors, managers, and frontline engineers. The CTO authorizes a stability committee to supervise and track each business line, making responsibilities explicit.
2) Refine and optimize stability‑related processes, assign owners, define metrics, and promote adoption so that processes are actually followed. After piloting them offline, embed the processes into IT systems to avoid wasted effort.
3) Extract core result indicators and key process indicators from the processes, regularly publish stability target data and operational metrics, and drive business‑line improvements, forming a culture of accountability.
4) Consolidate capabilities and experience into platform construction, gradually building a systematic capability that moves from single‑point breakthroughs to comprehensive coverage by focusing on change control, emergency handling, and other pain points.
3. Organizational Guarantee for Stability – Three‑Level Responsibility
Each product line and agile team must carry part of the stability indicators in their OKRs, with completion rates included in performance assessments.
Stability work requires cooperation among development, testing, SRE, operations, monitoring, middleware, and other roles, so organizational management must ensure each party can fulfill its duties while collaborating efficiently.
The primary responsibility lies with the business development teams. Business‑line CTOs, managers, and owners jointly bear stability results, forming a three‑level responsibility system: CTO → Development Manager → Application Owner/Frontline Engineer.
KuJiaLe also created a “Stability Committee” as a horizontal virtual organization, composed of elite members from various teams, authorized by the CTO and technical directors to manage daily stability work, including process formulation, supervision, accountability, and tracking.
4. Stability Culture Construction
With organizational guarantees in place, cultural awareness is cultivated through regular promotion, posters, weekly and monthly reports, and activities such as emergency drills, which improve team emergency response speed and problem‑locating ability.
Training and sharing are essential: new‑employee onboarding includes stability training, theory exams, and practical drills. Successful teams are encouraged to document best practices and share them across business lines.
Reward and penalty mechanisms are established: stability awards recognize outstanding individuals and teams for monitoring, drills, emergency handling, and post‑mortem work; penalties apply to violations of red‑line policies, reflected in performance assessments.
5. Capability Issues
Key capability pain points include alert governance, emergency handling, and change control.
5.1 Alert Governance
KuJiaLe experiences over 180 high‑priority alerts daily, putting pressure on developers and indicating an unhealthy system that needs continuous optimization.
Many alerts and inspection findings remain unresolved due to lack of follow‑up or difficulty in root‑cause analysis, becoming hidden fault risks.
The goal of monitoring and inspection is to proactively discover and solve problems before they cause incidents.
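To make alert governance a bit more concrete, here is a minimal sketch of one common tactic: grouping repeated firings of the same rule on the same service within a short time window, so a burst of identical alerts reaches on‑call engineers as a single aggregated notification instead of dozens of pages. The field names and the 10‑minute window are illustrative assumptions, not KuJiaLe's actual alert schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Alert:
    # Hypothetical alert record; field names are illustrative only.
    rule: str          # e.g. "http_5xx_rate"
    service: str       # e.g. "order-service"
    severity: str      # e.g. "P1"
    fired_at: datetime


def aggregate_alerts(alerts: list[Alert], window_minutes: int = 10) -> list[dict]:
    """Collapse repeated firings of the same rule/service in the same time bucket."""
    groups: dict[tuple, list[Alert]] = defaultdict(list)
    for alert in alerts:
        bucket = int(alert.fired_at.timestamp() // (window_minutes * 60))
        groups[(alert.rule, alert.service, bucket)].append(alert)

    return [
        {
            "rule": rule,
            "service": service,
            "severity": fired[0].severity,
            "count": len(fired),  # one notification instead of len(fired) pages
            "first_fired": min(a.fired_at for a in fired).isoformat(),
        }
        for (rule, service, _), fired in groups.items()
    ]
```

Aggregation like this reduces noise, but the underlying rule still needs tuning or the offending defect needs fixing; otherwise the alert count stays high even if the paging volume drops.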
5.2 Emergency Handling
Problems: unclear responsibilities during emergencies, fragmented information across multiple groups, and scattered post‑mortem documentation.
Solution: a unified emergency process covering response, judgment, notification, group creation, escalation, resolution, and verification, with clear owners for each step.
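A rough sketch of how such a process might be encoded so that each step has an explicit order and owner is shown below. The step names mirror the process described above; the owner roles and the simple linear state machine are assumptions for illustration, not KuJiaLe's actual tooling.

```python
from enum import Enum


class IncidentStep(Enum):
    RESPONSE = 1
    JUDGMENT = 2
    NOTIFICATION = 3
    GROUP_CREATION = 4
    ESCALATION = 5
    RESOLUTION = 6
    VERIFICATION = 7


# Illustrative owner mapping; real assignments would come from the on-call roster.
STEP_OWNERS = {
    IncidentStep.RESPONSE: "on-call engineer",
    IncidentStep.JUDGMENT: "application owner",
    IncidentStep.NOTIFICATION: "SRE duty officer",
    IncidentStep.GROUP_CREATION: "SRE duty officer",
    IncidentStep.ESCALATION: "development manager",
    IncidentStep.RESOLUTION: "application owner",
    IncidentStep.VERIFICATION: "QA / business verifier",
}


class Incident:
    """Tracks which step an incident is in and enforces the step order."""

    def __init__(self, incident_id: str):
        self.incident_id = incident_id
        self.step = IncidentStep.RESPONSE

    def advance(self) -> IncidentStep:
        if self.step is IncidentStep.VERIFICATION:
            raise RuntimeError("Incident already verified and closed.")
        self.step = IncidentStep(self.step.value + 1)
        print(f"{self.incident_id}: now at {self.step.name}, "
              f"owned by {STEP_OWNERS[self.step]}")
        return self.step
```

The value of writing the process down this way is less about the code and more about removing ambiguity: at any moment there is exactly one current step and one accountable owner.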
5.3 Change Management
Frequent changes (average 350+ per day across 12+ systems) make fault localization difficult. KuJiaLe unified 95% of change systems into a change‑management platform, linked changes with alerts, inspections, and emergency analysis, and enforced change‑freeze controls during critical periods.
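One way change‑freeze controls can be enforced at the platform level is a pre‑deployment check that blocks releases during declared freeze windows unless an emergency approval is attached. The sketch below is a simplified assumption of such a gate; the window definitions, dates, and approval flag are hypothetical, not KuJiaLe's configuration.

```python
from datetime import datetime

# Hypothetical freeze windows (start, end, reason); real windows would be
# configured in the change-management platform.
FREEZE_WINDOWS = [
    (datetime(2024, 6, 17), datetime(2024, 6, 19), "mid-year promotion"),
]


def can_deploy(change_time: datetime, emergency_approved: bool = False) -> bool:
    """Return True if a change may proceed at change_time."""
    for start, end, reason in FREEZE_WINDOWS:
        if start <= change_time <= end:
            if emergency_approved:
                print(f"Freeze ({reason}) overridden by emergency approval.")
                return True
            print(f"Blocked: change freeze in effect ({reason}).")
            return False
    return True
```

Linking the same change records to alerts and incidents then makes the reverse lookup ("what changed just before this fault?") a query instead of a manual hunt across a dozen systems.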
6. Monitoring, Inspection, and Incident Response
SRE and monitoring teams built a 24/7 monitoring system that aggregates high‑priority alerts, pushes them to a company‑wide monitoring group, and creates tasks for responsible developers. Daily reports summarize alert counts, key warnings, and business volume, driving timely follow‑up.
Automated daily inspections cover cloud servers, middleware, network, and applications; discovered anomalies generate prioritized tasks assigned to owners.
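As a hedged illustration of the inspection‑to‑task loop, the sketch below turns an inspection finding into a prioritized follow‑up task assigned to the resource's owner. The thresholds, field names, and priority labels are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    resource: str   # e.g. "mysql-order-01"
    check: str      # e.g. "disk_usage"
    value: float    # measured ratio, 0.0 - 1.0
    owner: str      # responsible engineer


# Assumed thresholds: (warning, critical) per check type.
THRESHOLDS = {
    "disk_usage": (0.80, 0.90),
    "cpu_usage": (0.70, 0.85),
}


def create_task(finding: Finding) -> Optional[dict]:
    """Turn an inspection finding into a prioritized follow-up task, if warranted."""
    warn, crit = THRESHOLDS.get(finding.check, (None, None))
    if warn is None or finding.value < warn:
        return None  # healthy or unknown check: no task
    priority = "P1" if finding.value >= crit else "P2"
    return {
        "title": f"[Inspection] {finding.check} on {finding.resource}",
        "priority": priority,
        "assignee": finding.owner,
        "detail": f"Observed {finding.value:.0%}, warning threshold {warn:.0%}",
    }
```

The important part is that every anomaly gets an owner and a priority automatically, so findings do not sit unresolved and turn into the hidden fault risks described earlier.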
Incident response includes one‑click group creation and external calls to broadcast fault notices, continuous updates, and post‑mortem data entry into a unified system for analysis and improvement.
7. Drills and Practice Platform
KuJiaLe designs realistic fault scenarios, injects them into the environment, conducts weekly unannounced "raid" drills, scores team performance, publishes the results, and holds award ceremonies. The drill process covers preparation, fault injection, execution, and post‑mortem; a simple scoring sketch follows below.
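The scenario fields, deadlines, and 100‑point rubric below are illustrative assumptions rather than KuJiaLe's drill platform, but they show the basic shape: define what fault is injected where, set detection and recovery deadlines, and score the responding team against them.

```python
from dataclasses import dataclass


@dataclass
class DrillScenario:
    name: str
    fault: str               # what gets injected
    target: str              # which service or environment
    detect_deadline_s: int   # expected time to detect the fault
    recover_deadline_s: int  # expected time to recover service


def score_drill(scenario: DrillScenario,
                detected_after_s: int,
                recovered_after_s: int) -> int:
    """Simple 100-point rubric: 50 points each for detection and recovery speed."""
    def points(actual_s: int, deadline_s: int) -> int:
        if actual_s <= deadline_s:
            return 50
        # Lose one point per minute over the deadline, floored at zero.
        return max(0, 50 - (actual_s - deadline_s) // 60)

    return (points(detected_after_s, scenario.detect_deadline_s)
            + points(recovered_after_s, scenario.recover_deadline_s))


# Example "raid" drill: inject extra latency into a downstream dependency.
scenario = DrillScenario(
    name="order-service slow dependency",
    fault="inject 800ms latency into the payment RPC",
    target="staging cluster",
    detect_deadline_s=5 * 60,
    recover_deadline_s=30 * 60,
)
print(score_drill(scenario, detected_after_s=4 * 60, recovered_after_s=22 * 60))  # 100
```

Publishing these scores alongside the award ceremonies keeps drills competitive and makes emergency‑response speed a visible, comparable number rather than an impression.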
8. Results and Value
After implementing the systematic governance, high‑priority fault recovery time decreased by 30%, high‑priority alert accuracy exceeded 90%, inspection discovered and resolved over 100 issues, improvement measure completion reached 99%, and more than 90% of change systems were integrated into change control.
Other companies can learn from KuJiaLe’s experience in:
Organizational management with top‑down emphasis
Process construction with clear owners
Cultural building and atmosphere creation
Using processes to guide system construction
Continuous construction to make stability a daily habit