How to Build Robust Online Stability: Practices, Metrics, and Team Strategies
This article outlines a comprehensive approach to online stability, covering preventive measures, service governance, capacity planning, incident detection, multi‑dimensional monitoring, alerting, R&D efficiency improvements, team building, and practical guidelines for simplifying, standardizing, automating, and scaling stability initiatives across an organization.
Online Stability Practices
The article focuses on practical online stability construction, covering stability assurance measures, R&D efficiency improvement, and team building, and emphasizes simplifying, standardizing, process-ifying, and automating complex tasks.
1. Online Stability Assurance Measures
1.1 Incident Prevention
1.1.1 Operations Foundations
a. Upstream/Downstream Machine Configuration Balance
b. Load Balancing
RPC traffic: group by same city; latency-sensitive services may optionally pin to the same rack.
DB traffic: isolate MySQL, Redis, and MQ by city; route high-traffic workloads to the same rack where possible.
c. Machine Utilization – Manage over‑utilized applications and scale promptly.
d. Elastic Scaling (K8s) – Mandatory for core services
e. Monitoring Mechanism – Dashboard reports for basic data; crawler-collected data aggregated into reports.
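For the mandatory elastic scaling of core services (item d), a Kubernetes HorizontalPodAutoscaler is one common setup. The sketch below is illustrative only: the service name, replica bounds, and target utilization are assumptions, not values from the article.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: core-service-hpa        # hypothetical service name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: core-service
  minReplicas: 4                # safety floor for a core service
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # scale out before saturation
```

Scaling on QPS rather than CPU, as mentioned later in the article, would require a custom or external metric source.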
1.1.2 Service Governance
a. Provider Governance – Consumer-facing (C-end) interfaces must have rate limiting.
b. Dependency Governance – Core dependencies require circuit breaking and degradation (mandatory).
c. Resource Governance – Manage QPS and capacity for MQ and Redis.
d. DB Governance – Prohibit cross‑business references (mandatory).
e. Slow Query Governance
f. Risk Governance – Infrastructure and risk control initiatives.
g. Alert Governance – P0/P1 alerts must be followed up; unnecessary alerts should be tuned (mandatory).
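Rate limiting for consumer-facing interfaces is commonly implemented as a token bucket. The single-JVM sketch below is illustrative; production services would usually rely on a gateway or a mature library rather than hand-rolled code.

```java
// Minimal token-bucket rate limiter: a sketch of provider-side rate
// limiting for C-end interfaces. Capacity and refill rate are
// illustrative; production would use a gateway or a proven library.
public class TokenBucket {
    private final long capacity;        // max tokens the bucket holds
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false; // caller should reject or degrade the request
    }
}
```

Rejected requests pair naturally with the degradation paths required under dependency governance.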
1.1.3 System Capacity Estimation
a. Stress Testing (mandatory) – Tools: Trace, Mock, Shadow Table, etc.
b. Fault Drills (mandatory) – Tools: fault-drill platform; drills cover rate limiting, degradation, full GC, 100% CPU, service outage, alert SOPs, and business switchovers.
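Before stress testing, a back-of-envelope capacity estimate helps set targets. The formulas below are a common rule of thumb, and all input figures in the example are illustrative assumptions, not numbers from the article.

```java
// Back-of-envelope peak-QPS and machine-count estimate.
// Inputs (daily requests, peak factor, per-machine QPS) are assumptions.
public class CapacityEstimate {
    /** Average QPS assuming requests spread evenly over the day. */
    public static double avgQps(long dailyRequests) {
        return dailyRequests / 86_400.0;
    }

    /** Peak QPS via a simple peak-to-average multiplier. */
    public static double peakQps(long dailyRequests, double peakFactor) {
        return avgQps(dailyRequests) * peakFactor;
    }

    /** Machines needed, keeping headroom below per-machine capacity. */
    public static long machines(double peakQps, double qpsPerMachine,
                                double targetUtilization) {
        return (long) Math.ceil(peakQps / (qpsPerMachine * targetUtilization));
    }
}
```

Stress testing then validates whether the assumed per-machine QPS holds in practice.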
1.1.4 Business Review & Risk Inspection
Core service stability review: code inspection, risk control integration.
Inconsistent online/offline behavior: often rooted in excessive if-else branching; review and simplify.
Asset loss governance: idempotent interfaces, reconciliation.
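Idempotent interfaces for asset-loss prevention typically deduplicate by a client-supplied request ID: a replayed request returns the first result instead of re-running the side effect. A minimal in-memory sketch (a real system would persist the ID in a DB or Redis with a TTL):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Idempotent execution keyed by request ID: replays return the cached
// first result. In production the seen-set would live in Redis or a
// DB with a TTL, not in process memory.
public class IdempotentExecutor {
    private final Map<String, Object> results = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    public <T> T execute(String requestId, Supplier<T> action) {
        return (T) results.computeIfAbsent(requestId, id -> action.get());
    }
}
```

Reconciliation then acts as the backstop for anything the idempotency layer misses.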
2. Incident Detection & Investigation
2.1 Observability Principle
Traces, logs, and metrics are the three essential pillars of observability.
2.2 Tools
a. Trace – Cross-service tracing; automatically inject a trace ID when one is missing.
b. Log – Capture SLF4J logs, ship them to the log center, and query via ES + Kibana.
c. Metric – Backend metrics, dashboards, threshold and intelligent alerts, frontend metrics.
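Automatically adding a trace ID when one is missing can be sketched with a per-thread holder. This is a simplification: real systems propagate the ID via RPC/HTTP headers and expose it to logging through SLF4J's MDC.

```java
import java.util.UUID;

// Trace-ID holder: reuse the upstream ID or mint one if missing, so
// every log line and downstream call can be correlated. Real systems
// propagate this via RPC/HTTP headers and SLF4J's MDC.
public final class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    /** Use the incoming ID if present, otherwise generate one. */
    public static String ensure(String incomingId) {
        String id = (incomingId == null || incomingId.isEmpty())
                ? UUID.randomUUID().toString()
                : incomingId;
        TRACE_ID.set(id);
        return id;
    }

    public static String current() {
        return TRACE_ID.get();
    }

    public static void clear() {
        TRACE_ID.remove(); // avoid leaks on pooled threads
    }
}
```

Clearing at request end matters on thread pools, where a stale ID would otherwise attach to the next request.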
2.3 Multi‑Dimensional Monitoring & Alerting
a. Monitoring Pyramid
Infrastructure & middleware monitoring – automatic reporting.
Application monitoring – automatic reporting of QPS, response time, JVM info.
Business monitoring – manual instrumentation by backend developers.
User‑experience monitoring – manual instrumentation by frontend developers.
b. Alerting – Automatic alerts for platform, middleware, and application metrics; manual tuning required.
c. Dashboards – Aggregate multiple metrics and support simple derived calculations.
d. Stability Team Tasks – Standardize, automate, and ensure traceability between metrics and logs.
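A threshold alert in its simplest form fires only when a metric stays above its limit for N consecutive samples, which suppresses flapping and reduces the alert noise that alert governance targets. A sketch, with illustrative threshold and window values:

```java
// Consecutive-breach threshold alert: fire only after the metric has
// exceeded the threshold for N samples in a row, to suppress noise.
// Threshold and window size are illustrative tuning knobs.
public class ThresholdAlert {
    private final double threshold;
    private final int windowSize;
    private int consecutiveBreaches = 0;

    public ThresholdAlert(double threshold, int windowSize) {
        this.threshold = threshold;
        this.windowSize = windowSize;
    }

    /** Returns true when the alert should fire for this sample. */
    public boolean record(double value) {
        consecutiveBreaches = value > threshold ? consecutiveBreaches + 1 : 0;
        return consecutiveBreaches >= windowSize;
    }
}
```

Intelligent alerting replaces the fixed threshold with a learned baseline, but the debouncing idea carries over.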
3. R&D Efficiency
3.1 Project Management
Drive the full requirement lifecycle through a development-process management platform.
3.2 Efficiency Improvements
Build common utility libraries (JSON, time conversion, etc.).
Establish framework, module, layering, and coding standards.
Address local development challenges: container agents, downstream dependencies, compile environment consistency, OS differences.
Adopt hot‑deployment tools (e.g., JRebel).
Promote elastic scaling based on QPS, CPU utilization, etc.
Service refactoring: performance optimization, service merging, stateful‑to‑stateless conversion, log centralization, serverless trials.
4. Team Building
Weekly stability meetings to review alerts, logs, risks, and set TODOs.
Regular knowledge sharing sessions.
Exams on standards, SOPs, and online operation norms.
Permission control: new hires cannot deploy within the first N months; permissions granted after passing exams.
5. Core Principles
Simplify complex tasks.
Standardize simple tasks.
Process‑ify standardized tasks.
Automate processes.
6. Practice Sharing: Unit Test Construction
Before governance: unit tests were inconsistent, fragmented, and pipelines lacked test enforcement.
Simplify: break down tasks, introduce JUnit5, PowerMock, TestableMock.
Standardize: pipeline templates, test coverage targets (20 % → 40 % → 60 % → 80 %).
Process‑ify: create a test‑case repository, enable automatic pipeline triggers on push/PR.
Automate: pipeline automation, configuration automation, data collection automation for test metrics.
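A test in this setup targets small pure units. The article standardizes on JUnit5; the sketch below uses a plain assertion instead so it stays self-contained, and the class, method, and validation rule are hypothetical examples, not code from the article.

```java
// A unit-testable pure function. In the pipeline described above this
// would be exercised by a JUnit5 test class; the method and its
// validation rule are illustrative.
public class OrderUtils {
    /** Hypothetical helper: order total in cents, rejecting negatives. */
    public static long totalCents(long priceCents, int quantity) {
        if (priceCents < 0 || quantity < 0) {
            throw new IllegalArgumentException("negative input");
        }
        return priceCents * quantity;
    }
}
```

Pure functions like this are what make the 20% → 80% coverage ramp tractable, since they need no mocking at all.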
7. Driving Implementation
Self‑driven initiative: recognize stability work as a continuous system project, proactively seek solutions, and take on extra tasks such as metric building and script development.
Downward push: break down vague tasks, assign owners and timelines, establish SOPs, tools, and measurable standards.
Upward management: secure leader support, obtain necessary permissions, and gain trust for stability efforts.
Horizontal collaboration: coordinate with QA, SRE, and other stability groups, share best practices, and promote cross‑team improvements.
Conclusion
Stability work involves many services, teams, and case‑by‑case governance; a methodical, end‑to‑end approach is essential for effective planning and execution.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.