How to Build Robust Online Stability: Practices, Metrics, and Team Strategies
This article outlines a comprehensive approach to online stability, covering preventive measures, service governance, capacity planning, incident detection, multi‑dimensional monitoring, alerting, R&D efficiency improvements, team building, and practical guidelines for simplifying, standardizing, automating, and scaling stability initiatives across an organization.
Online Stability Practices
The article focuses on practical online stability construction, covering stability assurance measures, R&D efficiency improvement, and team building, and emphasizes simplifying, standardizing, process-ifying, and automating complex tasks.
1. Online Stability Assurance Measures
1.1 Incident Prevention
1.1.1 Operations Foundations
a. Upstream/Downstream Machine Configuration Balance
b. Load Balancing
RPC traffic: group by same city; latency-sensitive services may optionally pin to the same rack.
DB traffic: isolate MySQL, Redis, and MQ by city; route high-traffic workloads to the same rack where possible.
c. Machine Utilization – Manage over‑utilized applications and scale promptly.
d. Elastic Scaling (K8s) – Mandatory for core services
e. Monitoring Mechanism – Dashboard reports for basic data; crawler-collected data aggregated into reports.
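For the mandatory elastic scaling of core services (item d), a Kubernetes HorizontalPodAutoscaler is one common setup. The sketch below is illustrative only: the service name, replica bounds, and target utilization are assumptions, not values from the article.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: core-service-hpa        # hypothetical service name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: core-service
  minReplicas: 4                # safety floor for a core service
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # scale out before saturation
```

Scaling on QPS rather than CPU, as mentioned later in the article, would require a custom or external metric source.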
1.1.2 Service Governance
a. Provider Governance – Consumer-facing (C-end) interfaces must have rate limiting.
b. Dependency Governance – Core dependencies require circuit breaking and degradation (mandatory).
c. Resource Governance – Manage QPS and capacity for MQ and Redis.
d. DB Governance – Prohibit cross‑business references (mandatory).
e. Slow Query Governance
f. Risk Governance – Infrastructure and risk control initiatives.
g. Alert Governance – P0/P1 alerts must be followed up; unnecessary alerts should be tuned (mandatory).
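Rate limiting for consumer-facing interfaces is commonly implemented as a token bucket. The single-JVM sketch below is illustrative; production services would usually rely on a gateway or a mature library rather than hand-rolled code.

```java
// Minimal token-bucket rate limiter: a sketch of provider-side rate
// limiting for C-end interfaces. Capacity and refill rate are
// illustrative; production would use a gateway or a proven library.
public class TokenBucket {
    private final long capacity;        // max tokens the bucket holds
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false; // caller should reject or degrade the request
    }
}
```

Rejected requests pair naturally with the degradation paths required under dependency governance.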
1.1.3 System Capacity Estimation
a. Stress Testing (mandatory) – Tools: Trace, Mock, Shadow Table, etc.
b. Fault Drills (mandatory) – Tools: fault-drill platform; drills cover rate limiting, degradation, full GC, 100% CPU, service outage, alert SOPs, and business switchovers.
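Before stress testing, a back-of-envelope capacity estimate helps set targets. The formulas below are a common rule of thumb, and all input figures in the example are illustrative assumptions, not numbers from the article.

```java
// Back-of-envelope peak-QPS and machine-count estimate.
// Inputs (daily requests, peak factor, per-machine QPS) are assumptions.
public class CapacityEstimate {
    /** Average QPS assuming requests spread evenly over the day. */
    public static double avgQps(long dailyRequests) {
        return dailyRequests / 86_400.0;
    }

    /** Peak QPS via a simple peak-to-average multiplier. */
    public static double peakQps(long dailyRequests, double peakFactor) {
        return avgQps(dailyRequests) * peakFactor;
    }

    /** Machines needed, keeping headroom below per-machine capacity. */
    public static long machines(double peakQps, double qpsPerMachine,
                                double targetUtilization) {
        return (long) Math.ceil(peakQps / (qpsPerMachine * targetUtilization));
    }
}
```

Stress testing then validates whether the assumed per-machine QPS holds in practice.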
1.1.4 Business Review & Risk Inspection
Core service stability review: code inspection, risk control integration.
Inconsistent online/offline behavior: often rooted in excessive if-else branching; review and simplify.
Asset loss governance: idempotent interfaces, reconciliation.
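Idempotent interfaces for asset-loss prevention typically deduplicate by a client-supplied request ID: a replayed request returns the first result instead of re-running the side effect. A minimal in-memory sketch (a real system would persist the ID in a DB or Redis with a TTL):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Idempotent execution keyed by request ID: replays return the cached
// first result. In production the seen-set would live in Redis or a
// DB with a TTL, not in process memory.
public class IdempotentExecutor {
    private final Map<String, Object> results = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    public <T> T execute(String requestId, Supplier<T> action) {
        return (T) results.computeIfAbsent(requestId, id -> action.get());
    }
}
```

Reconciliation then acts as the backstop for anything the idempotency layer misses.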
2. Incident Detection & Investigation
2.1 Observability Principle
Traces, logs, and metrics are the three essential pillars of observability.
2.2 Tools
a. Trace – Cross-service tracing; automatically inject a trace ID when one is missing.
b. Log – Capture SLF4J logs, ship them to the log center, and query via ES + Kibana.
c. Metric – Backend metrics, dashboards, threshold and intelligent alerts, frontend metrics.
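Automatically adding a trace ID when one is missing can be sketched with a per-thread holder. This is a simplification: real systems propagate the ID via RPC/HTTP headers and expose it to logging through SLF4J's MDC.

```java
import java.util.UUID;

// Trace-ID holder: reuse the upstream ID or mint one if missing, so
// every log line and downstream call can be correlated. Real systems
// propagate this via RPC/HTTP headers and SLF4J's MDC.
public final class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    /** Use the incoming ID if present, otherwise generate one. */
    public static String ensure(String incomingId) {
        String id = (incomingId == null || incomingId.isEmpty())
                ? UUID.randomUUID().toString()
                : incomingId;
        TRACE_ID.set(id);
        return id;
    }

    public static String current() {
        return TRACE_ID.get();
    }

    public static void clear() {
        TRACE_ID.remove(); // avoid leaks on pooled threads
    }
}
```

Clearing at request end matters on thread pools, where a stale ID would otherwise attach to the next request.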
2.3 Multi‑Dimensional Monitoring & Alerting
a. Monitoring Pyramid
Infrastructure & middleware monitoring – automatic reporting.
Application monitoring – automatic reporting of QPS, response time, JVM info.
Business monitoring – manual instrumentation by backend developers.
User‑experience monitoring – manual instrumentation by frontend developers.
b. Alerting – Automatic alerts for platform, middleware, and application metrics; manual tuning required.
c. Dashboards – Aggregate multiple metrics and support simple derived calculations.
d. Stability Team Tasks – Standardize, automate, and ensure traceability between metrics and logs.
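A threshold alert in its simplest form fires only when a metric stays above its limit for N consecutive samples, which suppresses flapping and reduces the alert noise that alert governance targets. A sketch, with illustrative threshold and window values:

```java
// Consecutive-breach threshold alert: fire only after the metric has
// exceeded the threshold for N samples in a row, to suppress noise.
// Threshold and window size are illustrative tuning knobs.
public class ThresholdAlert {
    private final double threshold;
    private final int windowSize;
    private int consecutiveBreaches = 0;

    public ThresholdAlert(double threshold, int windowSize) {
        this.threshold = threshold;
        this.windowSize = windowSize;
    }

    /** Returns true when the alert should fire for this sample. */
    public boolean record(double value) {
        consecutiveBreaches = value > threshold ? consecutiveBreaches + 1 : 0;
        return consecutiveBreaches >= windowSize;
    }
}
```

Intelligent alerting replaces the fixed threshold with a learned baseline, but the debouncing idea carries over.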
3. R&D Efficiency
3.1 Project Management
Drive the full requirement lifecycle through a development-process management platform.
3.2 Efficiency Improvements
Build common utility libraries (JSON, time conversion, etc.).
Establish framework, module, layering, and coding standards.
Address local development challenges: container agents, downstream dependencies, compile environment consistency, OS differences.
Adopt hot‑deployment tools (e.g., JRebel).
Promote elastic scaling based on QPS, CPU utilization, etc.
Service refactoring: performance optimization, service merging, stateful‑to‑stateless conversion, log centralization, serverless trials.
4. Team Building
Weekly stability meetings to review alerts, logs, risks, and set TODOs.
Regular knowledge sharing sessions.
Exams on standards, SOPs, and online operation norms.
Permission control: new hires cannot deploy within the first N months; permissions granted after passing exams.
5. Core Principles
Simplify complex tasks.
Standardize simple tasks.
Process‑ify standardized tasks.
Automate processes.
6. Practice Sharing: Unit Test Construction
Before governance: unit tests were inconsistent, fragmented, and pipelines lacked test enforcement.
Simplify: break down tasks, introduce JUnit5, PowerMock, TestableMock.
Standardize: pipeline templates, test coverage targets (20 % → 40 % → 60 % → 80 %).
Process‑ify: create a test‑case repository, enable automatic pipeline triggers on push/PR.
Automate: pipeline automation, configuration automation, data collection automation for test metrics.
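A test in this setup targets small pure units. The article standardizes on JUnit5; the sketch below uses a plain assertion instead so it stays self-contained, and the class, method, and validation rule are hypothetical examples, not code from the article.

```java
// A unit-testable pure function. In the pipeline described above this
// would be exercised by a JUnit5 test class; the method and its
// validation rule are illustrative.
public class OrderUtils {
    /** Hypothetical helper: order total in cents, rejecting negatives. */
    public static long totalCents(long priceCents, int quantity) {
        if (priceCents < 0 || quantity < 0) {
            throw new IllegalArgumentException("negative input");
        }
        return priceCents * quantity;
    }
}
```

Pure functions like this are what make the 20% → 80% coverage ramp tractable, since they need no mocking at all.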
7. Driving Implementation
Self‑driven initiative: recognize stability work as a continuous system project, proactively seek solutions, and take on extra tasks such as metric building and script development.
Downward push: break down vague tasks, assign owners and timelines, establish SOPs, tools, and measurable standards.
Upward management: secure leader support, obtain necessary permissions, and gain trust for stability efforts.
Horizontal collaboration: coordinate with QA, SRE, and other stability groups, share best practices, and promote cross‑team improvements.
Conclusion
Stability work involves many services, teams, and case‑by‑case governance; a methodical, end‑to‑end approach is essential for effective planning and execution.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.