System Stability Practices: From Development to Production
This article outlines comprehensive system stability strategies for backend development, covering technical design reviews, key reliability techniques such as rate limiting, circuit breaking, timeout handling, isolation, and deployment safeguards like monitoring, gray releases, and rollback, aiming to reduce incidents and improve operational resilience.
Background
System stability is essential for high‑availability, high‑performance, and high‑concurrency services, especially during peak periods such as JD's 618 promotion. Stability must be addressed throughout the entire development lifecycle—from requirement gathering to operations.
Development Phase
Key deliverables include technical design documents and code. Emphasis is placed on thorough design reviews involving architects, developers, testers, and product owners to ensure alignment with requirements and best practices.
Technical Plan Highlights
Rate Limiting : Protect services from traffic spikes using algorithms like token bucket, leaky bucket, sliding window, etc.
Circuit Breaking & Degradation : Prevent downstream failures from cascading, with both automatic and manual degradation strategies.
Timeouts : Configure sensible timeout values based on TP99 measurements and follow a funnel principle across service call chains.
Retries : Limit retry attempts, consider idempotency for write operations, and avoid retry storms.
Compatibility : Ensure forward and backward compatibility to avoid data loss during rollbacks.
Isolation : Apply isolation at system, environment, data, core‑non‑core, read‑write, and thread‑pool levels to contain failures.
Code Review
Establish a consistent team coding style, focus reviews on style, performance, and security, practice pair programming, limit the amount of code per review, and maintain an open mindset for continuous improvement.
Deployment Phase
Deployments are high‑risk periods; failures often stem from code changes, database migrations, or configuration updates. Adopt the three‑pillars of reliable releases: monitoring, gray‑release, and rollback.
Monitoring
Define business and technical metrics (availability, TP99, request volume, CPU, memory, etc.) and set alert thresholds with a “strict‑then‑relaxed” approach.
Gray Release
Gradually roll out changes across machines, data centers, regions, or user groups to limit blast radius.
Rollback
Prefer quick rollback via feature flags; if necessary, perform versioned code or data rollbacks, ensuring forward compatibility to avoid post‑rollback issues.
Online Issue Management
Identify problems early through self‑awareness, monitoring alerts, and business feedback. Respond by preserving the incident context, providing information, restoring service quickly (rollback, restart, scaling, degradation), confirming recovery, and communicating to stakeholders.
Problem Lifecycle
From detection to resolution, then post‑mortem analysis using knowledge, tools, and methods (e.g., 5‑Why analysis) to prevent recurrence.
References
https://itrevolution.com/articles/20-years-of-google-sre-10-key-lessons-for-reliability/
https://learn.microsoft.com/en-us/previous-versions/msp-n-p/jj591573(v=pandp.10)
https://sre.google/books/
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.