
Stability Guidelines and Anti‑Patterns for Backend Services

Drawing on five years of incident reviews, the article defines a comprehensive stability framework for backend services: timeout hierarchies, weak dependencies, service-discovery integration, staged gray releases, robust monitoring, capacity planning, and strict change management. It also catalogues common anti-patterns, such as over-aggressive circuit breaking, static retry orders, improper timeouts, tight coupling, and insufficient isolation, and urges teams to rehearse these practices regularly.

Didi Tech

This article presents a set of stability‑related specifications derived from five years of real‑world incident reviews at the Shunfengche (Ride‑Sharing) service team. The goal is to help improve service stability by sharing proven engineering practices.

The backend team, as the largest engineering group in the technology department, increasingly relies on process standards to boost delivery quality and efficiency. The document outlines an executable, minimally restrictive engineering specification covering development processes, stability, performance, cost, and more.

1. Terminology

Key terms are defined to aid understanding:

Service Tiering : Primary services (Tier‑1) directly affect core business metrics (e.g., order volume) and must be prioritized first when issues arise.

Preview Cluster : An environment identical to production but without external traffic, used for internal testing.

Small‑Traffic Cluster : Similar to production but only receives traffic from selected cities, ensuring traffic isolation.

Gray Release : A staged rollout process (preview, gray city, 10%, 50%, 100% traffic) to ensure safe deployment.

Full‑Link Stress Test : Non‑intrusive pressure testing of the production environment to identify capacity limits and bottlenecks.

Multi‑Active Data Centers : Deployments across multiple data centers to enable rapid traffic switchover during failures.

2. Stability Specifications

Design

Mandatory : Callers must set timeouts, with downstream timeouts decreasing along the call chain (recommended values are shown in the accompanying diagram).
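A minimal sketch of how such a timeout hierarchy can be enforced in practice: each hop derives its downstream timeout from the remaining budget minus a safety margin, so timeouts shrink monotonically along the call chain. The margin value and function names here are illustrative assumptions, not from the original specification.

```python
# Illustrative timeout-budget propagation: each hop passes its downstream
# a timeout strictly smaller than its own, reserving a margin for local work.
HOP_MARGIN_MS = 50  # assumed per-hop reserve for local work and network overhead

def downstream_timeout(remaining_budget_ms: int) -> int:
    """Timeout to hand to the next hop: remaining budget minus the margin."""
    return max(remaining_budget_ms - HOP_MARGIN_MS, 0)

# A 3-hop chain starting from a 1000 ms entry-point timeout:
budget = 1000
chain = []
for hop in range(3):
    budget = downstream_timeout(budget)
    chain.append(budget)

print(chain)  # timeouts decrease hop by hop: [950, 900, 850]
```

This guarantees a downstream call can never outlive its caller's deadline, which is the property the decreasing-timeout rule is meant to protect.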

Mandatory : New core‑process dependencies default to weak dependencies; strong dependencies require review approval.

Mandatory : Downstream services offering service discovery must be accessed via that mechanism to control node selection and timeouts.

Mandatory : All internal services must integrate with service discovery; external services should be encouraged to do the same.

Recommended : Frameworks should support one‑click circuit breaking for dependent services.

Recommended : Prefer stateless service designs.

Recommended : Design interfaces to be idempotent and guard against re‑entrancy.

Recommended : Keep system design simple and rely on mature technologies.

Recommended : Set appropriate rate‑limit configurations for core services.
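A rate limit for a core service is often implemented as a token bucket: requests beyond the sustained rate plus a burst allowance are shed. This is a generic sketch with assumed parameters, not the specific limiter the team uses.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative; values are assumptions)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, capacity=5)
results = [bucket.allow() for _ in range(8)]
# the 5-request burst passes; subsequent requests are shed until tokens refill
```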

Deployment & Operations

Prohibit direct manipulation of online data via ad‑hoc scripts; any such changes must go through QA testing first.

All service releases must pass through the release platform and integrate with the quality platform (including automated test cases, core metric charts, and checklists).

Tier‑1 services must include preview and small‑traffic clusters and be deployed across two data centers.

Non‑Tier‑1 services are recommended to include a preview cluster.

Capacity planning for new services should involve interface stress testing or full‑traffic testing to verify module capacity.

Monitoring & Alerts

All service machines must have basic monitoring (CPU, I/O, memory, disk, coredump, ports).

Basic service monitoring must include QPS, fatal error count, and latency.

Core business metrics (order volume, pickup volume, payment volume, etc.) must be monitored and alerted.

A comprehensive dashboard should cover core modules for rapid issue localization.

Change Management

All Tier‑1 service changes must follow the gray‑release process.

All Tier‑1 changes (code or configuration) must have a rollback plan for quick recovery.

Avoid “piggyback” deployments, in which unrelated code changes ride along with a release.

Rollback should include both code and configuration to maintain consistency.

Complex configuration changes should include validation mechanisms.

3. Stability Anti‑Patterns

The following anti‑patterns are illustrated with real incidents and suggested solutions:

Excessive Node Circuit‑Breaker Strategy : Over‑aggressive circuit breaking can concentrate traffic on the few remaining nodes and trigger cascade failures. Solution: Cap the share of nodes the breaker may eject at once so it can never remove most of the fleet.

Fixed Retry Sequence : Using a static retry order can lead to overload on fallback nodes. Solution: Consider random or adaptive retry algorithms.
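A random retry order, as the solution suggests, can be sketched like this: instead of every caller falling back to the same next node in a fixed list, each caller shuffles the surviving candidates so retry load spreads evenly. The helper is hypothetical; the seeded generator is only for reproducibility.

```python
import random

# Illustrative randomized retry order: retries from many callers are spread
# across surviving nodes instead of piling onto one fixed fallback.
def retry_order(nodes: list[str], failed: str, rng: random.Random) -> list[str]:
    candidates = [n for n in nodes if n != failed]  # skip the node that just failed
    rng.shuffle(candidates)                         # random order spreads retry load
    return candidates

rng = random.Random(7)  # seeded only so the example is reproducible
order = retry_order(["a", "b", "c", "d"], failed="a", rng=rng)
```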

Unreasonable Timeout Settings : Improper downstream timeout values can drag down upstream services. Solution: Set timeouts based on the 99th‑percentile latency of the call chain.

Ignoring Multiple Downstream Calls in a Single Request : Serial calls to the same downstream service can amplify failures. Solution: Account for cumulative downstream latency and timeout.
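The arithmetic behind this anti-pattern is simple but easy to miss: a handler making N serial calls must budget for the sum of the per-call timeouts, not a single call. A minimal sketch, with assumed numbers:

```python
# Illustrative budget check: N serial calls to the same downstream need a
# timeout budget of N * per-call timeout, plus the handler's own local work.
def required_budget_ms(per_call_timeout_ms: int, serial_calls: int,
                       local_work_ms: int = 20) -> int:
    return per_call_timeout_ms * serial_calls + local_work_ms

# 3 serial calls with a 200 ms per-call timeout need at least 620 ms upstream;
# an upstream timeout of 500 ms would expire before the handler could finish.
budget = required_budget_ms(200, 3)
print(budget)  # 620
```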

Unreasonable Retry Logic : Multiple retries across the call chain amplify failures. Solution: Consolidate retry logic to a single, well‑controlled point.
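Why layered retries amplify failures is worth making concrete: if every layer of an N-layer chain retries R times, the innermost service can see (R+1)^N attempts for a single user request. A small worked sketch:

```python
# Worst-case retry amplification: each layer makes (retries + 1) attempts,
# and attempts compound multiplicatively down the chain.
def worst_case_attempts(layers: int, retries_per_layer: int) -> int:
    return (retries_per_layer + 1) ** layers  # original try + retries, per layer

# Three layers each retrying twice -> up to 27 downstream attempts,
# versus 3 when a single, well-controlled layer owns the retry logic.
amplified = worst_case_attempts(3, 2)
single_point = worst_case_attempts(1, 2)
print(amplified, single_point)  # 27 3
```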

Not Weak‑Linking Non‑Core Processes : Tight coupling makes the system fragile. Solution: Identify and weaken dependencies for non‑essential flows.

ID Overflow : Using limited‑size IDs can cause overflow errors. Solution: Choose appropriate ID types and review for overflow risk.
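The overflow failure mode can be demonstrated directly: a signed 32-bit ID wraps to a negative value once it passes 2^31 - 1, whereas a 64-bit type leaves ample headroom for realistic volumes. This sketch uses `ctypes` only to emulate fixed-width integer behavior.

```python
import ctypes

# Emulate a signed 32-bit ID column: values past 2**31 - 1 wrap negative.
def as_int32(value: int) -> int:
    return ctypes.c_int32(value).value

max_i32 = 2**31 - 1
assert as_int32(max_i32) == max_i32   # still fits
assert as_int32(max_i32 + 1) < 0      # wraps negative on overflow
```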

Deployment Without Network Segment Consideration : Deploying all instances on the same switch leads to single‑point failures. Solution: Distribute instances across different racks, switches, and data centers.

Resource Isolation Failure in Co‑Location : High CPU usage by one service can affect others. Solution: Enforce resource isolation for co‑located services.

Lack of Core Business Isolation : Faulty non‑core processes can bring down core services. Solution: Separate MQ clusters for core and non‑core workflows.

Insufficient Capacity Planning : Deploying too few instances can cause overload when one fails. Solution: Reserve buffer capacity and consider elastic cloud resources.

Improper Change Management : Piggyback releases (unrelated code shipped with a release), missing rollback code, excessive concurrent deployments, and lack of testing lead to incidents. Solution: Enforce strict code review, rollback procedures, controlled deployment concurrency, and comprehensive testing.

Missing or Stale Monitoring : Lack of basic or business monitoring delays fault detection. Solution: Maintain monitoring checklists and regularly review alerts.

Inadequate Incident Response : Not prioritizing fault handling, missing rollback, or lacking post‑mortem processes. Solution: Treat incidents with highest priority, perform immediate rollback, and follow a documented post‑mortem workflow.

4. Team Introduction

The article concludes with a reminder to regularly review and rehearse all stability measures, monitoring setups, and incident response plans to ensure they remain effective.

Written by

Didi Tech

Official Didi technology account
