Stability Engineering Practices for Baidu's DuoliXiong Local Service Platform
This article details the stability engineering approach of Baidu's DuoliXiong local service platform, covering business challenges, construction philosophy, solution design, technical review, coding standards, deployment workflow, problem handling, eventual consistency, idempotency, monitoring, and future operational planning.
DuoliXiong, Baidu's local life service platform, provides a SaaS solution for merchants and users to discover and purchase local services.
The platform faces rapid growth of micro‑services (user, product, order, merchant, coupon, payment, etc.), long internal dependency chains, numerous external service integrations, and short iteration cycles, all of which create new stability challenges.
Stability construction is approached from three dimensions (technical standards, business standards, and micro-service design), with emphasis on monitoring, alerting, fault tolerance, automation, and quality to improve reliability and user experience.
Solution design includes versioned documentation, development specifications, project background, technical architecture, interface contracts, storage design, compatibility considerations, monitoring & alarm rules, and release documentation.
Technical review defines the review scope, gathers design documents, assigns reviewers, establishes entry criteria, and conducts periodic architecture reviews to confirm that each design is sound before development begins.
Coding standards and code review aim to safeguard code quality, improve development efficiency and team collaboration, reduce communication cost, and strengthen service stability.
The deployment process follows a closed loop: solution design → technical review → development → code review (CR) → testing → release → post-release monitoring and problem handling, with defined release windows, pre-release checks, and rollback procedures.
Problem handling follows the principle of "announce, stop the loss, then investigate": online issues take priority, fixes are made on separate bug-fix branches, and escalation responsibilities are clearly assigned.
The stability closed‑loop is illustrated in the diagram below:
For eventual consistency, the platform adopts an asynchronous call model with a local message table to guarantee data consistency across services.
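The local message table idea can be sketched in a few lines of Java. This is a minimal in-memory illustration, not the platform's implementation: the class `LocalMessageTableSketch`, its method names, and the `ORDER_CREATED` payload are all hypothetical, and a `synchronized` method stands in for the single database transaction that would cover both the business row and the message row in a real service.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** In-memory stand-in for one database holding business rows and the local message table. */
class LocalMessageTableSketch {
    enum Status { PENDING, SENT }

    static final class Message {
        final String id;
        final String payload;
        Status status = Status.PENDING;
        int attempts = 0;

        Message(String id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    private final Map<String, String> orders = new LinkedHashMap<>();
    private final Map<String, Message> messages = new LinkedHashMap<>();

    /** Step 1: write the order and its message together (one DB transaction in a real system). */
    synchronized void createOrder(String orderId, String item) {
        orders.put(orderId, item);
        String messageId = "msg-" + orderId;
        messages.put(messageId, new Message(messageId, "ORDER_CREATED:" + orderId));
    }

    /**
     * Step 2: a relay job offers every pending message to the downstream service.
     * A message is marked SENT only after the downstream call reports success,
     * so failed deliveries stay PENDING and are retried on the next run.
     */
    synchronized int relayPending(Predicate<String> downstream) {
        int delivered = 0;
        for (Message m : messages.values()) {
            if (m.status != Status.PENDING) {
                continue;
            }
            m.attempts++;
            boolean ok;
            try {
                ok = downstream.test(m.payload);
            } catch (RuntimeException e) {
                ok = false; // treat an exception as a failed delivery
            }
            if (ok) {
                m.status = Status.SENT;
                delivered++;
            }
        }
        return delivered;
    }

    /** Number of messages still waiting for a successful delivery. */
    synchronized long pendingCount() {
        return messages.values().stream().filter(m -> m.status == Status.PENDING).count();
    }
}
```

Because a message may be delivered more than once (for example, when the downstream call succeeds but the status update is lost), this pattern gives at-least-once delivery, and the receiving service needs to handle repeated messages safely.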
Idempotency is achieved by generating a globally unique ID (e.g., UUID or Snowflake) and using an anti‑duplicate table to ensure repeat requests do not alter business state.
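A minimal sketch of the anti-duplicate check follows, under stated assumptions: the class `IdempotentCreditSketch` and its methods are hypothetical, a `ConcurrentHashMap` stands in for the anti-duplicate table (in a database this would typically be a table with a unique index on the request ID), and a UUID is used as the globally unique ID.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class IdempotentCreditSketch {
    // Stand-in for the anti-duplicate table: the request ID is the unique key,
    // and the stored value is the result of the first execution.
    private final ConcurrentHashMap<String, String> dedupTable = new ConcurrentHashMap<>();
    private final AtomicLong balanceCents = new AtomicLong();

    /** The caller generates one ID per logical request and reuses it on every retry. */
    static String newRequestId() {
        return UUID.randomUUID().toString();
    }

    /**
     * Applies the credit at most once per request ID.
     * computeIfAbsent runs the business action only when the ID is not yet recorded;
     * a repeated request returns the stored result and leaves the balance untouched.
     */
    String credit(String requestId, long amountCents) {
        return dedupTable.computeIfAbsent(requestId, id -> {
            long after = balanceCents.addAndGet(amountCents);
            return "OK balance=" + after;
        });
    }

    long balanceCents() {
        return balanceCents.get();
    }
}
```

A client would call `newRequestId()` once, then pass the same ID on every retry of that request. In a database-backed version, the insert into the anti-duplicate table and the business update would normally share one transaction so that neither is applied without the other.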
Monitoring and alerting leverage Kubernetes, Prometheus, Grafana, Trace, Tianyan, and Actuator. The monitoring‑alert workflow is shown below:
Future planning includes automated scaling based on custom Prometheus metrics and intelligent fault tolerance for core business flows (order, payment, verification) and dependent services (Redis, MQ), so that performance and reliability hold up as the platform grows.
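The article does not say which fault-tolerance mechanism is planned. One common option for protecting calls to a dependency such as Redis or MQ is a circuit breaker with a fallback, sketched below as an assumption rather than as the platform's design. The class `CircuitBreakerSketch`, its thresholds, and the explicit `nowMillis` parameter (used to keep the example deterministic) are all hypothetical.

```java
import java.util.function.Supplier;

class CircuitBreakerSketch {
    private final int failureThreshold; // consecutive failures before the breaker opens
    private final long openMillis;      // how long calls are short-circuited once open
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the breaker has not been tripped

    CircuitBreakerSketch(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** True while the cool-down period after tripping has not yet elapsed. */
    synchronized boolean isOpen(long nowMillis) {
        return openedAt >= 0 && nowMillis - openedAt < openMillis;
    }

    /**
     * Calls the primary dependency unless the breaker is open.
     * While open, the fallback is returned without touching the dependency.
     * After the cool-down, the next call is a trial: success closes the breaker,
     * failure re-opens it.
     */
    synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback, long nowMillis) {
        if (isOpen(nowMillis)) {
            return fallback.get();
        }
        try {
            T result = primary.get();
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = nowMillis;
            }
            return fallback.get();
        }
    }
}
```

In production code the clock would come from `System.currentTimeMillis()` or a similar source, and the fallback might serve a cached value or a degraded response for the core flow.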