
Data Stability Construction and Fault Governance Practices at Didi Customer Service

Didi’s multi-year data-stability program for its customer-service platform progressed through fault-centered engineering, business-aligned cross-team work, and capability normalization. It instituted pre-, mid-, and post-fault safeguards, clear ownership, and automated alerts and repair tools, cutting the fault count by 42% and improving mean time to repair by 134%, while strengthening team communication and satisfaction.

The article describes Didi’s multi‑year effort to build data stability for its customer‑service business, emphasizing that metrics are the core lever of strong operations.

It outlines a three‑stage stability construction: (1) fault‑centered stability engineering (pre‑, during, post‑fault capabilities) that greatly reduced fault count and duration; (2) business‑centered stability work that formed cross‑organizational teams to address business‑technical alignment issues; (3) capability normalization that expanded stability work to include security, compliance, cost‑efficiency, and sustainable automation.

Data stability is positioned as the second-stage work: it focuses on lagging indicators (resolution rate, close rate, escalation rate, satisfaction, service quality), while real-time indicators (incoming volume, queue volume, connection rate, reach rate) were already secured in the first stage.

The core principle is that the impact of a data fault is dominated by time: how much data is affected and how long the repair takes. Safeguards are therefore required in the pre-, mid-, and post-fault phases to ensure quick detection, localization, and recovery.

Pre‑fault safeguards include developers assessing impact on ODS and key metrics before data operations or schema changes.
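For concreteness, here is a minimal sketch of what such a pre-change impact check could look like. The lineage map, table names, and metric names are hypothetical placeholders; in practice the mapping would come from the warehouse's metadata or lineage service rather than a hard-coded dictionary.

```python
# Hypothetical lineage map: (ods_table, column) -> key metrics derived from it.
LINEAGE = {
    ("ods_ticket", "close_time"): ["resolution_rate", "close_rate"],
    ("ods_ticket", "escalated_flag"): ["escalation_rate"],
    ("ods_survey", "score"): ["satisfaction"],
}

def impacted_metrics(table: str, columns: list[str]) -> set[str]:
    """Return the key metrics a planned change to these columns could affect."""
    hits = set()
    for col in columns:
        hits.update(LINEAGE.get((table, col), []))
    return hits

if __name__ == "__main__":
    # A developer about to alter ods_ticket.close_time checks the blast radius first.
    print(impacted_metrics("ods_ticket", ["close_time"]))  # {'resolution_rate', 'close_rate'}
```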

Mid‑fault safeguards involve fine‑grained monitoring and alerting on critical table/column fields (null checks, type changes, DDL, key‑metric YoY/QoQ, timestamp format, cross‑system RPC consistency, historical reconciliation, regex content rules) enabled via BCP + low‑code platform and binlog listening.
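A minimal sketch of a few of the rule types listed above (null-rate check, key-metric year-over-year deviation, regex content check). The thresholds, field names, and alert channel are assumptions; the article only names the rule categories and the BCP/low-code tooling and binlog listening that drive them.

```python
import re
from datetime import date

def null_rate(rows, field):
    """Fraction of rows where the field is missing."""
    return sum(1 for r in rows if r.get(field) is None) / max(len(rows), 1)

def yoy_deviation(today_value, last_year_value):
    """Relative deviation of a key metric against the same day last year."""
    if last_year_value == 0:
        return float("inf")
    return abs(today_value - last_year_value) / last_year_value

def regex_violations(rows, field, pattern):
    """Rows whose field value does not match the expected content pattern."""
    rx = re.compile(pattern)
    return [r for r in rows if r.get(field) is not None and not rx.fullmatch(str(r[field]))]

def alert(msg):
    # Placeholder for the on-call channel (chat bot, phone, etc.).
    print(f"[DATA ALERT {date.today()}] {msg}")

rows = [
    {"ticket_id": "T1", "phone": "13800000000", "close_time": "2024-05-01 10:00:00"},
    {"ticket_id": "T2", "phone": "bad-value", "close_time": None},
]

if null_rate(rows, "close_time") > 0.01:
    alert("close_time null rate above threshold")
if yoy_deviation(today_value=0.82, last_year_value=0.93) > 0.10:
    alert("resolution_rate deviates >10% year over year")
if regex_violations(rows, "phone", r"1\d{10}"):
    alert("phone column contains malformed values")
```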

Post‑fault safeguards provide reusable fix tools, redundant data, and repair scripts to accelerate data back‑fill and reduce rework.
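A minimal, hypothetical sketch of a reusable back-fill driver in that spirit: given the fault window, recompute each affected daily partition from redundant upstream data. rebuild_partition() stands in for whatever repair script or recomputation job actually exists for the table.

```python
from datetime import date, timedelta

def rebuild_partition(table: str, day: date) -> None:
    # In practice: read the redundant raw copy, recompute, and overwrite the
    # partition atomically so the fix is idempotent and safe to re-run.
    print(f"rebuilding {table} partition dt={day:%Y-%m-%d}")

def backfill(table: str, start: date, end: date) -> None:
    """Re-run the repair for every daily partition in the fault window."""
    day = start
    while day <= end:
        rebuild_partition(table, day)
        day += timedelta(days=1)

if __name__ == "__main__":
    backfill("dwd_ticket_daily", date(2024, 5, 1), date(2024, 5, 3))
```

Driving the repair through one idempotent, per-partition function is what makes the tooling reusable across faults instead of a one-off SQL patch each time.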

The article stresses the need for clear data‑fault grading standards, defining three metric categories (OKR, settlement, ordinary) and linking fault severity to detection‑to‑repair time.
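To illustrate the shape of such a standard, here is a small sketch that maps metric category and detection-to-repair time to a severity level. The category names follow the article (OKR, settlement, ordinary); the hour thresholds and P-levels are hypothetical placeholders, not Didi's actual grading table.

```python
# (max hours from detection to repair, resulting severity), evaluated in order.
THRESHOLDS = {
    "okr":        [(4, "P3"), (24, "P2")],   # beyond 24h on an OKR metric -> P1
    "settlement": [(8, "P3"), (48, "P2")],
    "ordinary":   [(24, "P4"), (72, "P3")],
}

def grade(category: str, hours_to_repair: float) -> str:
    """Longer repair time on a more critical metric yields a higher severity."""
    for limit, level in THRESHOLDS[category]:
        if hours_to_repair <= limit:
            return level
    return "P1" if category != "ordinary" else "P2"

if __name__ == "__main__":
    print(grade("okr", 2))          # P3: critical metric, but fixed within 4 hours
    print(grade("settlement", 60))  # P1: settlement data broken for more than 48 hours
```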

Responsibility mapping links ODS tables to owners, reducing fault‑owner identification from months to same‑day, and introduces automated change‑notification bots.
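A minimal sketch of what the ownership map plus change-notification bot could look like. The owner names and webhook URL are hypothetical; the real bot posts to the team's internal chat tool whenever a schema or data change touches an owned ODS table.

```python
import json
import urllib.request

# Hypothetical table-to-owner registry.
OWNERS = {
    "ods_ticket": "alice",
    "ods_survey": "bob",
}

WEBHOOK = "https://chat.example.com/hooks/data-change"  # hypothetical endpoint

def notify_change(table: str, change: str) -> None:
    """Ping the table owner (or on-call fallback) about an upcoming change."""
    owner = OWNERS.get(table, "data-oncall")
    payload = {"text": f"@{owner} schema/data change on {table}: {change}"}
    req = urllib.request.Request(
        WEBHOOK,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget notification

# Example (requires a real webhook endpoint):
# notify_change("ods_ticket", "ALTER TABLE ... ADD COLUMN refund_reason")
```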

Monitoring gaps are addressed by extending system-owner alerts to cover data accuracy, aiming for T+1 detection latency, and adding the alert rules listed above.
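A minimal sketch of a T+1 accuracy check in that spirit: reconcile yesterday's warehouse aggregate against a count recomputed from the source system. The two fetch functions and the tolerance are hypothetical stand-ins for the real queries and thresholds.

```python
from datetime import date, timedelta

def warehouse_count(table: str, day: date) -> int:
    return 10412   # stand-in for: SELECT COUNT(*) FROM table WHERE dt = day

def source_count(system: str, day: date) -> int:
    return 10420   # stand-in for an RPC / OLTP query against the source of record

def reconcile(table: str, system: str, tolerance: float = 0.001) -> bool:
    """T+1 check: compare yesterday's partition against the source of record."""
    day = date.today() - timedelta(days=1)
    wh, src = warehouse_count(table, day), source_count(system, day)
    drift = abs(wh - src) / max(src, 1)
    if drift > tolerance:
        print(f"[T+1 ALERT] {table} dt={day}: warehouse={wh} source={src} drift={drift:.2%}")
        return False
    return True

if __name__ == "__main__":
    reconcile("dwd_ticket_daily", "ticket_service")
```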

A formal SOP for fault handling, supported by internal‑chat bots, guides mid‑ and post‑fault actions to prevent wrong fixes and rework.

Tool-building efforts focus on hardening the data-collection pipeline (logbook bus), data replay for rapid repair, and reusable repair scripts.
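A minimal, hypothetical sketch of data replay: re-read archived raw events for the affected window from the collection bus's backup and feed them back through the normal processing path, so the fix reuses production logic instead of ad-hoc SQL. read_archive() and process_event() are assumptions standing in for the real bus reader and downstream write path.

```python
from datetime import datetime
from typing import Iterable

def read_archive(topic: str, start: datetime, end: datetime) -> Iterable[dict]:
    # Stand-in for reading the bus's archived raw log for the fault window.
    yield {"ticket_id": "T1", "event": "closed", "ts": "2024-05-01T10:00:00"}

def process_event(event: dict) -> None:
    # Stand-in for the normal downstream processing / warehouse write path.
    print(f"re-applied {event['event']} for {event['ticket_id']}")

def replay(topic: str, start: datetime, end: datetime) -> int:
    """Replay archived events through the regular pipeline; return the count."""
    n = 0
    for event in read_archive(topic, start, end):
        process_event(event)
        n += 1
    return n

if __name__ == "__main__":
    replay("ticket_events", datetime(2024, 5, 1), datetime(2024, 5, 2))
```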

Results show a 42% reduction in fault count and a 134% improvement in mean time to repair, plus hidden benefits such as clearer responsibility boundaries, improved communication, and higher data‑team satisfaction.

The conclusion highlights the distinct nature of data faults versus system faults: stop‑loss is only the beginning, and true work lies in data back‑fill, which benefits from standardized processes, automation, redundancy, and a “defensive‑programming” mindset.

Tags: automation, Data Warehouse, incident response, Data Reliability, Data Stability, fault governance, metrics monitoring, ODS
Written by Didi Tech, the official Didi technology account.