Ensuring Stability of the Double‑11 Supply Chain Dashboard: Full‑Chain Process, Risk Points, and Technical Safeguard Strategies
This article details how the supply-chain big-screen dashboard for Double-11 maintains high stability: mapping the full data flow, identifying risk points across the ingestion, processing, storage, and service layers, and applying comprehensive technical safeguards such as high-availability design, fault tolerance, monitoring, and coordinated operational procedures.
Background
The supply‑chain dashboard is a core logistics report used for major promotions, featuring over 170 metrics, more than 30 dependent interfaces, a long data chain, and strict stability requirements.
1. Full‑Chain Process Diagram
The first step is to draw a complete flow diagram, then drill down into the processing details of each metric to surface issues and devise targeted safeguards.
2. Risk Point Identification
The dashboard’s pipeline is divided into four layers (data ingestion, metric processing, metric storage, and metric service), plus cross-cutting monitoring management. Key risks per layer:
Data Ingestion Layer: long processing chain (Hive, JSF, HTTP, JDQ, Flink, DTS, CK, EasyData), many dependent parties, multiple ingestion types.
Metric Processing Layer: multi-dimensional metrics with ordered calculations, external dependencies requiring recomputation, flexible promotion-strategy adjustments.
Metric Storage Layer: cross-business impact and the need for rapid anomaly localization.
Metric Service Layer: interface stability, degradation and fallback mechanisms, business isolation.
Monitoring Management: monitoring of metric processing and rapid fault localization.
3. Technical Safeguard Strategies
3.1 Data Ingestion Layer
3.1.1 Long Processing Chain
Define clear boundaries and assign ownership across four areas: the Hive team, the real-time processing team, interface providers, and the SCM team.
Ensure high availability for each dependent component (Hive, Flink, interfaces) and add pre‑emptive monitoring.
3.1.2 Multiple Dependent Parties
Document all dependent interfaces and negotiate SLAs with their owners; a sample interface matrix is shown in the original diagram.
3.1.3 Multiple Ingestion Types
Offline Hive: dedicated promotion‑heavy tasks, monitoring, and fire‑watch tables.
Business Import: validation and mock data import for Double‑11.
External JSF/HTTP: monitoring, retry, degradation, and fallback.
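The retry-then-degrade pattern for external JSF/HTTP calls can be sketched as follows. This is a minimal illustration, not the production code; the function and parameter names (`call_with_retry`, `fetch`, `fallback`) are assumptions for the example.

```python
import time

def call_with_retry(fetch, fallback, retries=3, backoff_s=0.5):
    """Call an external interface with bounded retries; once retries are
    exhausted, degrade to the caller-supplied fallback (e.g. last good data).
    All names here are illustrative, not from the dashboard codebase."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt < retries - 1:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return fallback()  # degradation path: serve stale or default data

# Usage: a dependency that is down falls through to the fallback.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    raise ConnectionError("upstream down")

result = call_with_retry(flaky, lambda: {"source": "fallback"},
                         retries=2, backoff_s=0)
```

The key design point is that the fallback is supplied by the caller, so each metric can decide what "degraded" means for it (cached value, zero, or a placeholder).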
3.2 Metric Processing Layer
3.2.1 Multi‑Dimensional Metrics
Separate tables by dimension (warehouse, region) and by granularity (minute, hour, cache, history).
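Table separation by dimension and granularity amounts to a routing decision at write and query time. A minimal sketch, with hypothetical table names, could look like this:

```python
# Hypothetical routing table: each (dimension, granularity) pair maps to its
# own physical table, so hot minute-level writes never contend with hourly or
# historical queries. Table names are illustrative only.
TABLE_ROUTES = {
    ("warehouse", "minute"): "metric_warehouse_minute",
    ("warehouse", "hour"):   "metric_warehouse_hour",
    ("region", "minute"):    "metric_region_minute",
    ("region", "hour"):      "metric_region_hour",
}

def route_table(dimension: str, granularity: str) -> str:
    """Resolve the physical table for a metric write or query."""
    try:
        return TABLE_ROUTES[(dimension, granularity)]
    except KeyError:
        raise ValueError(f"no table for {dimension}/{granularity}")
```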
3.2.2 Re‑computation, Fault Tolerance, Fast Recovery
Implement generic degradation for external interfaces, with fallback to the latest successful result within 30 minutes, and design fast recomputation paths.
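The 30-minute fallback window can be modeled as a "last good result" cache: a successful computation refreshes the cache, and a failure serves the cached value only while it is still fresh. This is a hypothetical sketch, not the production implementation; the class and metric names are assumptions.

```python
import time

STALE_WINDOW_S = 30 * 60  # serve results at most 30 minutes old

class LastGoodCache:
    """Keeps the latest successful computation per metric; on failure,
    serves it if still within the 30-minute window (illustrative sketch)."""
    def __init__(self, clock=time.time):
        self._clock = clock
        self._store = {}  # metric -> (value, timestamp)

    def compute(self, metric, fn):
        try:
            value = fn()
            self._store[metric] = (value, self._clock())
            return value
        except Exception:
            hit = self._store.get(metric)
            if hit and self._clock() - hit[1] <= STALE_WINDOW_S:
                return hit[0]  # degrade to the last successful result
            raise  # too stale (or never computed): surface the failure

# Usage with an injectable fake clock, so the window is easy to test.
now = [0]
cache = LastGoodCache(clock=lambda: now[0])
first = cache.compute("produced_orders", lambda: 100)  # succeeds, cached
now[0] = 600                                           # 10 minutes later
def boom(): raise RuntimeError("upstream timeout")
stale = cache.compute("produced_orders", boom)         # serves cached 100
```

Injecting the clock keeps the freshness rule deterministic in tests and rehearsals.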
3.2.3 Flexible Promotion Strategy
Expose strategy configuration via DUCC; example JSON configuration:
{
  "sTime": "2024-11-xx 00:00:00",
  "eTime": "2024-11-xx 19:59:59",
  "tbSTime": "2023-11-xx 00:00:00",
  "tbETime": "2023-11-xx 19:59:59",
  "hbSTime": "2024-06-xx 00:00:00",
  "hbETime": "2024-06-xx 19:59:59",
  "showType": "24h",
  "special24hCompDateStr": "2024-11-xx",
  "specialCompDateStr": ""
}

3.3 Metric Storage Layer
MySQL is deployed in a one-primary, three-replica topology, with separate databases for the main screen, the core board, and other reports. Doris receives the binlog via asynchronous replication for long-term storage.
Metrics are stored with JSON tagging to enable fast filtering; SQL queries extract needed fields directly from JSON.
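The JSON-tagging idea can be illustrated in plain Python: each row carries a JSON tag column, and filtering extracts the needed field, much as `JSON_EXTRACT(tags, '$.key')` would in SQL. The tag keys (`bizType`, `warehouse`) and values below are illustrative assumptions, not the dashboard's actual schema.

```python
import json

# Rows as stored: a metric value plus a JSON tag column used for filtering.
# Tag keys and values are hypothetical examples.
rows = [
    {"value": 120, "tags": json.dumps({"bizType": "b2c", "warehouse": "WH01"})},
    {"value": 300, "tags": json.dumps({"bizType": "b2b", "warehouse": "WH01"})},
    {"value": 80,  "tags": json.dumps({"bizType": "b2c", "warehouse": "WH02"})},
]

def filter_by_tag(rows, key, expected):
    """In-memory analogue of SQL filtering on an extracted JSON field."""
    return [r for r in rows if json.loads(r["tags"]).get(key) == expected]

b2c = filter_by_tag(rows, "bizType", "b2c")  # two matching rows
```

Storing tags as JSON lets new filter dimensions be added without schema changes, at the cost of extraction work at query time.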
3.4 Metric Service Layer
Interface stability is ensured through load testing and business isolation.
Degradation: on a single-interface failure, fall back to the most recent successful data within 30 minutes.
Fallback strategies handle abnormal categories during prediction.
3.5 Monitoring Management
Two principles: pre‑emptive monitoring to detect upstream issues early, and comprehensive coverage across processing, querying, data pushing, and accuracy checks. Dashboards display interface availability, internal method health, and data correctness.
4. Additional Process Safeguards
4.1 Communication & Collaboration
Establish a dedicated promotion‑support chat group to streamline coordination among many stakeholders.
4.2 Full‑Chain Rehearsals
Conduct bi‑annual end‑to‑end drills to familiarize teams with configurations and validate special‑promotion strategies.
4.3 Business‑Linkage & Pre‑Validation
Collaborate with the business team to verify historical year-over-year (same-period) and period-over-period data, and mock promotion dates in pre-release environments to ensure data accuracy.
4.4 Result‑First Mindset
Prioritize dashboard stability and data correctness over blame‑shifting, driving proactive issue resolution.
4.5 Team Effort
Success relies on collective effort across development, operations, and upstream partners.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.