Building an Observability Platform for Mini‑Program Image Uploads Using SRE and Metrics‑Driven Development
A three‑day, cross‑team investigation of a mini‑program image‑upload failure led the team to design and build a metrics‑driven observability platform grounded in SRE practice. The platform quantifies SLIs, automates tracing, and provides dashboards for real‑time and long‑term analysis, which reduced mean time to recovery (MTTR).
Introduction
The author discusses the challenges of diagnosing image upload failures in a mini‑program and the motivation to combine Site Reliability Engineering (SRE) with Metrics‑Driven Development (MDD) to build an observability platform.
Incident Background
On 2022‑02‑10 users reported upload failures; a cross‑team investigation involving frontend, system architecture, and operations lasted three days before identifying a WeChat base‑library rollback as the root cause.
Problem Analysis
The upload path spans the user's device, network nodes, the Kong gateway, and backend services, and a fault at any of these hops can cause an upload to fail. The team worked through a systematic checklist to rule out each possibility in turn.
Observability Design
The solution defines SLIs and SLOs using a VALET‑style model and collects metrics at each hop. The mini‑program is instrumented to send a fingerprint hash and a trace_id, to store logs locally while offline, and to report them once connectivity returns.
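The client-side instrumentation described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the mini‑program's own code; the class `UploadReporter`, the `send` callback, the device fields, and the buffer limit are all hypothetical names chosen for the example, not details from the original article.

```python
import hashlib
import json
import uuid
from collections import deque


def fingerprint(device_info: dict) -> str:
    """Stable short hash of device attributes (field names are hypothetical)."""
    canonical = json.dumps(device_info, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def new_trace_id() -> str:
    """One trace_id per upload attempt, shared by every hop that logs it."""
    return uuid.uuid4().hex


class UploadReporter:
    """Buffers upload events locally and reports them when the client is online."""

    def __init__(self, send, device_info: dict, max_buffered: int = 200):
        self._send = send  # assumed callable(event) -> bool, True on success
        self._fingerprint = fingerprint(device_info)
        # Bounded buffer: the oldest events are dropped once the limit is hit.
        self._buffer = deque(maxlen=max_buffered)

    def record(self, trace_id: str, stage: str, ok: bool, online: bool) -> None:
        self._buffer.append({
            "trace_id": trace_id,
            "fingerprint": self._fingerprint,
            "stage": stage,
            "ok": ok,
        })
        if online:
            self.flush()

    def flush(self) -> int:
        """Send buffered events in order and stop at the first failed send."""
        sent = 0
        while self._buffer:
            if not self._send(self._buffer[0]):
                break
            self._buffer.popleft()
            sent += 1
        return sent
```

Because events are only removed from the buffer after a successful send, anything recorded while offline is retried on the next flush, which matches the "store locally, report later" behavior described above.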
Key metrics are derived from the VALET dimensions (Volume, Availability, Latency, Error, Ticket) and visualized in Grafana.
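As a rough illustration of how the five VALET dimensions might be computed from raw upload records, consider the sketch below. The record fields, the choice of p95 for latency, and the definition of availability as the share of non-5xx responses are assumptions made for this example; the article does not state the exact formulas.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class UploadRecord:
    status: int        # HTTP status code (assumed to be taken at the gateway)
    latency_ms: float  # upload latency in milliseconds


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def valet_snapshot(records, tickets=0):
    """Summarize one time window along the VALET dimensions."""
    volume = len(records)
    if volume == 0:
        return {"volume": 0, "availability": None, "latency_p95_ms": None,
                "error_rate": None, "tickets": tickets}
    server_errors = sum(1 for r in records if r.status >= 500)
    all_errors = sum(1 for r in records if r.status >= 400)
    return {
        "volume": volume,                            # V: request count
        "availability": 1 - server_errors / volume,  # A: assumed non-5xx share
        "latency_p95_ms": percentile([r.latency_ms for r in records], 95),  # L
        "error_rate": all_errors / volume,           # E: 4xx and 5xx share
        "tickets": tickets,                          # T: user-reported issues
    }
```

A snapshot like this, computed per time window, is the kind of row a Grafana panel could plot against an SLO threshold.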
Platform Implementation
Data from users, network operators, Kong, and backend services are stored in ClickHouse, visualized with Grafana, and enriched with geographic and device dimensions.
Dashboards cover real‑time overview, long‑term trends, code‑error analysis, slow‑route risk, route details, and crawl analysis.
Operational Challenges
Keeping log volume and query latency manageable required capping log size, sampling, filtering out noise, creating views, and aggregating raw data into minute‑level SLIs.
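Two of these techniques, sampling and minute‑level aggregation, can be sketched as below. This is an assumption-laden illustration: the article does not say how sampling decisions were made, so the hash-based rule keyed on trace_id is one plausible choice, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def keep_sample(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: the same trace_id always gets the same decision,
    so every hop keeps or drops the logs of a given upload together."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate


def minute_sli(events):
    """Aggregate (epoch_seconds, ok) pairs into per-minute success ratios.

    Returns {minute_start_epoch: (total_requests, success_ratio)}.
    """
    counts = defaultdict(lambda: [0, 0])  # minute -> [total, successes]
    for ts, ok in events:
        minute = int(ts) // 60 * 60
        counts[minute][0] += 1
        counts[minute][1] += 1 if ok else 0
    return {m: (total, good / total) for m, (total, good) in sorted(counts.items())}
```

Dashboards that read the pre-aggregated per-minute rows scan far fewer records than queries over raw logs. In this sketch the aggregation is meant to run over all events, with sampling applied only to the detailed logs kept for tracing, so that the success ratios are not skewed.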
Future Work
Plans include chaos engineering for proactive fault injection, broader adoption across platforms, migration to other services, and applying statistical models to automate anomaly significance assessment.
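As one hedged example of what a statistical significance check could look like, the sketch below applies a one‑sided two‑proportion z‑test to compare the failure rate in a current window against a baseline window. The article does not name a specific model, so this test is a stand-in chosen for illustration, and the function name and parameters are hypothetical.

```python
from math import erfc, sqrt


def failure_rate_p_value(base_fail, base_total, cur_fail, cur_total):
    """One-sided two-proportion z-test.

    Returns the p-value for the hypothesis that the current failure rate
    is higher than the baseline failure rate.
    """
    pooled = (base_fail + cur_fail) / (base_total + cur_total)
    se = sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cur_total))
    if se == 0:
        return 1.0  # no variation in either window, nothing to flag
    z = (cur_fail / cur_total - base_fail / base_total) / se
    return 0.5 * erfc(z / sqrt(2))  # P(Z >= z) for a standard normal
```

A small p-value suggests the rise in failures is unlikely to be noise. The normal approximation behind this test is only reasonable when both windows contain enough requests, so low-traffic routes would need a different method.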
Conclusion
The observability platform reduced MTTR and demonstrated the value of metrics‑driven SRE practices for running reliable mini‑program services.
HaoDF Tech Team
HaoDF Online tech practice and sharing—join us to discuss and help create quality healthcare through technology.