Building an Observability Platform for Mini‑Program Image Uploads Using SRE and Metrics‑Driven Development

A three‑day, cross‑team investigation of a mini‑program image‑upload failure led to the design and implementation of an observability platform built on SRE practice and Metrics‑Driven Development. The platform quantifies SLIs, automates tracing, and provides dashboards for real‑time and long‑term analysis, ultimately reducing MTTR.

HaoDF Tech Team

Mar 29, 2022

Introduction

The author discusses the challenges of diagnosing image upload failures in a mini‑program and the motivation to combine Site Reliability Engineering (SRE) with Metrics‑Driven Development (MDD) to build an observability platform.

Incident Background

On 2022‑02‑10 users reported upload failures; a cross‑team investigation involving frontend, system architecture, and operations lasted three days before identifying a WeChat base‑library rollback as the root cause.

Problem Analysis

The upload path spans the user side, network nodes, the gateway (Kong), and backend services, and a failure at any of these hops can break an upload. The team used a systematic checklist to rule out each possibility in turn.

Observability Design

The solution defines SLIs and SLOs using a VALET‑style model and collects metrics at each hop. The mini‑program is instrumented to send a fingerprint hash and a trace_id, to store logs locally when the device is offline, and to report them later.
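The client-side behavior described above (fingerprint hash, trace_id, local buffering, deferred reporting) can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's implementation: a real mini-program would be written in JavaScript, and the `UploadReporter` class, the `send` callback, the event fields, and the device attributes used for the fingerprint are all hypothetical.

```python
import hashlib
import json
import uuid
from collections import deque


def fingerprint(device_attrs: dict) -> str:
    """Stable short hash of device attributes (the attribute set is an assumption)."""
    canonical = json.dumps(device_attrs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class UploadReporter:
    """Buffers upload events locally and reports them once a send succeeds."""

    def __init__(self, send, device_attrs: dict, max_buffer: int = 200):
        self._send = send  # hypothetical transport: callable(list[dict]) -> bool
        self._fp = fingerprint(device_attrs)
        # Bounded buffer: when full, the oldest events are dropped first.
        self._buffer = deque(maxlen=max_buffer)

    def new_trace_id(self) -> str:
        """One trace_id per upload attempt, propagated to every hop."""
        return uuid.uuid4().hex

    def record(self, trace_id: str, stage: str, ok: bool, latency_ms: float) -> None:
        self._buffer.append({
            "trace_id": trace_id,
            "fingerprint": self._fp,
            "stage": stage,
            "ok": ok,
            "latency_ms": latency_ms,
        })

    def flush(self) -> int:
        """Try to report buffered events; keep them locally if the send fails."""
        if not self._buffer:
            return 0
        batch = list(self._buffer)
        try:
            delivered = self._send(batch)
        except OSError:
            delivered = False
        if not delivered:
            return 0
        self._buffer.clear()
        return len(batch)
```

Calling `flush()` while offline returns 0 and leaves the events in the buffer, so a later call made after connectivity returns delivers the same batch. Bounding the buffer is one way to keep local storage from growing without limit on devices that stay offline for a long time.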

Key metrics are derived from the VALET dimensions (Volume, Availability, Latency, Error, Ticket) and visualized in Grafana.
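As a rough illustration of how the five VALET dimensions could be computed from per-request records, consider the sketch below. The `UploadEvent` fields, the `valet_summary` function, and the 3000 ms latency threshold are assumptions made for the example; the article does not give the team's actual SLI definitions or SLO targets.

```python
from dataclasses import dataclass


@dataclass
class UploadEvent:
    ok: bool
    latency_ms: float
    error_code: str = ""  # empty on success (hypothetical field)


def valet_summary(events, tickets=0, latency_slo_ms=3000.0):
    """Summarize one time window along the VALET dimensions."""
    volume = len(events)
    ok_count = sum(1 for e in events if e.ok)
    # Latency SLI: share of successful uploads that finished within the threshold.
    fast = sum(1 for e in events if e.ok and e.latency_ms <= latency_slo_ms)
    errors = {}
    for e in events:
        if not e.ok:
            errors[e.error_code] = errors.get(e.error_code, 0) + 1
    return {
        "volume": volume,
        "availability": ok_count / volume if volume else None,
        "latency_within_slo": fast / ok_count if ok_count else None,
        "errors": errors,
        "tickets": tickets,  # user-reported issues, supplied from outside the event stream
    }


window = [
    UploadEvent(True, 800),
    UploadEvent(True, 4000),
    UploadEvent(False, 10000, "timeout"),
    UploadEvent(True, 1200),
]
summary = valet_summary(window, tickets=1)
```

For this window, three of four uploads succeed (availability 0.75) and two of the three successful uploads finish within the assumed threshold.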

Platform Implementation

Data from users, network operators, Kong, and backend services are stored in ClickHouse, visualized with Grafana, and enriched with geographic and device dimensions.

Dashboards cover real‑time overview, long‑term trends, code‑error analysis, slow‑route risk, route details, and crawl analysis.

Operational Challenges

Keeping log volume and query cost manageable required several measures: capping log size, sampling, filtering out noise, creating views, and aggregating raw events into minute‑level SLIs.
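Two of these measures, sampling and minute‑level aggregation, can be sketched in a few lines. The sketch assumes hash-based sampling keyed on trace_id, so that all logs of one trace are kept or dropped together, and an availability SLI per minute. Both function names are hypothetical, and the article does not say which sampling scheme the team used.

```python
import zlib
from collections import defaultdict


def keep_sample(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: the same trace_id always gets the same decision."""
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < int(rate * 10_000)


def minute_sli(events):
    """Aggregate (unix_ts_seconds, ok) pairs into {minute_start: availability}."""
    totals = defaultdict(lambda: [0, 0])  # minute_start -> [total, good]
    for ts, ok in events:
        minute = int(ts) // 60 * 60
        totals[minute][0] += 1
        totals[minute][1] += 1 if ok else 0
    return {m: good / total for m, (total, good) in sorted(totals.items())}


per_minute = minute_sli([(0, True), (30, False), (60, True), (119, True)])
```

Dashboards that read pre-aggregated minute rows scan far fewer records than queries over raw events, which is the usual reason for this kind of rollup.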

Future Work

Plans include chaos engineering for proactive fault injection, broader adoption across platforms, migration to other services, and applying statistical models to automate anomaly significance assessment.
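The article does not say which statistical models are planned. As one possible illustration of assessing anomaly significance, the sketch below applies a one-sided two-proportion z-test to compare the current failure rate against a baseline window. The function name and the choice of test are assumptions for the example, not the team's stated approach.

```python
import math


def failure_rate_z_test(base_fail, base_total, cur_fail, cur_total):
    """One-sided two-proportion z-test: is the current failure rate above baseline?

    Returns (z, p_value), where p_value = P(Z >= z) under the null hypothesis
    that both windows share the same underlying failure rate.
    """
    p_base = base_fail / base_total
    p_cur = cur_fail / cur_total
    pooled = (base_fail + cur_fail) / (base_total + cur_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cur_total))
    if se == 0:
        # Both windows are all-success or all-failure: no evidence of a change.
        return 0.0, 0.5
    z = (p_cur - p_base) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Baseline: 100 failures in 10,000 uploads. Current window: 50 failures in 1,000.
z, p = failure_rate_z_test(100, 10_000, 50, 1_000)
```

A small p-value suggests the rise in failures is unlikely to be noise. Any alerting threshold would need tuning against real traffic, and the test's normal approximation is unreliable when counts are very small.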

Conclusion

The observability platform reduced MTTR and demonstrated the value of combining MDD with SRE practices to keep mini‑program services reliable.
