How We Built a Mini‑Program Observability Platform to Slash Incident Resolution Time

After a three‑day, ten‑person investigation into a mini‑program image‑upload failure, we designed and implemented an end‑to‑end observability platform using MDD and SRE principles, defining SLI/SLO, instrumenting client, network, gateway and backend layers, and visualizing metrics with Grafana, ClickHouse and Prometheus.

dbaplus Community

Jun 13, 2022

Incident Overview

On 10 February, a large number of users of the "Good Doctor" mini‑program reported image‑upload failures. Ten engineers from operations, frontend, backend and architecture teams investigated for three days before identifying the root cause.

Root Cause

The failure was traced to a WeChat base‑library upgrade that introduced a caching regression for a subset of iOS users. The regression manifested as HTTP 400 errors on the image‑upload API and required users to clear local caches.

Observability Goals

Detect anomalies reliably.

Reduce the expertise required for root‑cause analysis.

Platform‑ize the toolchain to accelerate investigations.

Derive actionable insights from massive log data.

Minimize additional engineering overhead.

MDD‑Driven Observability Design

Metrics‑Driven Development (MDD) was combined with SRE principles to define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for each stage of the upload flow: client, network node, entry gateway, and backend service. The VALET model (Volume, Availability, Latency, Errors, Tickets) from Google's Site Reliability Workbook was adopted as the SLI framework. Two‑point monitoring (request start and end) creates a trace line that separates long‑term trend analysis from real‑time anomaly detection.
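As a rough illustration of how per‑stage SLIs and an SLO check might fit together, the sketch below models one time window of counters for a single stage and compares it against two targets. The class names, field names, and the nearest‑rank percentile method are assumptions made for this example; the article does not specify how the SLIs were actually computed.

```python
import math
from dataclasses import dataclass


@dataclass
class StageWindow:
    """Counters for one stage (client, network, gateway, backend) in one time window."""
    stage: str
    requests: int        # Volume
    successes: int       # feeds Availability
    errors: int          # Errors
    latencies_ms: list   # feeds Latency
    tickets: int = 0     # Tickets: user reports attributed to this stage

    def availability(self) -> float:
        return self.successes / self.requests if self.requests else 1.0

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def latency_percentile(self, pct: float) -> float:
        """Nearest-rank percentile of the recorded latencies (assumed method)."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]


@dataclass
class Slo:
    """Hypothetical objective: minimum availability and maximum P99 latency."""
    min_availability: float
    max_p99_ms: float

    def violations(self, window: StageWindow) -> list:
        found = []
        if window.availability() < self.min_availability:
            found.append("availability")
        if window.latency_percentile(99) > self.max_p99_ms:
            found.append("latency_p99")
        return found
```

Evaluating the same `Slo` against one `StageWindow` per stage would show which stage of the upload flow is out of budget, which is the kind of localization the per‑stage SLI design aims for.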

Client‑side Instrumentation

On each mini‑program launch a probe request reports a fingerprint containing platform, app version, WeChat version, base‑library version, and mini‑program package name.

The fingerprint is hashed to a hash_key that is attached to all subsequent request URLs and HTTP headers.

A unique trace_id is generated per request and propagated downstream via headers.

When network conditions are poor, error logs are buffered in LocalStorage and uploaded later; each log entry is tagged with the associated trace_id.
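The client‑side steps above can be sketched as follows. A real mini‑program client would be written in JavaScript; Python is used here only to show the logic. The SHA‑256 hash, the 16‑character truncation, the header names `X-Hash-Key` and `X-Trace-Id`, the `hash_key` query parameter, and the buffer size are all assumptions for illustration, not details from the article.

```python
import hashlib
import json
import uuid
from collections import deque


def make_hash_key(fingerprint: dict) -> str:
    """Hash the device fingerprint into a short, order-independent key."""
    canonical = json.dumps(fingerprint, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def tag_request(url: str, headers: dict, hash_key: str):
    """Attach hash_key to the URL and headers, plus a fresh per-request trace_id."""
    trace_id = uuid.uuid4().hex
    separator = "&" if "?" in url else "?"
    tagged_url = f"{url}{separator}hash_key={hash_key}"
    tagged_headers = {**headers, "X-Hash-Key": hash_key, "X-Trace-Id": trace_id}
    return tagged_url, tagged_headers, trace_id


class ErrorLogBuffer:
    """Bounded local buffer standing in for LocalStorage on a poor network."""

    def __init__(self, max_entries: int = 200):
        # Oldest entries are dropped once the buffer is full.
        self._entries = deque(maxlen=max_entries)

    def record(self, trace_id: str, message: str) -> None:
        self._entries.append({"trace_id": trace_id, "message": message})

    def flush(self, send) -> int:
        """Upload buffered entries in order; stop at the first failed send."""
        sent = 0
        while self._entries:
            if not send(self._entries[0]):
                break
            self._entries.popleft()
            sent += 1
        return sent
```

Sorting the fingerprint keys before hashing keeps `hash_key` stable regardless of the order in which the client collects the fields, so identical environments group together in later queries.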

Network‑node Collection

Carrier identifiers are added to traffic to distinguish ISP‑level paths.

Carrier logs are ingested into ClickHouse for later analysis, making the transmission layer observable.

Entry‑gateway (Kong) Enhancements

Routes are split by business type (e.g., upload, order submission) and tagged with route_name that includes department and page information.

Kong plugins generate or forward trace_id, linking Kong logs with backend trace logs.
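The gateway behaviour described above, forwarding an incoming trace_id or generating one when it is missing, might look roughly like the sketch below. Kong plugins are usually written in Lua, so this Python version only illustrates the logic. The header name `x-trace-id` and the dot‑separated `department.page.business` layout of route_name are hypothetical conventions chosen for the example.

```python
import uuid

TRACE_HEADER = "x-trace-id"  # hypothetical header name


def ensure_trace_id(headers: dict) -> dict:
    """Forward the caller's trace_id if present, otherwise generate a new one."""
    normalized = {name.lower(): value for name, value in headers.items()}
    if not normalized.get(TRACE_HEADER):
        normalized[TRACE_HEADER] = uuid.uuid4().hex
    return normalized


def build_route_name(department: str, page: str, business: str) -> str:
    """Compose a route_name carrying department, page, and business type."""
    return f"{department}.{page}.{business}"


def parse_route_name(route_name: str) -> dict:
    """Split a route_name back into its parts for dashboard grouping."""
    department, page, business = route_name.split(".", 2)
    return {"department": department, "page": page, "business": business}
```

Because the same trace_id value then appears in both the gateway log and the backend trace log, the two can be joined on that field during an investigation.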

Backend Service Metrics

The image service is instrumented to record metrics per image size (upload/download latency, success rate).

Standardized error codes are defined to facilitate automated anomaly detection.
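One possible shape for the per‑size metrics and standardized error codes is sketched below. The size‑bucket boundaries, the error‑code names and numbers, and the in‑memory aggregation are all assumptions made for illustration; the article does not list the actual codes or buckets.

```python
from collections import defaultdict
from enum import IntEnum


class UploadCode(IntEnum):
    """Hypothetical standardized result codes for the image service."""
    OK = 0
    INVALID_IMAGE = 40001
    TOO_LARGE = 40002
    STORAGE_UNAVAILABLE = 50001


# Upper bounds in bytes, checked in order (assumed boundaries).
SIZE_BUCKETS = [
    (100 * 1024, "<=100KB"),
    (1024 * 1024, "<=1MB"),
    (5 * 1024 * 1024, "<=5MB"),
]


def size_bucket(size_bytes: int) -> str:
    for limit, label in SIZE_BUCKETS:
        if size_bytes <= limit:
            return label
    return ">5MB"


class ImageMetrics:
    """Aggregates latency and success rate per (operation, size bucket)."""

    def __init__(self):
        self._stats = defaultdict(lambda: {"count": 0, "ok": 0, "latency_ms_sum": 0.0})

    def observe(self, operation: str, size_bytes: int, latency_ms: float, code: UploadCode) -> None:
        stats = self._stats[(operation, size_bucket(size_bytes))]
        stats["count"] += 1
        stats["latency_ms_sum"] += latency_ms
        if code == UploadCode.OK:
            stats["ok"] += 1

    def summary(self, operation: str, bucket: str):
        """Return success rate and mean latency, or None if nothing was recorded."""
        stats = self._stats.get((operation, bucket))
        if stats is None:
            return None
        return {
            "success_rate": stats["ok"] / stats["count"],
            "avg_latency_ms": stats["latency_ms_sum"] / stats["count"],
        }
```

Keeping the codes in a single enumeration means an alert rule can match on a fixed set of values instead of parsing free‑form error messages.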

Observability Platform

SLI data are stored in ClickHouse; Grafana visualizes the data via SQL‑based dashboards.

Geo‑location analysis converts request IPs to latitude/longitude for regional SLI breakdowns.

The existing analysis engine was reused; only new alert rules were added.
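A regional SLI breakdown of the kind mentioned above could be approximated as in the following sketch. The lookup table is a placeholder: it maps reserved documentation IP ranges to approximate city coordinates purely for illustration, whereas a production system would query a real IP geolocation database. Treating any status below 400 as a success is also an assumption.

```python
import ipaddress
from collections import defaultdict

# Placeholder table: documentation-only IP ranges mapped to illustrative regions.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "Beijing", 39.90, 116.40),
    (ipaddress.ip_network("198.51.100.0/24"), "Shanghai", 31.23, 121.47),
]


def locate(ip: str):
    """Return (region, latitude, longitude) for an IP, or an 'unknown' entry."""
    address = ipaddress.ip_address(ip)
    for network, region, latitude, longitude in GEO_TABLE:
        if address in network:
            return region, latitude, longitude
    return "unknown", None, None


def regional_success_rate(requests):
    """Compute the success rate per region from (ip, http_status) pairs."""
    totals = defaultdict(lambda: [0, 0])  # region -> [total, successful]
    for ip, status in requests:
        region = locate(ip)[0]
        totals[region][0] += 1
        if status < 400:
            totals[region][1] += 1
    return {region: ok / total for region, (total, ok) in totals.items()}
```

A region whose success rate diverges sharply from the others would point toward a carrier or network‑path problem instead of a backend fault.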

Key Dashboards

Mini‑Program Overview: 15‑minute instant metrics (QPS, UV, slow‑route count, error count) with threshold‑based coloring; includes long‑term trends.

Long‑Term Trend: Weekly/monthly/yearly views to assess release impact.

Code Exception Analysis: Drill‑down on error‑code distribution; triggers manual investigation when the rate of HTTP 400 responses exceeds 0.5%.

Slow‑Route Risk: Top‑10 slow routes, distinguishing chronic slowness from spikes, correlated with APM data.

Route Detail: Per‑route P99 latency, code distribution, UA/IP analysis, regional breakdown, backend node health.

Spider‑Crawl Analysis: Detects abnormal UA or IP concentration indicative of crawler activity.
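Two of the dashboard rules above lend themselves to a short sketch: the 0.5% threshold on HTTP 400 responses and the concentration check used for crawler detection. The 0.5% figure comes from the article; the 30% concentration threshold and all function names are assumptions for this example.

```python
from collections import Counter


def code_rate(status_codes: list, code: int) -> float:
    """Share of responses in the window that carry the given status code."""
    if not status_codes:
        return 0.0
    return status_codes.count(code) / len(status_codes)


def needs_investigation(status_codes: list, code: int = 400, threshold: float = 0.005) -> bool:
    """True when the rate of `code` exceeds the threshold (0.5% by default)."""
    return code_rate(status_codes, code) > threshold


def top_share(values: list):
    """Return the most frequent value and the fraction of traffic it accounts for."""
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


def looks_like_crawler(values: list, max_share: float = 0.3) -> bool:
    """Flag a window where one UA or IP dominates traffic (assumed 30% cutoff)."""
    return top_share(values)[1] > max_share
```

The same `top_share` helper works for either user agents or source IPs, since both checks reduce to asking whether a single value accounts for an unusual share of requests.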

Operational Challenges

Control error‑log size to avoid Flume overload.

Implement sampling and rate‑limiting to prevent bandwidth exhaustion during bursty failures.

Reduce noise from unrelated errors (e.g., periodic security scans).

Optimize SQL performance with materialized views and minute‑level aggregation.
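The sampling and rate‑limiting item above could be implemented in many ways. The sketch below shows one common combination, assumed here for illustration: a token bucket that caps the log upload rate, and deterministic sampling keyed on trace_id so that all logs for a sampled request are kept together. The clock is passed in explicitly to keep the example testable.

```python
import hashlib


class TokenBucket:
    """Allows bursts up to `capacity`, then refills at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def sampled(trace_id: str, percent: int) -> bool:
    """Keep roughly `percent`% of traces, always deciding the same way per trace_id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent
```

During a bursty failure, a client would call `sampled` first to decide whether a trace is reported at all, then `allow` to keep the reports it does send within a bandwidth budget.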

Future Work

Automated chaos experiments to inject failures and validate the observability platform.

Extend the approach to other front‑ends (web, PC, native apps) and middleware.

Integrate statistical anomaly‑detection algorithms to quantify metric deviations.

Develop AI‑assisted root‑cause suggestions based on historical incident data.
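As one example of what statistical anomaly detection might mean here, the sketch below scores the latest metric value against recent history with a z‑score. The article does not name an algorithm, so the z‑score itself, the threshold of 3 standard deviations, and the function names are assumptions.

```python
import math
import statistics


def z_score(history: list, latest: float) -> float:
    """Number of standard deviations by which `latest` deviates from `history`."""
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)
    if spread == 0:
        # A flat history makes any change infinitely unusual.
        return 0.0 if latest == mean else math.copysign(math.inf, latest - mean)
    return (latest - mean) / spread


def is_anomaly(history: list, latest: float, threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` standard deviations from the mean."""
    return abs(z_score(history, latest)) > threshold
```

A score like this turns "the error count looks high" into a quantity that can be compared across routes with very different traffic levels.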
