
How MDD and SRE Cut Mini‑Program Image‑Upload Failures from Days to Minutes

This article recounts a three‑day image‑upload outage in a mini‑program, analyzes the multi‑layer causes, and shows how combining Metrics‑Driven Development with SRE and a custom observability platform dramatically reduces diagnosis time and improves reliability.

ITPUB

Jun 18, 2022


Background and Motivation

On February 10, 2022, users of a medical‑service mini‑program reported image‑upload failures. Ten engineers from front‑end, system architecture, and operations spent three days troubleshooting without a clear root cause, highlighting the high cost and pain of manual incident analysis.

Root‑Cause Investigation

Suspected a network‑link problem and tested it by switching carriers and capturing packets, but found no definitive evidence.

Suspected a user‑side network problem, but repeated attempts in the mini‑program still failed while the same actions succeeded in the native app.

Suspected a bug in a recent HTTP/2 change to the mini‑program, but the issue persisted after an emergency release.

Restarted the Kong gateway process in case stale resources were lingering, yet the error remained.

Analyzed the trend of code = 400 responses and found a spike coinciding with a WeChat base‑library upgrade that affected a subset of iOS users.

The investigation confirmed the failure originated from the mini‑program side, but the process was costly and painful, prompting a search for a faster, data‑driven approach.

Understanding the Upload Flow

The simplified upload model has four stages: the user device, the network nodes, the entry gateway (Kong), and the backend services. A failure can arise at any of them.

Key Questions for Metrics‑Driven Development (MDD)

Can anomalies be reliably observed?

How can the expertise required for root‑cause analysis be reduced?

Which toolchains can be platform‑ized to accelerate anomaly analysis?

Is it feasible to extract actionable insights from massive error data?

Will the observability platform add learning overhead for engineers?

Quantifying Anomalies with SLI/SLO

Effective anomaly detection requires turning failures into measurable Service Level Indicators (SLIs) and defining Service Level Objectives (SLOs) against them. Two SLO strategies are suggested: fixed thresholds (e.g., QPS < 10k) or relative changes (e.g., a rise of more than 20% in code = 404 responses).
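
The two strategies can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual rule engine: the Window shape, the function names, and the sample numbers are all hypothetical, and both examples from the text are read here as breach conditions, which is an interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Aggregated counts for one observation window (hypothetical shape)."""
    qps: float
    code_counts: dict  # HTTP status code -> number of responses


def breaches_fixed_threshold(window: Window, min_qps: float) -> bool:
    """Fixed-threshold check: breached when QPS falls below a floor."""
    return window.qps < min_qps


def breaches_relative_change(current: Window, baseline: Window,
                             code: int, max_increase: float) -> bool:
    """Relative-change check: breached when the count of one status code
    grows by more than max_increase (0.2 means 20%) over the baseline."""
    before = baseline.code_counts.get(code, 0)
    now = current.code_counts.get(code, 0)
    if before == 0:
        # No baseline occurrences: treat any occurrence as a breach.
        return now > 0
    return (now - before) / before > max_increase


# Made-up sample windows for illustration only.
baseline = Window(qps=12_000, code_counts={200: 9_500, 404: 100})
current = Window(qps=9_000, code_counts={200: 9_300, 404: 130})
```

With these sample windows, both checks fire for the current window: QPS is below a 10k floor, and 404 responses rose by 30% against the baseline.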

Building the Observability Platform

User Side

On each mini‑program start, send a probe request reporting version, platform, and WeChat details.

Generate a hash_key fingerprint of version info and attach it to subsequent request URLs and headers.

Associate hash_key with Kong logs to derive terminal‑level SLIs.

Include a trace_id in every request header for end‑to‑end tracing.

Cache error logs locally during network outages and upload them later, tagging each entry with trace_id.
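
The fingerprint and trace headers described above might look like the following. A real mini-program client would be written in JavaScript; Python is used here only to show the logic. The header names, the choice of fields, and the 16-character truncation are assumptions, not details from the original platform.

```python
import hashlib
import uuid


def make_hash_key(app_version: str, platform: str, wechat_version: str,
                  base_library_version: str) -> str:
    """Fingerprint the client environment so gateway logs can be grouped by it.

    The field list is an assumption; the article only says the key is
    derived from version information.
    """
    raw = "|".join([app_version, platform, wechat_version, base_library_version])
    # A short prefix of the digest keeps URLs and headers compact.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def build_request_headers(hash_key: str) -> dict:
    """Headers attached to each request; both header names are hypothetical."""
    return {
        "X-Hash-Key": hash_key,
        # A fresh trace id per request allows end-to-end correlation.
        "X-Trace-Id": uuid.uuid4().hex,
    }
```

Because the fingerprint is deterministic, every request from clients on the same version combination carries the same hash_key, which is what would let a spike be attributed to one base-library version.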

Network Nodes

Add carrier identifiers to logs for cross‑carrier availability analysis.

Store carrier logs in ClickHouse for downstream querying.
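
As a rough illustration of cross-carrier availability analysis, the sketch below aggregates log records in memory. In practice this would more likely be a query against the stored logs; the record fields, the carrier names, and the rule that any status below 400 counts as a success are assumptions.

```python
from collections import defaultdict


def availability_by_carrier(records):
    """Return the share of successful requests per carrier.

    Each record is a dict with hypothetical "carrier" and "status" fields.
    """
    total = defaultdict(int)
    ok = defaultdict(int)
    for record in records:
        carrier = record["carrier"]
        total[carrier] += 1
        if record["status"] < 400:  # assumed definition of success
            ok[carrier] += 1
    return {carrier: ok[carrier] / total[carrier] for carrier in total}


# Made-up sample records for illustration only.
logs = [
    {"carrier": "carrier_a", "status": 200},
    {"carrier": "carrier_a", "status": 200},
    {"carrier": "carrier_b", "status": 200},
    {"carrier": "carrier_b", "status": 400},
]
```

A carrier whose availability diverges sharply from the others points to the network path; similar availability across carriers points away from it.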

Entry Gateway (Kong)

Split route_name by business type (e.g., upload, order).

Tag routes with business department and page information for targeted alerts.

Use a plugin to generate or propagate trace_id, linking Kong logs with backend traces.
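
The generate-or-propagate rule is simple enough to sketch. Kong plugins are normally written in Lua, so the Python below only illustrates the decision logic; the header name is an assumption.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name


def ensure_trace_id(headers: dict) -> dict:
    """Return a copy of headers that always carries a trace id.

    An id sent by the client is kept unchanged (propagate); otherwise a
    new one is created at the gateway (generate).
    """
    out = dict(headers)
    # HTTP header names are case-insensitive, so compare in lower case.
    existing = next(
        (value for name, value in headers.items()
         if name.lower() == TRACE_HEADER.lower() and value),
        None,
    )
    if existing is None:
        out[TRACE_HEADER] = uuid.uuid4().hex
    return out
```

Keeping the client's id when one is present is what allows a single trace_id to join client error logs, gateway logs, and backend traces.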

Backend Services

Instrument logs to capture image size and upload/download metrics.

Define explicit error codes to aid root‑cause analysis.
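
One way to combine both points is a structured log entry that carries the metrics and an explicit error code. The code names and field names below are invented for illustration; the article does not list the actual codes.

```python
import json
from enum import Enum
from typing import Optional


class UploadError(Enum):
    """Hypothetical explicit error codes for the upload path."""
    IMAGE_TOO_LARGE = "UPLOAD_IMAGE_TOO_LARGE"
    UNSUPPORTED_FORMAT = "UPLOAD_UNSUPPORTED_FORMAT"
    STORAGE_UNAVAILABLE = "UPLOAD_STORAGE_UNAVAILABLE"


def upload_log_entry(trace_id: str, image_bytes: int, duration_ms: float,
                     error: Optional[UploadError] = None) -> str:
    """Serialize one upload attempt as a JSON log line."""
    entry = {
        "trace_id": trace_id,
        "image_bytes": image_bytes,
        "duration_ms": duration_ms,
        # None on success; an explicit code on failure.
        "error_code": error.value if error else None,
    }
    return json.dumps(entry, sort_keys=True)
```

An explicit code such as UPLOAD_IMAGE_TOO_LARGE can be counted and alerted on directly, whereas a bare 400 leaves the cause to be inferred.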

Observability Platform

Leverage an existing analysis engine; only add new alert rules for this scenario.

Store SLIs in ClickHouse and visualize them with Grafana using SQL‑based panels.

Enrich logs with geolocation (IP → latitude/longitude) for regional SLIs.

[Figure: the overall log‑reporting pipeline]

Challenges Encountered

Controlling error‑log size to avoid Flume ingestion failures.

Sampling and rate‑limiting to prevent bandwidth overload during spikes.

Noise reduction for unrelated alerts (e.g., periodic security scans).

Optimizing SQL queries and adding materialized views to keep analysis performant as data volume grows.
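
The first two challenges can be approached with size capping and a token bucket, sketched below. This is one common technique, not necessarily the one the team used; the 4096-byte cap, the class name, and the parameters are assumptions.

```python
import time

MAX_LOG_BYTES = 4096  # assumed cap; the real ingestion limit is not given


def truncate_log(message: str, max_bytes: int = MAX_LOG_BYTES) -> str:
    """Cap a log message at max_bytes of UTF-8."""
    data = message.encode("utf-8")
    if len(data) <= max_bytes:
        return message
    # errors="ignore" drops a multi-byte character cut at the boundary.
    return data[:max_bytes].decode("utf-8", errors="ignore")


class TokenBucket:
    """Allow at most rate_per_sec log uploads on average, with short bursts
    of up to capacity."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock  # injectable so the bucket can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Return True if one more log may be sent now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

During an error spike the bucket drains quickly and later logs are dropped, which bounds the bandwidth spent on reporting while still letting the first errors through.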

Resulting Dashboards

No 1 – Mini‑Program Overview

The homepage shows real‑time metrics for the last 15 minutes and long‑term trends, highlighting QPS, slow routes, and error rates with drill‑down capabilities.

No 2 – Long‑Term Trend Analysis

Provides weekly, monthly, and yearly views to correlate releases with metric shifts.

No 3 – Code Error Analysis

Drills down from aggregate error codes to specific routes and compares a ratio such as p = total_400 / (total_200 + total_400) against a threshold to decide whether manual intervention is needed.
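
The ratio p is straightforward to compute per route. In this sketch the 5% threshold is an arbitrary placeholder, since the article does not state the value used.

```python
def error_ratio(total_200: int, total_400: int) -> float:
    """p = total_400 / (total_200 + total_400); 0.0 when there is no traffic."""
    total = total_200 + total_400
    if total == 0:
        return 0.0
    return total_400 / total


def needs_manual_intervention(total_200: int, total_400: int,
                              threshold: float = 0.05) -> bool:
    """Flag a route when its error ratio exceeds the threshold (assumed 5%)."""
    return error_ratio(total_200, total_400) > threshold
```

For example, a route with 900 successful responses and 100 code = 400 responses has p = 0.1 and would be flagged under the assumed threshold.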

No 4 – Slow‑Route Risk

Highlights top‑10 slow routes, distinguishing between chronic latency and sudden spikes.
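
One simple way to separate chronic latency from a sudden spike is to compare the current latency with the route's own history. The rule below is an illustrative heuristic, not the dashboard's actual logic; the 1000 ms limit and the 3x factor are assumptions.

```python
from statistics import median


def classify_slow_route(history_ms, current_ms,
                        slow_ms: float = 1000.0,
                        spike_factor: float = 3.0) -> str:
    """Classify a route from its recent latency history (non-empty list, in ms).

    chronic:  the route is slow now and its historical median is also slow
    spike:    the route is slow now but is normally fast
    degraded: the route is slow now, with a smaller jump over its baseline
    ok:       the route is within the limit
    """
    baseline = median(history_ms)
    if current_ms < slow_ms:
        return "ok"
    if baseline >= slow_ms:
        return "chronic"
    if current_ms >= spike_factor * baseline:
        return "spike"
    return "degraded"
```

A chronic route is a candidate for optimization work, while a spike is more likely tied to a recent release or an incident in progress.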

No 5 – Route Detail Analysis

Examines code distribution, trend deviations, crawler impact, regional differences, and carrier and backend‑node anomalies.

No 6 – Crawler Analysis

Detects abnormal spikes in UA or client IP concentration to flag automated crawling.
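
Concentration can be measured as the share of requests coming from the single most frequent client IP (the same idea applies to UA strings). The 30% share and the 100-request minimum below are placeholder values, not figures from the article.

```python
from collections import Counter


def top_share(values):
    """Return (most common value, its share of all values) for a list."""
    counts = Counter(values)
    if not counts:
        return None, 0.0
    value, n = counts.most_common(1)[0]
    return value, n / len(values)


def looks_like_crawler(client_ips, max_share: float = 0.3,
                       min_requests: int = 100) -> bool:
    """Flag a window in which one IP accounts for an unusually large share.

    Small windows are skipped because a handful of requests from one IP
    is not meaningful evidence.
    """
    if len(client_ips) < min_requests:
        return False
    _, share = top_share(client_ips)
    return share > max_share
```

A production version would compare the share against the route's own baseline rather than a fixed value, since some routes are legitimately dominated by a few clients.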

No 7 – Program Exception Overview

Aggregates exception statistics for quick health checks.

Future Evolution

Validation will involve injecting synthetic failures (e.g., chaos engineering) to ensure the platform reliably surfaces issues. Promotion requires training, hackathons, and iterative improvements. Horizontal migration aims to extend the observability stack to other front‑ends (web, native apps) and middleware, enabling engineers to quantify gains such as reduced P99 latency or increased UV (unique visitors).

Ultimately, while the platform has moved from zero to one, continuous refinement is needed to further shorten MTTR and embed data‑driven root‑cause analysis into everyday engineering practice.
