
How Taobao Overhauled Mobile Diagnostics to Achieve 5‑15‑60 SLA

Taobao redesigned the diagnostics and logging architecture of its mobile client to meet a 5‑15‑60 goal: respond within 5 minutes, identify the issue within 15 minutes, and recover within 60 minutes. The redesign introduced scenario‑based monitoring, a standardized log protocol, snapshot collection, and change‑tracking SDKs to speed up issue detection, analysis, and resolution.

Alibaba Terminal Technology

Overview

Taobao, a large‑scale mobile application, aims to guarantee client stability with a 5‑15‑60 goal: respond to alerts within 5 minutes, locate issues within 15 minutes, and recover within 60 minutes. Existing monitoring and troubleshooting were insufficient due to coarse crash aggregation, limited client‑side data, and lack of change‑quality monitoring.

Diagnosis System Upgrade

The new diagnostic architecture introduces the concept of scenarios. Instead of treating each exception as an isolated event, a scenario combines an exception with one or more conditions, which allows richer and more precise data collection.

Client‑side exception data now includes standardized logs, full‑link trace data, runtime metrics, and snapshots. Because this data is structured and carries semantics, the platform can monitor, alert, visualize, and run a preliminary diagnosis on it.

Log System Upgrade

The log system was upgraded to improve write performance, compression, upload success rate, and data dashboards. A standardized log protocol defines five log types:

CodeLog – legacy, unstructured logs.

PageLog – records page navigation.

EventLog – records events such as foreground/background switches, network status, config changes, crashes, clicks, etc.

MetricLog – records runtime metrics like memory, CPU, and business‑specific indicators.

SpanLog – full‑link logs that connect distributed points, based on OpenTracing.

Standardized logs enable fast replay and analysis on the platform side, improving issue localization.
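
As a rough illustration of what a standardized protocol makes possible, the sketch below models the five log types as an enum and serializes each record to one JSON line that a platform could parse and replay. The field names (`name`, `payload`, `timestamp_ms`) and the JSON wire format are assumptions made for this example; the article does not describe the actual protocol layout.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from enum import Enum


class LogType(Enum):
    # The five types named in the article; the string values are illustrative.
    CODE = "CodeLog"
    PAGE = "PageLog"
    EVENT = "EventLog"
    METRIC = "MetricLog"
    SPAN = "SpanLog"


@dataclass
class LogRecord:
    log_type: LogType
    name: str  # e.g. a page name, event name, or metric name
    payload: dict = field(default_factory=dict)
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))

    def to_line(self) -> str:
        """Serialize to a single JSON line (hypothetical wire format)."""
        data = asdict(self)
        data["log_type"] = self.log_type.value
        return json.dumps(data, sort_keys=True)


def parse_line(line: str) -> LogRecord:
    """Rebuild a record from its JSON line, as a platform-side parser might."""
    data = json.loads(line)
    return LogRecord(
        log_type=LogType(data["log_type"]),
        name=data["name"],
        payload=data["payload"],
        timestamp_ms=data["timestamp_ms"],
    )
```

Because every record shares one envelope, the platform side can filter by type and order by timestamp without type‑specific parsers, which is the property that makes replay straightforward.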

Client‑Side Diagnosis Upgrade

Existing client tools (APM, TLOG, UT, Crash SDK, memory and UI‑jank detectors) provide valuable data, but that data is fragmented and the tools are not integrated with one another. New diagnostic and coloring SDKs were added to:

Integrate existing tools and write data using the standardized log protocol.

Listen to client‑side changes and generate coloring tags.

Capture snapshots (runtime info, change info) when exceptions occur.

Report scenario‑based data according to server‑side rules.

Support directed diagnostics and real‑time log upload.
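
The snapshot step in the list above can be pictured with a minimal sketch: the client keeps a small ring buffer of recent logs plus the coloring tags currently in effect, and bundles them with the stack trace when an exception occurs. The class name `SnapshotCollector`, its fields, and the in‑memory `snapshots` list (a stand‑in for an upload queue) are all hypothetical; the real SDK is not described at this level of detail.

```python
import time
import traceback
from collections import deque


class SnapshotCollector:
    """Keeps recent context so a snapshot can be taken when something fails."""

    def __init__(self, max_recent_logs: int = 50):
        self.recent_logs = deque(maxlen=max_recent_logs)  # ring buffer
        self.change_tags = set()  # coloring tags currently in effect
        self.snapshots = []       # stand-in for an upload queue

    def log(self, message: str) -> None:
        self.recent_logs.append(message)

    def on_change(self, tag: str) -> None:
        self.change_tags.add(tag)

    def capture(self, exc: BaseException) -> dict:
        """Bundle runtime info and change info at the moment of the exception."""
        snapshot = {
            "time_ms": int(time.time() * 1000),
            "exception": type(exc).__name__,
            "message": str(exc),
            "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
            "change_tags": sorted(self.change_tags),
            "recent_logs": list(self.recent_logs),
        }
        self.snapshots.append(snapshot)
        return snapshot
```

A caller would invoke `capture(e)` from its exception handler; the bounded buffer keeps memory use fixed no matter how long the app has been running.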

Change Monitoring

Most Taobao issues stem from online changes. A coloring SDK collects change data (configuration, AB tests, custom business changes) and generates a unique coloring identifier for each change. These identifiers let the platform check whether a change has taken effect, calculate crash rates for sessions running the changed code, and decide whether to continue the rollout or roll the change back.
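
To make the idea concrete, here is a minimal sketch of how coloring identifiers could feed a per‑change crash rate. The hashing scheme in `coloring_id`, the session tuple shape, and the `max_ratio` rollback threshold are assumptions chosen for illustration, not Taobao's actual algorithm.

```python
import hashlib
from collections import defaultdict


def coloring_id(change_type: str, key: str, version: str) -> str:
    """Derive a short, stable identifier for one change (hypothetical scheme)."""
    raw = f"{change_type}|{key}|{version}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:12]


def crash_rate_by_tag(sessions):
    """sessions: iterable of (tags, crashed) pairs; returns {tag: crash rate}."""
    totals = defaultdict(int)
    crashes = defaultdict(int)
    for tags, crashed in sessions:
        for tag in tags:
            totals[tag] += 1
            if crashed:
                crashes[tag] += 1
    return {tag: crashes[tag] / totals[tag] for tag in totals}


def should_roll_back(tag_rate: float, baseline_rate: float,
                     max_ratio: float = 2.0) -> bool:
    """Flag a change whose crash rate exceeds max_ratio times the baseline."""
    return tag_rate > baseline_rate * max_ratio
```

Tagging each session with the identifiers of the changes it received is what allows crash counts to be split by change rather than only by app version.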

Scenario‑Based Reporting

Scenario reporting automates data collection when an exception threshold is approached. The platform can pre‑define rules (trigger, condition, action) and push them to clients. Triggers include crashes, user screenshots, network errors, page errors, system overloads, business errors, and app start. Conditions span device/user/version info, network status, page context, and specific exception details. Actions include uploading TLOG, snapshot, memory info, or invoking other diagnostic tools.
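
The trigger, condition, action structure described above can be sketched as a small rule engine. Everything here is hypothetical: the `ScenarioRule` fields, the exact‑match condition semantics, and the action names such as `upload_tlog` are assumptions for the example, since the article does not give the rule schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ScenarioRule:
    rule_id: str
    trigger: str                # e.g. "crash", "network_error", "app_start"
    conditions: Dict[str, str]  # context fields that must match exactly
    actions: List[str]          # e.g. ["upload_tlog", "upload_snapshot"]


class ScenarioEngine:
    def __init__(self, rules: List[ScenarioRule]):
        self.rules = rules
        self.handlers: Dict[str, Callable[[dict], None]] = {}

    def register_action(self, name: str, handler: Callable[[dict], None]) -> None:
        self.handlers[name] = handler

    def on_trigger(self, trigger: str, context: dict) -> List[str]:
        """Run the actions of every rule matching this trigger and context."""
        executed = []
        for rule in self.rules:
            if rule.trigger != trigger:
                continue
            # Skip the rule if any condition field differs from the context.
            if any(context.get(k) != v for k, v in rule.conditions.items()):
                continue
            for action in rule.actions:
                handler = self.handlers.get(action)
                if handler is not None:
                    handler(context)
                    executed.append(f"{rule.rule_id}:{action}")
        return executed
```

Keeping rules as plain data is what lets a server push new ones to clients without shipping a new app version; only the action handlers need to exist in the client ahead of time.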

Scenario rules are managed on a dedicated platform with standard release, review, and gray‑release workflows. Data flow is throttled to avoid server overload, with thresholds and client‑side rate limits (Wi‑Fi only, daily limits, upload intervals).
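
The client‑side limits mentioned above (Wi‑Fi only, a daily cap, and a minimum upload interval) could be combined in a single gate, as in the sketch below. The class name, default values, and the choice of a UTC day bucket are assumptions for illustration only.

```python
class UploadLimiter:
    """Client-side gate: Wi-Fi only, a daily cap, and a minimum interval."""

    def __init__(self, daily_limit: int = 5, min_interval_s: float = 600.0,
                 wifi_only: bool = True):
        self.daily_limit = daily_limit
        self.min_interval_s = min_interval_s
        self.wifi_only = wifi_only
        self._day = None            # day index of the current counting window
        self._count = 0             # uploads allowed so far in that day
        self._last_upload_s = None  # time of the last allowed upload

    def allow(self, now_s: float, on_wifi: bool) -> bool:
        """Return True and record the upload if all limits permit it."""
        if self.wifi_only and not on_wifi:
            return False
        day = int(now_s // 86400)  # UTC day bucket; a real client may use local time
        if day != self._day:
            self._day, self._count = day, 0
        if self._count >= self.daily_limit:
            return False
        if (self._last_upload_s is not None
                and now_s - self._last_upload_s < self.min_interval_s):
            return False
        self._count += 1
        self._last_upload_s = now_s
        return True
```

Passing the current time in as an argument, rather than reading the clock inside `allow`, keeps the limiter easy to test and to reason about.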

Future Outlook

The diagnostic capability continues to evolve toward real‑time logs, remote debugging, full‑link data, and richer abnormal data. The next challenge is leveraging collected data for root‑cause analysis, impact assessment, knowledge‑base building, and eventually enabling client‑side self‑healing through dynamic degradation and automatic fixes.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
