How Alibaba’s Mobile Team Built a Full‑Stack Observability System to Boost App Performance
This article details Alibaba's mobile engineering team's original approach to full‑link observability, describing the challenges of the Taobao app architecture, the evolution of monitoring to observability, the Falco OpenTracing model, and practical performance optimizations that improve issue‑resolution efficiency and user experience.
App Architecture Challenges
Since 2013, Alibaba’s mobile technology has evolved through three stages: the Atlas container framework for large‑scale concurrent development, ACCS full‑duplex low‑latency channels, and dynamic cross‑platform frameworks such as Weex and Mini‑Programs. The result is a three‑layer architecture of business, framework/container, and infrastructure. Common problems include low operational efficiency, incomplete end‑to‑end tracing, inconsistent performance metrics, and the high cost of troubleshooting on the mobile PaaS.
(Figure 1 Taobao App architecture challenges)
Observability System
Observability is a philosophy rather than a concrete technology. Traditional monitoring raises high‑level alerts that something is wrong, while observability correlates three kinds of data (traces, logs, and metrics) to reveal why a component failed.
(Figure 2 Relationship between monitoring and observability)
Observability Key Data
Logs are derived from the TLOG system and can be structured into traces; metrics are aggregated values for macro analysis; traces record parent‑child relationships with detailed operation data, enabling both fine‑grained debugging and high‑level metric extraction.
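To make the relationship between traces and metrics concrete, here is a minimal sketch of how parent‑child spans can be walked for debugging and also aggregated into a macro metric. The `Span` fields, operation names, and sample values are illustrative assumptions, not the TLOG or Falco schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    # Hypothetical minimal span; real systems carry many more fields.
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    operation: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def children_of(spans: List[Span], trace_id: str, parent_id: str) -> List[Span]:
    """Fine-grained view: direct children of one span within one trace."""
    return [s for s in spans if s.trace_id == trace_id and s.parent_id == parent_id]


def mean_duration_by_operation(spans: List[Span]) -> Dict[str, float]:
    """Macro view: aggregate many spans into one metric per operation."""
    durations = defaultdict(list)
    for s in spans:
        durations[s.operation].append(s.duration_ms)
    return {op: sum(v) / len(v) for op, v in durations.items()}


# Made-up sample data: two traces of the same page load.
spans = [
    Span("t1", "1", None, "page_load", 0, 500),
    Span("t1", "1.1", "1", "network_request", 20, 320),
    Span("t1", "1.2", "1", "render", 330, 480),
    Span("t2", "1", None, "page_load", 0, 700),
]

print([s.operation for s in children_of(spans, "t1", "1")])
print(mean_duration_by_operation(spans))
```

The same span records serve both purposes: a single trace answers "what happened in this request", while the aggregate answers "how is this operation performing overall".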
(Figure 3 Observability key data)
Full‑Link Observability Architecture
The architecture is divided into four layers: Data (metric definitions and OpenTracing reporting), Domain (problem discovery, problem localization, and continuous performance optimization), Platform (benchmarking against competitors and driving performance), and Business (full‑link view across client and server).
(Figure 4 Full‑link observability architecture concept)
Mobile OpenTracing – Falco Architecture
Falco adopts the OpenTracing model to unify Logs, Metrics, and Traces on the client side. Its data model includes Span (core OpenTracing fields), Scene (business scenario), Layer (business, frameworkContainer, ability), Stages (standardized phases), Module (e.g., DX, MTOP), and Logs.
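The data model described above can be sketched as a single record type. This is a rough illustration only: the field names, the stage representation (stage name mapped to start time), and the sample values are assumptions, since the article lists the concepts but not the actual Falco schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Layer(Enum):
    # The three layers named in the article.
    BUSINESS = "business"
    FRAMEWORK_CONTAINER = "frameworkContainer"
    ABILITY = "ability"


@dataclass
class LogEvent:
    timestamp_ms: int
    message: str


@dataclass
class FalcoSpan:
    # Core OpenTracing-style fields.
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    operation_name: str
    start_ms: int
    finish_ms: int
    # Falco-specific dimensions (names here are hypothetical).
    scene: str    # business scenario
    layer: Layer
    module: str   # e.g. "DX" or "MTOP"
    # Stage name -> start time, assumed to be inserted in chronological order.
    stages: Dict[str, int] = field(default_factory=dict)
    logs: List[LogEvent] = field(default_factory=list)

    def stage_durations(self) -> Dict[str, int]:
        """Each stage lasts until the next stage starts; the last one ends at finish_ms."""
        names = list(self.stages)
        bounds = list(self.stages.values()) + [self.finish_ms]
        return {name: bounds[i + 1] - bounds[i] for i, name in enumerate(names)}


span = FalcoSpan(
    trace_id="a1b2c3d4",
    span_id="0.1",
    parent_span_id="0",
    operation_name="mtop.request",
    start_ms=100,
    finish_ms=400,
    scene="item_detail",
    layer=Layer.ABILITY,
    module="MTOP",
    stages={"send": 100, "wait": 120, "parse": 350},
)
print(span.stage_durations())
```

Keeping scene, layer, and module as explicit columns rather than free‑form tags is what allows spans from different modules to be grouped and compared on the same axes.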
(Figure 6 Falco data table model)
Falco Key Points
Unique, fast, short trace IDs.
TraceID and hierarchical Span IDs propagate end‑to‑end.
Bidirectional mapping between client trace IDs and backend EagleEye IDs for precise failure diagnosis.
Layered measurement enables consistent cross‑module performance comparison.
Structured event logging with columnar storage supports large‑scale aggregation.
Domain‑level problem data is persisted for continuous analysis.
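The first three points can be sketched together: a short random trace ID, hierarchical span IDs derived from the parent, and a two‑way lookup between client and backend trace IDs. The 16‑character hex format, the dotted span ID scheme, and the class names are assumptions for illustration; the article does not specify Falco's actual ID formats.

```python
import secrets
from typing import Dict, Optional


def new_trace_id() -> str:
    """Short random ID: 8 random bytes rendered as 16 hex characters (assumed format)."""
    return secrets.token_hex(8)


class SpanIdAllocator:
    """Allocates dotted hierarchical IDs such as 0, 0.1, 0.2, 0.1.1 (assumed scheme)."""

    ROOT = "0"

    def __init__(self) -> None:
        self._next_child: Dict[str, int] = {}

    def child_of(self, parent_id: str) -> str:
        n = self._next_child.get(parent_id, 0) + 1
        self._next_child[parent_id] = n
        return f"{parent_id}.{n}"


class TraceIdMap:
    """Bidirectional mapping between client trace IDs and backend trace IDs."""

    def __init__(self) -> None:
        self._client_to_backend: Dict[str, str] = {}
        self._backend_to_client: Dict[str, str] = {}

    def link(self, client_id: str, backend_id: str) -> None:
        self._client_to_backend[client_id] = backend_id
        self._backend_to_client[backend_id] = client_id

    def backend_for(self, client_id: str) -> Optional[str]:
        return self._client_to_backend.get(client_id)

    def client_for(self, backend_id: str) -> Optional[str]:
        return self._backend_to_client.get(backend_id)


ids = SpanIdAllocator()
first = ids.child_of(SpanIdAllocator.ROOT)   # "0.1"
second = ids.child_of(SpanIdAllocator.ROOT)  # "0.2"
nested = ids.child_of(first)                 # "0.1.1"

mapping = TraceIdMap()
client_trace = new_trace_id()
mapping.link(client_trace, "backend-trace-001")  # placeholder backend ID
```

With dotted IDs, the parent of any span can be recovered from the ID itself, and the two‑way map lets an engineer start from either a client report or a server‑side trace and reach the other side.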
Operational Practices Based on Falco
Improving log upload reliability, classifying logs for quick filtering, visualizing full‑link topologies, and extending EagleEye trace retention from minutes to days dramatically reduce issue‑resolution time.
(Figure 9 Problem‑driven user flow and operations system)
Macro Metric System
APM upgrades focus on user‑perceived metrics such as page‑on‑screen time, click response, and scroll frame rate, aligning data with real user experience.
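As one example of a user‑perceived metric, scroll frame rate can be estimated from the timestamps at which frames were presented during a scroll gesture. The sketch below is a generic calculation, not the APM implementation; the function names and the 16.7 ms frame budget (a 60 Hz display) are assumptions.

```python
from typing import List


def scroll_fps(frame_timestamps_ms: List[float]) -> float:
    """Average frames per second over one scroll gesture."""
    if len(frame_timestamps_ms) < 2:
        return 0.0
    elapsed_ms = frame_timestamps_ms[-1] - frame_timestamps_ms[0]
    if elapsed_ms <= 0:
        return 0.0
    # N timestamps delimit N - 1 frame intervals.
    return (len(frame_timestamps_ms) - 1) * 1000.0 / elapsed_ms


def slow_frame_count(frame_timestamps_ms: List[float], budget_ms: float = 16.7) -> int:
    """Number of frame intervals that exceeded the per-frame budget."""
    pairs = zip(frame_timestamps_ms, frame_timestamps_ms[1:])
    return sum(1 for prev, cur in pairs if cur - prev > budget_ms)


# Made-up sample: three smooth frames followed by one long stall.
timestamps = [0, 16, 32, 48, 100]
print(scroll_fps(timestamps))
print(slow_frame_count(timestamps))
```

An average alone can hide stalls that users notice, which is why the slow‑frame count is reported alongside it here.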
(Figure 10 Calibrated startup data trend)
Optimization Practices
Examples include simplifying MTOP network calls to reduce data copies and thread switches, enabling dual‑channel Wi‑Fi + cellular networking on Android to improve latency under weak networks, and applying image‑size grading for low‑end devices.
(Figure 21 Extreme‑call AB test results)
(Figure 22 Android dual‑channel network optimization)
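Image‑size grading can be illustrated with a small sketch: classify the device into a tier, then request a smaller image on lower tiers. The tier thresholds, scale factors, and function names below are hypothetical; the article does not give the values used in the Taobao app.

```python
# Hypothetical scale factors per device tier.
SCALE_BY_TIER = {"low": 0.5, "mid": 0.75, "high": 1.0}


def device_tier(ram_mb: int, cpu_cores: int) -> str:
    """Classify a device by RAM and core count (thresholds are illustrative only)."""
    if ram_mb < 3072 or cpu_cores < 4:
        return "low"
    if ram_mb < 6144:
        return "mid"
    return "high"


def graded_image_width(view_width_px: int, tier: str) -> int:
    """Width to request from the image service for a view of the given width."""
    return max(1, round(view_width_px * SCALE_BY_TIER[tier]))


tier = device_tier(ram_mb=2048, cpu_cores=8)
print(tier, graded_image_width(720, tier))
```

Requesting a smaller image reduces download size, decode time, and memory use, which matters most on the devices that have the least of each.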
Summary & Outlook
The article demonstrates how a full‑link observability system built on OpenTracing and Falco transforms Alibaba’s mobile operations from manual, low‑efficiency processes to data‑driven, automated performance optimization, while outlining remaining challenges and future directions for a comprehensive mobile observability ecosystem.
