Achieving 1‑5‑10 Front‑End Monitoring with JSTracker for Double‑11

This article explains how the JSTracker platform was used to build a comprehensive end‑to‑end front‑end monitoring and data analysis solution that meets the 1‑5‑10 safety production goal—detecting issues within one minute, locating them in five, and fixing them in ten—by improving coverage, subscription, metrics, and gray‑release monitoring for Alibaba’s Double‑11 promotion.

Taobao Frontend Technology

Monitoring is the front line of safe production; effective alarm coverage, online issue detection, and rapid problem localization are core capabilities.

The overall safe‑production goal is 1‑5‑10: detect an issue within 1 minute, locate it within 5 minutes, and fix it within 10 minutes.

JSTracker is an end‑to‑end front‑end monitoring and data analysis platform. It offers real‑time monitoring, multi‑endpoint coverage, data analysis, and intelligent capabilities, and it supports the stability of Alibaba’s Double‑11.

This article describes how we built a 1‑5‑10 solution on JSTracker to ensure stability of core businesses such as Taobao Live, venues, stores, interaction, and transactions during the Double‑11 promotion.

Current Situation

Although front‑end cross‑platform solutions have matured, fault detection rates remain low. An analysis of FY18‑FY20 front‑end incidents shows that monitoring detected fewer than 30% of them on average, and the average repair time exceeded one hour.

Problem Discovery

Most failures are not discovered promptly. The main cause is not ineffective alarms; it is that issues surface passively through online feedback, public opinion, or complaints instead of through monitoring. Specific problems include:

Business not integrated with monitoring; lack of safety awareness; manual SDK injection leaves many pages unmonitored.

Core metrics not subscribed; many pages have monitoring but no alarm or incomplete metric subscription.

Incomplete monitoring metrics; traditional front‑end monitoring focuses on runtime errors (jserror, API failures) and misses metrics such as CDN node errors, white‑screen, or crash events.

Rapid Recovery

Average recovery time is far from the 10‑minute target. A complete development flow runs development → release → online verification and regression.

If an issue is already released, achieving 10‑minute recovery is difficult; focus must shift to pre‑release development and release stages, emphasizing:

Pre‑release : comprehensive automated testing, e.g., detecting resource anomalies and JSErrors before release.

Release process : enable gray release, monitoring, and rollback capabilities.

Overall Solution

Centered on the 1‑5‑10 goal, solutions for “problem discovery” and “rapid recovery” are as follows:

Monitoring Coverage

Address coverage by improving access, subscription, and metric coverage to ensure 100% business integration and complete metric set.

Access Coverage

Two dimensions: infrastructure improvement and business governance.

Infrastructure : unify default access in solution layer; standardize monitoring collection and data specifications across source code and build layers.

Business governance : use team‑level statistics to measure page safety scores and drive rapid integration.

Business governance requires metric statistics and measurement methods:

Metric Statistics

Collect various dimensions (team, time, etc.) to aggregate metric data; core ideas include constructing full‑path employee IDs and using LIKE queries for hierarchical retrieval.
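
The full‑path idea can be sketched as follows. This is a minimal illustration, not JSTracker’s actual schema: the `Employee` shape, the `/` separator, and the names `buildFullPath` and `pagesUnderTeam` are all assumptions. The prefix test plays the role that a SQL `LIKE 'path%'` query would play in a database.

```typescript
// Hypothetical record: each employee points at their manager (null at the root).
interface Employee {
  id: string;
  managerId: string | null;
}

// Build a full path such as "/001/017/204/" by walking up the management chain.
function buildFullPath(id: string, byId: Map<string, Employee>): string {
  const chain: string[] = [];
  const seen = new Set<string>();
  let current: string | null = id;
  while (current !== null && !seen.has(current)) {
    seen.add(current); // guard against accidental cycles in the org data
    chain.unshift(current);
    const emp: Employee | undefined = byId.get(current);
    current = emp ? emp.managerId : null;
  }
  // The trailing slash keeps "/001/01" from matching "/001/017/".
  return "/" + chain.join("/") + "/";
}

// Equivalent of `WHERE owner_path LIKE '<teamPath>%'`: all pages owned under a lead.
function pagesUnderTeam(
  pages: { url: string; ownerPath: string }[],
  teamPath: string
): string[] {
  return pages
    .filter((p) => p.ownerPath.startsWith(teamPath))
    .map((p) => p.url);
}
```

Storing the path once per owner means any level of the hierarchy can be aggregated with a single prefix filter, without recursive lookups at query time.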

Measurement & Red‑Black List

Metrics support business decisions via a pipeline: raw data → analysis → metric measurement → business decisions, establishing a metric model for rapid issue detection.

Subscription Coverage

Many pages only subscribe to jserror, ignoring white‑screen, crash, etc. The solution includes:

Metric subscription completion : use governance process to identify unsubscribed pages and apply one‑click subscription.

Release subscription : after page release, subscribe to core metrics incrementally.
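
Identifying unsubscribed pages amounts to comparing each page’s subscriptions against a required metric list. The sketch below assumes a hypothetical `PageSubscription` record and metric identifiers based on the metric names used in this article; the real platform’s data model may differ.

```typescript
// Core metrics named in the article; the exact identifiers are assumptions.
const CORE_METRICS: string[] = ["jserror", "white-screen", "crash"];

// Hypothetical shape: one record per monitored page.
interface PageSubscription {
  url: string;
  subscribed: string[];
}

// Return, per page, the core metrics that still lack an alarm subscription.
// Pages that already subscribe to every core metric are omitted.
function findMissingSubscriptions(
  pages: PageSubscription[],
  core: string[] = CORE_METRICS
): { url: string; missing: string[] }[] {
  return pages
    .map((p) => ({
      url: p.url,
      missing: core.filter((m) => p.subscribed.indexOf(m) === -1),
    }))
    .filter((r) => r.missing.length > 0);
}
```

The resulting list could feed a one‑click subscription action, or run after each release to subscribe newly published pages.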

Metric Coverage

Cross‑endpoint pages go through three stages: container start → resource loading → page execution. Monitoring points include:

Container layer : detect white‑screen, crash in weex, webview, etc.

Origin layer : CDN anomalies not visible from front‑end.

Page layer : SDK captures global exception points as metrics.

Full‑chain stability requires unified data ingestion and page‑aligned metrics.

Gray Release Monitoring

Rapid recovery also needs faster issue detection and rollback. About 80% of online problems stem from changes; gray monitoring distinguishes new‑version logs to detect error rate increases.

Key steps:

Metric collection : scripts read global variables; containers obtain gray flag from response headers.

Monitoring metrics : standardize gray field in logs; mini‑programs differentiate by version.

Gray application : real‑time gray log presentation and alerts.

Collection Standards

Two parts: field specifications and integration specifications. Field specs unify log fields across sources; integration specs embed gray status in page templates.

Field Specification

Integration Specification

<code><meta name="page-tag" content="env=spe,grey=true,version=0.0.1" /></code>
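
A monitoring SDK could turn that `content` string into structured fields along these lines. This is a sketch under assumptions: the `PageTag` interface and `parsePageTag` function are illustrative names, and only the three keys shown in the tag above are handled.

```typescript
// Parsed form of the comma-separated key=value pairs in the page-tag content.
interface PageTag {
  env?: string;
  grey: boolean;
  version?: string;
}

function parsePageTag(content: string): PageTag {
  const fields: { [key: string]: string } = {};
  for (const pair of content.split(",")) {
    const idx = pair.indexOf("=");
    if (idx <= 0) continue; // skip malformed or empty segments
    fields[pair.slice(0, idx).trim()] = pair.slice(idx + 1).trim();
  }
  return {
    env: fields["env"],
    grey: fields["grey"] === "true", // anything other than "true" counts as not gray
    version: fields["version"],
  };
}
```

In a browser, the input would typically come from `document.querySelector('meta[name="page-tag"]')?.getAttribute("content")`; keeping the parser free of DOM access makes it easy to test.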

Collection Methods

Address SDK and cross‑platform container limitations:

Monitoring SDK : due to browser restrictions, use global variables or meta tags.

Cross‑platform containers : cannot access template content; rely on response headers for version info.

Two standard ways to notify the client of the current release state:

Inject version and gray status into response headers.

Inject into page template as global parameters during rendering.

Example for web SDK using meta tag:

<code><meta name="" content="{{ $page.isGreyPage }}" /></code>

Container side reads response headers, though it depends on client updates.
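
The two channels can be combined in one resolver that prefers response headers and falls back to template globals. The header names `x-page-grey` and `x-page-version` and the `globals` shape are hypothetical; the article does not specify them.

```typescript
interface ReleaseState {
  grey: boolean;
  version: string | null;
  source: "header" | "global" | "none";
}

// Hypothetical header names; substitute whatever the gateway actually injects.
const GREY_HEADER = "x-page-grey";
const VERSION_HEADER = "x-page-version";

// Look up a key, mapping a missing entry to null.
function pick(map: { [name: string]: string }, key: string): string | null {
  const value = map[key];
  return value === undefined ? null : value;
}

function resolveReleaseState(
  headers: { [name: string]: string },
  globals: { isGreyPage?: boolean; version?: string } | undefined
): ReleaseState {
  // HTTP header names are case-insensitive, so normalize before lookup.
  const lower: { [name: string]: string } = {};
  for (const name of Object.keys(headers)) {
    const value = headers[name];
    if (value !== undefined) lower[name.toLowerCase()] = value;
  }
  const greyHeader = pick(lower, GREY_HEADER);
  if (greyHeader !== null) {
    return {
      grey: greyHeader === "true",
      version: pick(lower, VERSION_HEADER),
      source: "header",
    };
  }
  if (globals && globals.isGreyPage !== undefined) {
    return {
      grey: globals.isGreyPage,
      version: globals.version === undefined ? null : globals.version,
      source: "global",
    };
  }
  return { grey: false, version: null, source: "none" };
}
```

Recording the `source` alongside the flag makes it possible to see later which channel a given log line relied on.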

Gray Monitoring Alerts

Gray alert flow:

1. Subscribe to page release messages to store or delete release info (address, gray ratio, publisher, etc.).

2. Gray alerts poll every 5 minutes; compare recent logs (30 min) with longer‑term logs (12 h); if similarity < 50 % treat as new log.
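
The comparison in step 2 can be sketched as below. The article does not say how similarity is computed, so this example assumes Jaccard similarity over word tokens; `tokenize`, `jaccard`, and `findNewLogs` are illustrative names. The caller is expected to pass the last 30 minutes of messages as `recent` and the last 12 hours as `baseline`, and to invoke the check on each 5‑minute poll.

```typescript
// Split a message into a set of lowercase word tokens.
function tokenize(message: string): Set<string> {
  return new Set(
    message
      .toLowerCase()
      .split(/[^a-z0-9_$]+/)
      .filter((t) => t.length > 0)
  );
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, a value in [0, 1].
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((t) => {
    if (b.has(t)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// From the article: similarity below 50% means the log is treated as new.
const NEW_LOG_THRESHOLD = 0.5;

// Return recent messages whose best match in the baseline is below the threshold.
function findNewLogs(recent: string[], baseline: string[]): string[] {
  const baselineTokens = baseline.map(tokenize);
  return recent.filter((msg) => {
    const tokens = tokenize(msg);
    let best = 0;
    for (const b of baselineTokens) {
      best = Math.max(best, jaccard(tokens, b));
    }
    return best < NEW_LOG_THRESHOLD;
  });
}
```

With an empty baseline every recent message counts as new, which is the conservative behavior for a page that has just been released.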

Real‑Time Gray Monitoring

After metric collection, logs include a gray field to differentiate versions; comparison points include gray‑to‑online error‑rate ratio and error‑log trend.
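
The gray‑to‑online error‑rate ratio can be computed as in the following sketch. It assumes error rate means errors divided by page views, and the alert thresholds (`maxRatio`, `minPageViews`) are hypothetical defaults, not values from the platform.

```typescript
// Aggregated counts for one version (gray or online) over the same time window.
interface VersionStats {
  pageViews: number;
  errors: number;
}

// Assumed definition: errors per page view; 0 when there is no traffic.
function errorRate(s: VersionStats): number {
  return s.pageViews > 0 ? s.errors / s.pageViews : 0;
}

// Ratio of gray error rate to online error rate.
// Returns Infinity when only the gray version has errors, and 1 when neither has.
function greyToOnlineRatio(grey: VersionStats, online: VersionStats): number {
  const g = errorRate(grey);
  const o = errorRate(online);
  if (o === 0) return g === 0 ? 1 : Infinity;
  return g / o;
}

// Hypothetical alert rule: enough gray traffic and a ratio above the limit.
function shouldAlert(
  grey: VersionStats,
  online: VersionStats,
  maxRatio: number = 2,
  minPageViews: number = 100
): boolean {
  if (grey.pageViews < minPageViews) return false; // too little gray traffic to judge
  return greyToOnlineRatio(grey, online) > maxRatio;
}
```

Comparing rates instead of raw counts matters because the gray version usually receives only a small share of traffic.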

Results

Current C‑side page monitoring coverage is 98%, including source, build, and mini‑program pages. For the main venue, monitoring identified over 10 module development issues across pre‑sale, pre‑heat, and official phases.

Monitoring Dashboard

With full coverage in place, a DataV dashboard provides a global view of anomalies on core pages.

Case: during Double‑11 evening, weex error logs spiked; investigation traced to a page JS execution error.

Metric Coverage

Crash logs rose due to a client push configuration; alerts enabled immediate response.

Gray Monitoring

Interactive business : after a new feature release, gray error ratio increased; timely rollback prevented larger impact.

Conclusion

With monitoring coverage, gray monitoring, and related capabilities, we can better avoid and detect issues, enabling faster and more reliable business operations. Challenges remain in alarm subscription accuracy and metric analysis, and we will continue to improve our monitoring capabilities.
