How Baidu’s Ad Hosting Team Built a Scalable Front‑End Exception Monitoring System
This article shares how Baidu's ad-hosting team designed its front-end exception monitoring, from collecting and alerting on exceptions to investigating and remediating them. It covers generic and business-specific error tracking, data protocols, monitoring strategies, alert tuning, and practical governance to improve user experience and ad performance.
1. Introduction
Rapid product iteration requires monitoring beyond simple behavior and performance metrics. Front‑end exception monitoring captures the real user experience, enabling early detection of issues that could cause revenue loss.
2. Business Background
The team supports Baidu’s ad‑hosting platform, which serves millions of visits across web pages, mini‑programs, and HN (React‑Native‑like) pages. The goals are smooth reading and interaction for end users and high‑quality guarantees for advertisers.
2.1 Problem Statement
After stabilising back‑end monitoring, several front‑end problems remained hard to trace: rendering anomalies, static‑resource load failures, API errors, and JavaScript execution exceptions. These issues often appear only under low‑traffic or A/B‑test scenarios, making early detection critical.
3. Exception Collection
3.1 Generic Exception Collection
Exceptions are sent as telemetry points to a collection service. Two main scenarios are covered:
Resource‑load failures (e.g., missing images or script files)
Runtime errors caused by compatibility issues or unhandled edge cases
Two practical approaches for resource‑load errors:
Attach onerror handlers to individual resources (e.g., via script-ext-html-webpack-plugin).
Listen globally with window.addEventListener('error', fn, true). Resource error events do not bubble, so the listener must be registered for the capture phase to see them.
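The global listener can be sketched as follows. This is a minimal illustration, not the team's implementation: it assumes a resource failure can be recognised because the event target is an element with a tag name, and the names classifyErrorEvent and report are hypothetical.

```javascript
// Classify an event received by a capture-phase 'error' listener.
// Assumption: resource failures fire on the element itself (the target has a tagName),
// while runtime errors fire on window and carry a message.
function classifyErrorEvent(event) {
  const target = event && event.target;
  if (target && typeof target.tagName === 'string') {
    return {
      type: 'resource',
      tag: target.tagName.toLowerCase(),
      url: target.src || target.href || '',
    };
  }
  return { type: 'runtime', message: (event && event.message) || '' };
}

// Hypothetical reporter; a real one would send the point to the collection service.
function report(point) {
  console.log('exception point', JSON.stringify(point));
}

// Install only in a browser. The third argument `true` selects the capture phase.
if (typeof window !== 'undefined') {
  window.addEventListener('error', (event) => {
    const point = classifyErrorEvent(event);
    if (point.type === 'resource') report(point);
  }, true);
}
```

Keeping the classification in a pure function makes it easy to unit-test outside a browser; only the last few lines touch window.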
For runtime errors the team uses:
window.onerror = fn
and also captures unhandled promise rejections:
window.addEventListener('unhandledrejection', fn)
Framework-specific handlers are employed when available:
React: componentDidCatch (error boundaries). See React docs: https://reactjs.org/docs/error-boundaries.html
Vue: Vue.config.errorHandler = (err, vm, info) => {} – supports lifecycle, custom-event and v-on errors in recent versions.
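A rough sketch of the two global runtime hooks is shown below. The record shape (kind, message, stack and so on) and the send transport are assumptions made for illustration; the article does not specify them.

```javascript
// Normalise the arguments of window.onerror into one record.
function fromOnError(message, source, lineno, colno, error) {
  return {
    kind: 'js',
    message: String(message),
    source: source || '',
    line: lineno || 0,
    column: colno || 0,
    stack: error && error.stack ? String(error.stack) : '',
  };
}

// Normalise an unhandledrejection event; `reason` may be any value, not only an Error.
function fromRejection(event) {
  const reason = event ? event.reason : undefined;
  const isError = reason instanceof Error;
  return {
    kind: 'promise',
    message: isError ? reason.message : String(reason),
    stack: isError && reason.stack ? String(reason.stack) : '',
  };
}

// Install only in a browser.
if (typeof window !== 'undefined') {
  // Hypothetical transport; replace with the real collection call.
  const send = (record) => console.log('exception point', JSON.stringify(record));
  window.onerror = (message, source, lineno, colno, error) => {
    send(fromOnError(message, source, lineno, colno, error));
    return false; // keep the browser's default console logging
  };
  window.addEventListener('unhandledrejection', (event) => send(fromRejection(event)));
}
```

Both normalisers produce the same kind of flat record, so downstream code does not need to know which hook caught the error.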
When only "Script error" is reported (the browser hides the details of errors thrown by cross-origin scripts), adding the crossorigin attribute to script tags and configuring Access-Control-Allow-Origin on the CDN restores full error details.
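As an illustration, the helpers below detect an opaque cross-origin error and build a script tag that opts in to CORS. The function names and the record fields (message, line, source) are hypothetical, and the check assumes the common browser behaviour of reporting "Script error." with no line number or source URL.

```javascript
// True when a runtime error record carries no usable detail, which is what
// browsers report for errors thrown inside cross-origin scripts loaded without CORS.
function isOpaqueScriptError(record) {
  const message = String(record.message || '').trim().toLowerCase();
  return message.startsWith('script error') && !record.line && !record.source;
}

// Build a script tag that opts in to CORS so that errors keep their details.
// The CDN must also answer with an Access-Control-Allow-Origin header.
function corsScriptTag(src) {
  const safeSrc = String(src).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
  return `<script src="${safeSrc}" crossorigin="anonymous"></script>`;
}
```

Records flagged by isOpaqueScriptError are candidates for a blacklist until the tag and the CDN header are both in place.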
3.2 Business‑Specific Exception Collection
Beyond generic telemetry, the team defines custom business exception points that carry additional context (e.g., user ID, product line ID). This enriches downstream analysis and helps locate root causes invisible in generic stack traces.
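One way to picture a business exception point is a small reporter bound to the shared context. The sketch below is an assumption about how such a helper might look: createBusinessReporter, the point fields, and the sample values are all hypothetical.

```javascript
// Create a reporter bound to business context shared by every point.
// `send` is injected so the transport (beacon, image ping, fetch) stays swappable.
function createBusinessReporter(context, send) {
  return function reportBusinessException(name, detail = {}) {
    const point = {
      type: 'business',
      name,
      timestamp: Date.now(),
      context: { ...context },
      detail: { ...detail },
    };
    send(point);
    return point;
  };
}

// Usage with hypothetical field values; points are collected in an array here.
const sent = [];
const reportBiz = createBusinessReporter(
  { userId: 'u-123', productLineId: 'pl-9' },
  (point) => sent.push(point),
);
reportBiz('form_submit_failed', { step: 'phone-verify' });
```

Because the context is attached at creation time, call sites only name the exception and add whatever detail is specific to that failure.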
3.3 Collection Protocol
A unified schema stores three top‑level keys:
{
  "exception": /* stack info */,
  "request": /* page request info */,
  "meta": {
    "xxx": /* business fields */,
    "extra": { /* extensible custom fields */ }
  }
}
The extra field is stored as a JSON string to avoid costly schema changes in BaikalDB (a distributed database open-sourced by Baidu, see https://github.com/baidu/BaikalDB).
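The sketch below shows how a payload in this layout might be assembled, with extra serialised to a string before storage. The inner fields of exception and request, and the businessLine field standing in for the elided business fields, are assumptions for illustration only.

```javascript
// Build a collection payload following the three-key layout described above.
// `businessLine` stands in for the elided business fields and is hypothetical.
function buildPayload({ error, request, businessLine, extra = {} }) {
  return {
    exception: {
      message: error.message,
      stack: error.stack || '',
    },
    request: {
      url: request.url,
      userAgent: request.userAgent,
    },
    meta: {
      businessLine,
      // Serialised so that new custom fields need no new storage column.
      extra: JSON.stringify(extra),
    },
  };
}

// Reading side: parse `extra` defensively, since its content is free-form.
function readExtra(payload) {
  try {
    return JSON.parse(payload.meta.extra);
  } catch (err) {
    return {};
  }
}
```

The trade-off is that fields inside extra cannot be filtered as ordinary columns; anything used regularly in monitoring conditions is better promoted to a named business field.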
4. Monitoring & Alerting
Collected data populates a wide “exception” table. Monitoring items are defined by filtering columns (e.g., URL query, business line) and can be combined into aggregated conditions.
Key alert‑tuning dimensions:
Aggregation window: Real-time (≈30 s) for high-impact ad-conversion errors; longer windows for less volatile metrics.
Trigger mechanism: Threshold-based for stable metrics; deviation-based (day-over-day, week-over-week) for fluctuating patterns.
Alert recipients: Multi-channel (email, IM, SMS) with no single-point dependency.
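The two trigger mechanisms can be contrasted with a small sketch. The function names and the numbers are illustrative assumptions; the article does not give the team's actual thresholds or formulas.

```javascript
// Threshold trigger: fire when the count in the aggregation window exceeds a fixed limit.
function thresholdTriggered(count, limit) {
  return count > limit;
}

// Deviation trigger: fire when the current window departs from a baseline window
// (for example the same time yesterday or last week) by more than `maxRelativeChange`.
function deviationTriggered(current, baseline, maxRelativeChange) {
  if (baseline === 0) return current > 0; // any error is a deviation from zero
  return Math.abs(current - baseline) / baseline > maxRelativeChange;
}

// Example with illustrative window counts.
console.log(thresholdTriggered(12, 10));        // true
console.log(deviationTriggered(130, 100, 0.5)); // false (30% change)
console.log(deviationTriggered(180, 100, 0.5)); // true (80% change)
```

The deviation check uses the absolute change, so a sharp drop also fires; a sudden fall in reported exceptions can mean the collection path itself is broken.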
Practical refinements include filtering out crawler traffic, focusing on commercial traffic, and maintaining a blacklist for harmless errors such as cross‑origin “Script error”. Example refined condition:
businessLine = 'xxx' && errorType = 'js' && commercialFlag != '' && ua NOT LIKE 'crawler%' && errorMessage NOT LIKE 'Script error'
5. Exception Investigation
When an alert fires, investigators aggregate dimensions (IP, UA, device ID, URL) to pinpoint the root cause. Example: a cluster of resource‑load failures was traced to a regional network outage after IP aggregation.
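Dimension aggregation amounts to a group-and-count over the alerting records. The helper below is a minimal sketch; topByDimension and the ipPrefix field in the usage example are hypothetical names, not part of the team's platform.

```javascript
// Count exception records per value of one dimension and return the top entries.
function topByDimension(records, dimension, limit = 3) {
  const counts = new Map();
  for (const record of records) {
    const key = record[dimension] ?? '(unknown)';
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([value, count]) => ({ value, count, share: count / records.length }));
}

// Usage with made-up records: one IP prefix dominates the failures.
const sample = [
  { ipPrefix: '10.1' }, { ipPrefix: '10.1' }, { ipPrefix: '10.1' }, { ipPrefix: '10.2' },
];
console.log(topByDimension(sample, 'ipPrefix', 1));
```

A single value holding a large share of the failures points to a localised cause (one region, one device model, one URL), while an even spread suggests a code or release problem.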
To decode minified stack traces, source‑map files are uploaded to the error‑analysis platform, enabling direct mapping to original source locations.
6. Exception Remediation
Remediation follows a four‑step loop:
Identify recurring business‑impactful exceptions from generic monitoring.
Define dedicated monitoring items for each exception type.
Deploy targeted fixes (e.g., CDN usage, resource compression, retry mechanisms, code refactoring).
Validate improvement via post‑deployment metrics.
Exception types are categorised into JS execution, API errors, image‑resource failures, and script‑resource failures. Data cleaning removes test traffic and applies a blacklist for non‑impactful errors.
Normalization uses “exceptions per ad click” to compare across product lines with differing traffic volumes. For network‑related failures, baseline targets are set using the 80th percentile of historical data.
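The normalisation and the percentile baseline can be sketched as below. The nearest-rank percentile method is an assumption, since the article does not say which definition the team uses, and the sample values are made up.

```javascript
// Exceptions per ad click, so product lines with different traffic are comparable.
function exceptionsPerClick(exceptionCount, adClicks) {
  return adClicks > 0 ? exceptionCount / adClicks : 0;
}

// Nearest-rank percentile (an assumed method): the smallest sample value
// such that at least p percent of the samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) throw new Error('percentile of empty sample');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Usage with made-up daily rates for one product line.
const dailyRates = [0.01, 0.02, 0.03, 0.04, 0.05];
console.log(percentile(dailyRates, 80)); // 0.04
```

Using a ratio rather than a raw count also keeps the metric stable when traffic shifts between product lines during an experiment.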
Typical remediation actions include:
Switching to CDN links and compressing images (or using WebP).
Implementing retry logic for API, script, and image requests.
Iteratively optimising JS execution errors by creating dedicated monitors, deploying fixes, and observing metric drops.
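Retry logic for API requests can be as small as the wrapper below. It is a sketch under stated assumptions: withRetry is a hypothetical helper, and the fixed delay and default of three attempts are illustrative, not the team's settings.

```javascript
// Retry an async operation with a fixed delay between attempts.
// `attempts` is the total number of tries, including the first one (must be >= 1).
async function withRetry(operation, { attempts = 3, delayMs = 200 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError; // every attempt failed; surface the last error to the caller
}
```

The operation receives the attempt number, which lets a caller switch to a fallback URL on later tries. Retries suit idempotent reads such as fetching a script or image; requests with side effects need more care before being repeated.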
After systematic remediation, most exception metrics showed a measurable decline, and A/B experiments confirmed improvements in app-download and lead conversion rates.
7. Conclusion
The team built an end‑to‑end front‑end exception monitoring and governance pipeline that transforms raw error signals into actionable business insights, reduces loss, and boosts ad performance. Ongoing work focuses on further automation, richer data models, and tighter integration with task‑management systems.
