How Bilibili Built a Scalable Front‑End Error Monitoring System from Scratch
This article details Bilibili's end‑to‑end front‑end error monitoring solution, covering the custom SDK, error capture and classification, unique ID generation, filtering, white‑screen detection, data pipelines, APM visualisation, lifecycle plugins, one‑click alerts, and future roadmap, all backed by real‑world metrics and code examples.
Background
Since 2023 the Bilibili front‑end team has iteratively built a comprehensive error‑monitoring platform, now running in over 85% of business lines and more than 1,700 projects. By early 2024 the APM platform integrated 210+ projects, and the one‑click alarm feature serves over 300 projects.
Why Build a Custom Solution?
Although Sentry is a popular choice, Bilibili required tighter integration with its data pipeline and customisable capabilities such as:
Self‑hosted SDK with separate business and technical reporting channels, supporting legacy scripts.
Fine‑grained data cleaning, filtering, and multi‑dimensional analysis.
Custom visual dashboards, one‑click alerts, traceability, and direct internal platform integration.
SDK Overview (bili‑mirror)
The bili‑mirror SDK, after more than a year of iteration, provides the following core functions:
// Capture synchronous runtime errors and resource loading errors.
// The third argument (useCapture = true) is needed because resource
// error events do not bubble up to window.
window.addEventListener('error', (event) => {
  // analyse error → report
}, true);
// Capture unhandled Promise rejections
window.addEventListener('unhandledrejection', (event) => {
  // analyse error → report
});
Errors are classified into JavaScript runtime errors and resource loading errors. The SDK distinguishes them as follows:
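export const handleJsError = ev => {
  const target = ev.target;
  if (!target?.localName) {
    // JS runtime error: the target is window (or missing), which has no localName
  } else {
    // Resource loading error: the target is the element that failed to load
  }
};
export const handlerRejection = ev => {
  // Unhandled Promise rejection: the reason is available as ev.reason
};
Resource Error Handling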
Resource errors are identified by the target element's localName (lowercase, e.g., img, script, link) and the failing src or href, then assembled into a report payload.
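A minimal sketch of how such a payload could be assembled is shown here. The function name buildResourcePayload and the payload fields (type, tag, url, message) are assumptions made for illustration, not the SDK's actual schema. Plain objects stand in for DOM events so the snippet also runs outside a browser.

```javascript
// Hypothetical helper: builds a report payload from an `error` event whose
// target is an element. Field names are illustrative assumptions.
function buildResourcePayload(ev) {
  const target = ev && ev.target;
  const tag = target && target.localName; // lowercase, e.g. 'img', 'script', 'link'
  if (!tag) return null; // not a resource error (target is window or missing)

  // <img>, <script>, <audio>, <video> expose `src`; <link> exposes `href`.
  const url = target.src || target.href || '';
  return {
    type: 'resource',
    tag,
    url,
    message: `${tag} failed to load: ${url}`,
  };
}

// In a browser this would be registered in the capture phase, because
// resource error events do not bubble:
//   window.addEventListener('error', (ev) => buildResourcePayload(ev), true);

// Plain objects stand in for DOM events here.
const imgPayload = buildResourcePayload({
  target: { localName: 'img', src: 'https://example.com/a.png' },
});
const jsPayload = buildResourcePayload({ target: {} });
console.log(imgPayload, jsPayload);
```

Returning null for events without an element target lets the caller fall through to the JS runtime error branch.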
Unique Error ID Generation
To avoid reporting the same error repeatedly within a session, the SDK derives a deterministic ID by Base64‑encoding the error message (or the file name when no message is available) with window.btoa, and skips any ID it has already seen:
// IDs already reported in this session
const ERROR_ID = [];
export function getErrorId(val) {
  // Percent-encode first: btoa throws on characters outside Latin-1
  return window.btoa(encodeURIComponent(val));
}
const getIsReportId = error => {
  const id = getErrorId(error?.message || error?.fileName);
  if (ERROR_ID.includes(id)) {
    console.warn(`Duplicate error, not reported, ${error?.message}`);
    return false;
  }
  ERROR_ID.push(id);
  return true;
};
Filtering Configuration
Filtering logic merges base, page, and top‑level KV configurations, then applies whitelist/blacklist rules for resources, user‑agents, and custom messages before reporting.
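As a rough illustration of the merge-and-match step, the sketch below shows one way a layered configuration could be merged and a payload checked against a rule list. The merge semantics (nested objects merged, arrays replaced wholesale), the matchesAnyRule helper, and all configuration values are assumptions; the SDK's own deepMerge and handlerFilter may behave differently.

```javascript
// Hypothetical deep merge: later sources override earlier ones, nested plain
// objects are merged recursively, arrays and primitives are replaced wholesale.
function isPlainObject(v) {
  return v !== null && typeof v === 'object' && !Array.isArray(v);
}

function deepMerge(base, override) {
  const out = { ...base };
  for (const [key, value] of Object.entries(override || {})) {
    out[key] = isPlainObject(value) && isPlainObject(out[key])
      ? deepMerge(out[key], value)
      : value;
  }
  return out;
}

// Hypothetical matcher: true when the payload message matches any rule.
// Rules may be substrings or RegExp objects.
function matchesAnyRule(payload, rules = []) {
  const message = (payload && payload.message) || '';
  return rules.some((rule) =>
    rule instanceof RegExp ? rule.test(message) : message.includes(rule)
  );
}

// Illustrative configuration layers (values are made up).
const baseConfig = { white: { resource: ['favicon.ico'], ua: ['Googlebot'] } };
const pageConfig = { white: { resource: ['ads.example.com'] } };
const topConfig = { sampleRate: 0.5 };

const config = deepMerge(deepMerge(baseConfig, pageConfig), topConfig);
const skip = matchesAnyRule(
  { message: 'script failed to load: https://ads.example.com/x.js' },
  config.white.resource
);
console.log(config, skip);
```

Replacing arrays rather than concatenating them keeps a page-level list authoritative; a real SDK might prefer to concatenate so that base rules always apply.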
// Merge the base, page, and top-level configuration layers
let config = deepMerge(baseConfig, pageConfig);
config = deepMerge(config, topConfig);
const errorData_resource = handleResourceError(ev);
// Skip events that produce no payload or an empty message
if (!errorData_resource?.message?.trim().length) return;
const filterListResource = options?.config?.white?.resource;
const isFilterResource = handlerFilter(errorData_resource, filterListResource);
const isFilterUa = handlerFilterUa(options?.config?.white?.ua);
if (isFilterUa) return;
if (isFilterResource) return;
if (!getIsReportId(errorData_resource)) return;
// report
White‑Screen Detection
White‑screen issues in single‑page applications are detected with a key‑point sampling strategy based on document.elementsFromPoint. The team chose vertical‑cross sampling (points along the horizontal and vertical centre lines of the viewport) as a good balance of accuracy, complexity, and performance.
for (let i = 1; i <= 9; i += 2) {
  // Elements stacked at the i-th sample point on the horizontal centre line
  const xElements = document.elementsFromPoint((window.innerWidth * i) / 10, window.innerHeight / 2);
  // Elements stacked at the i-th sample point on the vertical centre line
  const yElements = document.elementsFromPoint(window.innerWidth / 2, (window.innerHeight * i) / 10);
}
When a white screen is first detected, a polling mechanism validates the condition before triggering a correction workflow.
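One way the sampled points could be turned into a verdict is sketched below. The container selector list, the rule that every sample must hit a container, and the function names are assumptions, not the SDK's actual logic. The document and window are passed in as parameters so the snippet can run outside a browser with stand-in objects.

```javascript
// Assumption: these selectors identify wrapper elements that carry no content.
const CONTAINER_SELECTORS = ['html', 'body', '#app'];

function isContainer(el) {
  return !el || CONTAINER_SELECTORS.some((sel) => el.matches(sel));
}

// Samples five points on each centre line and counts how many of them
// have only a container element on top.
function countEmptyPoints(doc, view) {
  let empty = 0;
  let total = 0;
  for (let i = 1; i <= 9; i += 2) {
    const points = [
      [(view.innerWidth * i) / 10, view.innerHeight / 2], // horizontal centre line
      [view.innerWidth / 2, (view.innerHeight * i) / 10], // vertical centre line
    ];
    for (const [x, y] of points) {
      const topmost = doc.elementsFromPoint(x, y)[0];
      total += 1;
      if (isContainer(topmost)) empty += 1;
    }
  }
  return { empty, total };
}

// Assumption: the page counts as blank only when every sample is empty.
function isWhiteScreen(doc, view) {
  const { empty, total } = countEmptyPoints(doc, view);
  return empty === total;
}

// Stand-ins for DOM objects; in a page, pass `document` and `window`.
const fakeEl = (selector) => ({ matches: (sel) => sel === selector });
const view = { innerWidth: 1000, innerHeight: 800 };
const blankDoc = { elementsFromPoint: () => [fakeEl('body'), fakeEl('html')] };
const renderedDoc = { elementsFromPoint: () => [fakeEl('div.card'), fakeEl('#app')] };
console.log(isWhiteScreen(blankDoc, view), isWhiteScreen(renderedDoc, view));
```

A polling step would simply call isWhiteScreen again on a timer and report only if the result stays true across several checks.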
Data Pipeline & APM Backend
Data flows through Kafka, is stored in ClickHouse, and is further processed for real‑time and offline analytics. The pipeline evolved through three stages:
Early stage: Kafka → ES (ops‑log) and Hive (BI) with limited topic expansion.
Mid stage: Added data‑governance layer (DWB) to reduce raw volume and moved offline aggregation upstream.
Current stage: Split governance into temporary (TMP) and final (DWD) tables, introduced OneService for real‑time calculations, and built a custom APM visualisation platform.
The APM platform aggregates error counts, calculates distribution metrics (browser, city, version), and computes growth ratios and earliest occurrence dates using SQL such as:
-- ${DWD_TABLE}, ${date} and ${threshold} are template placeholders
WITH t_search AS (
  SELECT * FROM ${DWD_TABLE} WHERE log_date >= '${date}'
),
t_num AS (
  -- total reports per error message
  SELECT msg, COUNT(*) AS num_report FROM t_search GROUP BY msg
),
t_agg_city AS (
  -- reports per message and city, ranked so that row_num = 1 is the top city
  SELECT msg, ip_city AS _name, COUNT(*) AS _num,
    ROW_NUMBER() OVER (PARTITION BY msg ORDER BY COUNT(*) DESC) AS row_num
  FROM t_search GROUP BY msg, ip_city
)
SELECT *, t_agg_city._num / t_num.num_report AS city_max_ratio
FROM t_num
JOIN t_agg_city ON t_agg_city.msg = t_num.msg AND t_agg_city.row_num = 1
WHERE t_agg_city._num / t_num.num_report >= ${threshold};
Health Score Algorithm
The platform computes a health score (0‑100) for each project by normalising key indicators (LCP, white‑screen error count, etc.), weighting them based on business impact, and mapping the aggregated score to a smooth curve.
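The article does not publish the formula, so the following is only a sketch of the normalise, weight, and smooth steps under stated assumptions. Every threshold, weight, and indicator name is invented for illustration, and the smoothstep polynomial stands in for whatever curve the platform actually uses.

```javascript
// All thresholds and weights below are illustrative assumptions.
const INDICATORS = {
  lcpMs: { good: 2500, bad: 6000, weight: 0.4 },
  whiteScreenRate: { good: 0, bad: 0.02, weight: 0.4 },
  jsErrorRate: { good: 0, bad: 0.05, weight: 0.2 },
};

// Linear normalisation to [0, 1], where 1 is healthy and 0 is unhealthy.
function normalise(value, { good, bad }) {
  const ratio = (bad - value) / (bad - good);
  return Math.min(1, Math.max(0, ratio));
}

// Smoothstep: an S-shaped curve on [0, 1] with zero slope at both ends.
function smooth(x) {
  return x * x * (3 - 2 * x);
}

// Weighted average of the normalised indicators, mapped onto 0-100.
function healthScore(metrics) {
  let weighted = 0;
  let totalWeight = 0;
  for (const [name, spec] of Object.entries(INDICATORS)) {
    if (!(name in metrics)) continue; // skip indicators that were not measured
    weighted += normalise(metrics[name], spec) * spec.weight;
    totalWeight += spec.weight;
  }
  if (totalWeight === 0) return null; // nothing to score
  return Math.round(smooth(weighted / totalWeight) * 100);
}

console.log(healthScore({ lcpMs: 3200, whiteScreenRate: 0.004, jsErrorRate: 0.01 }));
```

Dividing by the weight of the indicators actually present keeps the score comparable across projects that do not report every metric.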
Lifecycle Plugins
Mirror supports lifecycle plugins with before and after hooks, enabling custom logic injection per event type. Example:
class MirrorXxPlugin {
  // Before hook: resolves to a boolean for the given event type
  mirrorHandleBefore(type, data) {
    return new Promise(resolve => {
      if (type === 'error' || type === 'unhandledrejection') {
        // errorFilterList: message substrings to match, defined elsewhere
        const isMatch = errorFilterList.some(str => data?.message?.includes(str));
        resolve(!isMatch);
      } else {
        // 'resource' and all other event types
        resolve(true);
      }
    });
  }
  // After hook: nothing to do in this example
  mirrorHandleAfter() { return Promise.resolve(); }
}
const mirrorFilterPlugin = new MirrorXxPlugin();
Offline & Behaviour Logs
Behaviour logs record user actions in a local stack (default size 10); when the stack fills up without an error having occurred, it is cleared. Offline logs are persisted in IndexedDB with a 3 MB cap and evicted on a FIFO basis. Both are reported in a unified format, which helps with error correlation and root‑cause analysis.
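The two buffering policies can be sketched with in-memory stand-ins. The class names are hypothetical, the offline store here is a plain array rather than IndexedDB, and string length is used as a rough proxy for byte size.

```javascript
// Hypothetical behaviour buffer: cleared when it fills up without an error.
class BehaviourStack {
  constructor(maxSize = 10) {
    this.maxSize = maxSize;
    this.items = [];
  }
  push(action) {
    // Full and no error seen so far: the buffered actions are discarded.
    if (this.items.length >= this.maxSize) this.items = [];
    this.items.push(action);
  }
  // Called when an error occurs: hand over the trail and start afresh.
  flush() {
    const trail = this.items;
    this.items = [];
    return trail;
  }
}

// Hypothetical size-capped FIFO store (in memory, standing in for IndexedDB).
class FifoLogStore {
  constructor(maxBytes = 3 * 1024 * 1024) {
    this.maxBytes = maxBytes;
    this.bytes = 0;
    this.queue = [];
  }
  add(record) {
    const text = JSON.stringify(record);
    const size = text.length; // assumption: character count approximates bytes
    this.queue.push({ text, size });
    this.bytes += size;
    // Evict the oldest records until the cap is respected again.
    while (this.bytes > this.maxBytes && this.queue.length > 1) {
      this.bytes -= this.queue.shift().size;
    }
  }
  records() {
    return this.queue.map((entry) => JSON.parse(entry.text));
  }
}

const stack = new BehaviourStack();
for (let i = 1; i <= 11; i += 1) stack.push({ click: i });

const store = new FifoLogStore(30); // tiny cap to make eviction visible
for (let id = 1; id <= 4; id += 1) store.add({ id });
console.log(stack.items, store.records());
```

Attaching the result of flush() to an error report gives the trail of actions that led up to the failure.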
One‑Click Alarm
The original log‑based alarm incurred high real‑time computation costs. The new one‑click alarm lets front‑end developers set simple thresholds (e.g., growth ratio, minimum count) without writing PQL, and integrates with the internal alarm webhook to send notifications via enterprise WeChat.
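A threshold rule of this kind could be evaluated as sketched below. The rule shape (minCount, growthRatio), the function name, and the treatment of a zero baseline are assumptions for illustration only.

```javascript
// Hypothetical rule check: alert when the error is frequent enough and
// has grown by at least `growthRatio` relative to the previous window.
function shouldAlert(rule, currentCount, previousCount) {
  if (currentCount < rule.minCount) return false; // too few reports to matter
  if (previousCount === 0) return true; // new error that already passed minCount
  const growth = (currentCount - previousCount) / previousCount;
  return growth >= rule.growthRatio;
}

const rule = { minCount: 50, growthRatio: 0.5 }; // illustrative thresholds
console.log(
  shouldAlert(rule, 40, 10), // below minCount
  shouldAlert(rule, 90, 60), // 50% growth
  shouldAlert(rule, 80, 60), // about 33% growth
  shouldAlert(rule, 60, 0)   // no baseline
);
```

Checking the minimum count first keeps rare errors with large relative swings from generating noise.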
Future Directions
Continue SDK optimisation for Bilibili‑specific scenarios.
Iterate the health‑score algorithm.
Enhance data governance and accuracy.
Refine alarm precision and reduce noise.
Deepen performance module capabilities in APM.
Explore support for additional platforms such as mini‑programs.
The overall architecture diagram (see image below) illustrates the end‑to‑end flow from SDK capture to APM visualisation and alerting.
