Bilibili Front‑End Error Monitoring: Architecture, SDK, White‑Screen Detection and Data Governance
Bilibili’s front‑end team built a custom “mirror” SDK and full‑stack monitoring platform that captures JavaScript and resource errors, detects white‑screens, logs user behavior offline, routes data through Kafka‑ClickHouse pipelines to visual dashboards, and provides one‑click alerts, now serving over 1,700 projects across 85% of business lines.
Since 2023, the Bilibili front-end team has built a complete error-monitoring solution covering SDK collection, data governance, dashboard integration and an APM visualisation layer. As of August 2024 the system runs in more than 85% of business lines and over 1,700 projects; 210+ projects are integrated into the APM platform, and more than 300 one-click alarm configurations are in place.
Why a custom solution?
Many developers would reach for Sentry, but Bilibili needed tighter integration with its own data pipeline and customisable reporting. The in-house SDK (named mirror) provides separate business and technical reporting channels, supports legacy scripts, and enables multi-dimensional analysis, visual dashboards, one-click alerts and full traceability.
SDK Overview
The mirror SDK has evolved over a year and a half. Its main capabilities are error capture, resource-error handling, white-screen detection, behaviour logging, offline logging and a plug-in mechanism for lifecycle extensions.
Error capture
The browser already offers global handlers. The SDK registers one listener for uncaught errors and one for unhandled promise rejections. The error listener is registered in the capture phase because resource-loading errors do not bubble up to window:
window.addEventListener('error', (error) => {
  // analyse → report
}, true);
window.addEventListener('unhandledrejection', (rejection) => {
  // analyse → report
});
After an error is caught, the SDK distinguishes between JavaScript runtime errors and resource-loading errors. The following snippet shows the type-discrimination logic:
export const handleJsError = ev => {
  const target = ev.target;
  if (!target?.localName) {
    // JS runtime error: the event target is window (or absent), which has no localName
  } else {
    // Resource loading error: the target is the failing element (img, script, link, ...)
  }
};
export const handlerRejection = ev => {};

Resource-error handling
The SDK uses localName to identify the resource type and src / href to obtain the failing URL before constructing the payload for reporting.
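The payload step described above might look roughly like the following. This is a minimal sketch: the function name buildResourceErrorPayload and the payload fields (type, tag, url, message) are assumptions made for illustration, not the mirror SDK's actual schema.

```javascript
// Hypothetical helper: field names are illustrative, not the mirror SDK's schema.
function buildResourceErrorPayload(ev) {
  const target = ev && ev.target;
  // Window-level JS errors have no localName on their target, so they are skipped here.
  if (!target || !target.localName) return null;
  // <img>, <script>, <audio> and <video> expose src; <link> exposes href.
  const url = target.src || target.href || '';
  return {
    type: 'resource',
    tag: target.localName,
    url,
    message: `Failed to load <${target.localName}>: ${url}`,
  };
}

// Usage with a plain object standing in for a real error event.
const payload = buildResourceErrorPayload({
  target: { localName: 'img', src: 'https://example.com/a.png' },
});
console.log(payload);
```

Returning null for events without a localName keeps this helper safe to call from the same listener that handles JavaScript runtime errors.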
Stack parsing
For JavaScript errors the SDK relies on the lightweight error-stack-parser library (≈2.2 KB gzipped) to extract the file name, line, column and raw text of the top stack frame:
let stackFrame = ErrorStackParser.parse(!target ? ev : ev.error)[0];
let { fileName, columnNumber, lineNumber, source } = stackFrame;
const stack = source ? JSON.stringify(source.split('').join('')).split('./') : '';

Error filtering
To avoid duplicate reports and to respect business-level white-lists, the SDK derives an ID for each error with window.btoa and keeps the IDs already reported in an in-memory list:
// IDs of the errors already reported in the current page session
const ERROR_ID = [];
export function getErrorId(val) {
  // Percent-encode first so that btoa only sees ASCII; btoa throws on characters outside Latin-1
  return window.btoa(encodeURIComponent(String(val)));
}
const getIsReportId = error => {
  const id = getErrorId(error?.message || error?.fileName);
  if (ERROR_ID.includes(id)) {
    console.warn(`Duplicate error, not reported, ${error?.message}`);
    return false;
  }
  ERROR_ID.push(id);
  return true;
};

Configuration is fetched from an internal KV platform and merged (base + page + top) to produce the final filter set. Example of merging:
let config = deepMerge(baseConfig, pageConfig);
config = deepMerge(config, topConfig);
return new Promise((resolve, reject) => { /* … */ });
During reporting, the SDK checks the white-list rules for resources, UA strings and so on, and aborts the report if any rule matches.
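Neither deepMerge nor the matching helpers are shown in the article. As a rough sketch of how the layered configuration and a rule check could fit together, the following assumes that rule lists are arrays of substrings or RegExp objects and that arrays from different layers are concatenated. Both assumptions, like the matchesAnyRule name, are illustrative and may differ from the SDK's actual behaviour.

```javascript
function isPlainObject(v) {
  return v !== null && typeof v === 'object' && !Array.isArray(v);
}

// Hypothetical deepMerge: later layers override earlier ones, and arrays are
// concatenated so that rule lists from every layer survive (an assumption).
function deepMerge(base, override) {
  const out = { ...base };
  for (const [key, value] of Object.entries(override || {})) {
    const current = out[key];
    if (Array.isArray(current) && Array.isArray(value)) {
      out[key] = [...current, ...value];
    } else if (isPlainObject(current) && isPlainObject(value)) {
      out[key] = deepMerge(current, value);
    } else {
      out[key] = value;
    }
  }
  return out;
}

// Hypothetical matcher: a rule is a substring or a RegExp tested against the message.
function matchesAnyRule(message, rules = []) {
  return rules.some(rule =>
    rule instanceof RegExp ? rule.test(message) : message.includes(rule)
  );
}

const baseConfig = { white: { resource: ['favicon.ico'] } };
const pageConfig = { white: { resource: [/\.gif$/] } };
const topConfig = { white: { ua: ['Googlebot'] } };

let merged = deepMerge(baseConfig, pageConfig);
merged = deepMerge(merged, topConfig);

const gifIgnored = matchesAnyRule('Failed to load https://example.com/a.gif', merged.white.resource);
const jsIgnored = matchesAnyRule('Failed to load https://example.com/app.js', merged.white.resource);
console.log(gifIgnored, jsIgnored);
```

In the SDK's own reporting path, the checks run in this order: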
let errorData_resource = handleResourceError(ev);
// Nothing to report if no payload or an empty message was produced
if (!errorData_resource || !errorData_resource.message.trim().length) return;
const filterListResource = options?.config?.white?.resource;
const isFilterResource = handlerFilter(errorData_resource, filterListResource);
const isFilterUa = handlerFilterUa(options?.config?.white?.ua);
// Skip errors matched by the UA or resource white-list
if (isFilterUa) return;
if (isFilterResource) return;
// Skip errors that were already reported
if (!getIsReportId(errorData_resource)) return;
// finally report

White-screen detection
Because an SPA can be left showing a blank page after a crash, the team adopted a “vertical sampling” method based on document.elementsFromPoint. The algorithm samples points on a ten-step grid along the horizontal and vertical centre lines of the viewport; the loop below steps by two, so it visits five positions per axis (10%, 30%, 50%, 70% and 90%) and collects the elements found at each point:
for (let i = 1; i <= 9; i += 2) {
  // Elements stacked at points along the horizontal centre line
  const xElements = document.elementsFromPoint((window.innerWidth * i) / 10, window.innerHeight / 2);
  // Elements stacked at points along the vertical centre line
  const yElements = document.elementsFromPoint(window.innerWidth / 2, (window.innerHeight * i) / 10);
}
The elements found at the sampled points are then examined for expected classes, IDs or tags to decide whether a white-screen has occurred.
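One way to turn the sampling loop into a decision is sketched below. It keeps the loop bounds of the snippet above and treats the page as blank only when the topmost element at every sampled point is a bare container. The function name isWhiteScreen, the injected elementsFromPoint parameter and the isWrapper predicate (here: html, body or an element with id "app") are assumptions for illustration; the article does not show the SDK's actual rule.

```javascript
// Hypothetical sketch: `isWrapper` decides whether an element is only a page container.
// `elementsFromPoint` is injected so the logic can run outside a browser.
function isWhiteScreen({ elementsFromPoint, width, height, isWrapper }) {
  let emptyPoints = 0;
  let totalPoints = 0;
  for (let i = 1; i <= 9; i += 2) {
    const samples = [
      elementsFromPoint((width * i) / 10, height / 2), // horizontal centre line
      elementsFromPoint(width / 2, (height * i) / 10), // vertical centre line
    ];
    for (const elements of samples) {
      totalPoints += 1;
      // The first entry is the topmost element at that point.
      const top = elements[0];
      if (!top || isWrapper(top)) emptyPoints += 1;
    }
  }
  return emptyPoints === totalPoints;
}

const isWrapper = el => ['html', 'body'].includes(el.localName) || el.id === 'app';

// Stubbed pages: one where only containers are hit, one with a rendered <nav> on the left half.
const blank = isWhiteScreen({
  elementsFromPoint: () => [{ localName: 'div', id: 'app' }, { localName: 'body' }],
  width: 1280,
  height: 720,
  isWrapper,
});
const rendered = isWhiteScreen({
  elementsFromPoint: x => (x < 640 ? [{ localName: 'nav', id: '' }] : [{ localName: 'body' }]),
  width: 1280,
  height: 720,
  isWrapper,
});
console.log(blank, rendered);
```

In a browser, the same function could be called with `(x, y) => document.elementsFromPoint(x, y)`, `window.innerWidth` and `window.innerHeight`.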
Behaviour and offline logs
The SDK records user actions, page lifecycle events, click/scroll history and request outcomes. Behaviour logs are kept in localStorage (default size 10) and are cleared when the buffer fills without an error having occurred. Offline logs are persisted in IndexedDB (max 3 MB) and follow a FIFO policy. Both logs can be correlated with error reports for root-cause analysis.
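The two retention policies can be illustrated with a small storage-agnostic sketch. It reads "cleared when full without errors" as "start a fresh buffer once the capacity is reached and no error has consumed the trail", which is one possible interpretation. The names createBehaviourLog, drain and trimFifo, the storage key, and the use of string length as a size estimate are all assumptions for illustration.

```javascript
// Hypothetical sketch; `storage` only needs getItem/setItem, like localStorage.
function createBehaviourLog(storage, { key = 'behaviour_log', capacity = 10 } = {}) {
  const read = () => JSON.parse(storage.getItem(key) || '[]');
  const write = entries => storage.setItem(key, JSON.stringify(entries));
  return {
    record(entry) {
      let entries = read();
      // Buffer is full and no error has drained it: start over.
      if (entries.length >= capacity) entries = [];
      entries.push(entry);
      write(entries);
    },
    // Called when an error is reported: hand over the trail and reset the buffer.
    drain() {
      const entries = read();
      write([]);
      return entries;
    },
  };
}

// FIFO trim for offline logs: drop the oldest entries until the serialized size fits.
function trimFifo(entries, maxSize) {
  const kept = [...entries];
  const size = list => JSON.stringify(list).length; // character count as a rough size estimate
  while (kept.length > 0 && size(kept) > maxSize) kept.shift();
  return kept;
}

// In-memory stand-in for localStorage so the sketch runs outside a browser.
const memory = new Map();
const fakeStorage = {
  getItem: k => (memory.has(k) ? memory.get(k) : null),
  setItem: (k, v) => memory.set(k, v),
};

const log = createBehaviourLog(fakeStorage);
for (let i = 1; i <= 12; i++) log.record({ type: 'click', seq: i });
const trail = log.drain();
console.log(trail.map(e => e.seq)); // entries recorded since the last clear
```

A real offline log would write to IndexedDB asynchronously; the trimming logic stays the same once the entries have been read.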
Data pipeline & visualization
Collected data flows through Kafka, is written to ClickHouse (CK) and is then consumed by the internal APM platform. The pipeline supports real-time metric aggregation, offline BI tables (DWD, DWB, DWS) and a One-Service SQL engine that can return aggregation results and clustering tags in a single query. Example of a multi-CTE query used for aggregation:
WITH t_search AS (
  SELECT * FROM ${DWD_TABLE} WHERE log_date >= "date"
),
t_num AS (
  SELECT msg, COUNT(*) AS num_report FROM t_search GROUP BY msg
),
t_agg_city AS (
  SELECT msg, ip_city AS _name, COUNT(*) AS _num,
         ROW_NUMBER() OVER (PARTITION BY msg ORDER BY COUNT(*) DESC) AS row_num
  FROM t_search GROUP BY msg, ip_city
)
SELECT *, t_agg_city._num / t_num.num_report AS city_ratio
FROM t_num
JOIN t_agg_city ON t_agg_city.msg = t_num.msg AND t_agg_city.row_num = 1
WHERE t_agg_city._num / t_num.num_report >= "threshold";
These queries power dashboards that show error distribution by browser, city, version and stack file type, as well as derived indices such as “black-industry index”, “danger index” and “repair index”.
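To make the logic of the multi-CTE query easier to follow, here is the same aggregation expressed over in-memory rows: count reports per message, find the city with the most reports for each message, and keep the messages whose top city accounts for at least a given share. The function name topCityRatios and the sample rows are invented for illustration; only the column names msg and ip_city come from the query.

```javascript
// Hypothetical in-memory equivalent of the query; field names follow the SQL columns.
function topCityRatios(rows, threshold) {
  const totals = new Map(); // msg -> total reports (t_num)
  const perCity = new Map(); // msg -> Map(city -> count) (t_agg_city)
  for (const { msg, ip_city } of rows) {
    totals.set(msg, (totals.get(msg) || 0) + 1);
    if (!perCity.has(msg)) perCity.set(msg, new Map());
    const cities = perCity.get(msg);
    cities.set(ip_city, (cities.get(ip_city) || 0) + 1);
  }
  const result = [];
  for (const [msg, numReport] of totals) {
    // Equivalent of row_num = 1: the city with the highest count for this message.
    let topCity = null;
    let topCount = 0;
    for (const [city, count] of perCity.get(msg)) {
      if (count > topCount) {
        topCity = city;
        topCount = count;
      }
    }
    const cityRatio = topCount / numReport;
    if (cityRatio >= threshold) result.push({ msg, numReport, topCity, topCount, cityRatio });
  }
  return result;
}

const rows = [
  { msg: 'TypeError: x is undefined', ip_city: 'Shanghai' },
  { msg: 'TypeError: x is undefined', ip_city: 'Shanghai' },
  { msg: 'TypeError: x is undefined', ip_city: 'Shanghai' },
  { msg: 'TypeError: x is undefined', ip_city: 'Beijing' },
  { msg: 'Script error.', ip_city: 'Shanghai' },
  { msg: 'Script error.', ip_city: 'Beijing' },
];
const concentrated = topCityRatios(rows, 0.6);
console.log(concentrated);
```

An error whose reports cluster in one city, as the first message does here, is the kind of signal a clustering tag on the dashboard would surface.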
One-click alarm
The original log-based alarm was heavyweight and required PQL expertise. The new one-click alarm lets front-end owners configure simple thresholds (e.g., 24-hour growth rate, minimum count) and receive notifications via an internal webhook that forwards to enterprise WeChat.
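A threshold rule of this kind could be evaluated as in the sketch below. The rule shape (minCount, growthRate), the function name shouldAlert and the handling of a zero baseline are assumptions for illustration; the article does not describe the platform's real configuration fields.

```javascript
// Hypothetical rule evaluation: alert only when the volume is high enough
// and has grown by at least `growthRate` compared with the previous 24 hours.
function shouldAlert({ currentCount, previousCount }, { minCount, growthRate }) {
  if (currentCount < minCount) return false;
  // With no baseline, any volume at or above minCount is treated as growth.
  if (previousCount === 0) return true;
  const growth = (currentCount - previousCount) / previousCount;
  return growth >= growthRate;
}

// At least 50 errors and +50% versus the previous 24 hours.
const rule = { minCount: 50, growthRate: 0.5 };
const spike = shouldAlert({ currentCount: 120, previousCount: 60 }, rule);
const tooFew = shouldAlert({ currentCount: 40, previousCount: 10 }, rule);
const steady = shouldAlert({ currentCount: 66, previousCount: 60 }, rule);
console.log(spike, tooFew, steady);
```

Combining a minimum count with a growth rate keeps low-traffic projects from alerting on a handful of errors while still catching sudden regressions.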
Future roadmap
The team plans to keep improving the mirror SDK, refine the health-score algorithm, enhance data governance, reduce false-positive alerts and extend support to mini-programs and other platforms.
In summary, Bilibili’s front‑end monitoring system has progressed from a prototype to a production‑grade platform that covers error capture, white‑screen detection, behaviour logging, data pipelines, visual analytics and automated alerting, providing a solid foundation for reliable front‑end operations at massive scale.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.