How Bilibili Built a Scalable Front‑End Error Monitoring System from Scratch

This article details Bilibili's end‑to‑end front‑end error monitoring solution, covering the custom SDK, error capture and classification, unique ID generation, filtering, white‑screen detection, data pipelines, APM visualisation, lifecycle plugins, one‑click alerts, and future roadmap, all backed by real‑world metrics and code examples.

Architect

Background

Since 2023 the Bilibili front‑end team has iteratively built a comprehensive error‑monitoring platform, now running in over 85% of business lines and more than 1,700 projects. By early 2024 the APM platform integrated 210+ projects, and the one‑click alarm feature serves over 300 projects.

Why Build a Custom Solution?

Although Sentry is a popular choice, Bilibili required tighter integration with its data pipeline and customisable capabilities such as:

Self‑hosted SDK with separate business and technical reporting channels, supporting legacy scripts.

Fine‑grained data cleaning, filtering, and multi‑dimensional analysis.

Custom visual dashboards, one‑click alerts, traceability, and direct internal platform integration.

SDK Overview (bili‑mirror)

The bili‑mirror SDK, after more than a year of iteration, provides the following core functions:

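// Runtime errors and resource loading errors.
// Resource error events do not bubble, so the listener is registered
// in the capture phase (third argument `true`) to see them on window.
window.addEventListener('error', (event) => {
  // analyse error → report
}, true);

// Unhandled Promise rejections
window.addEventListener('unhandledrejection', (event) => {
  // analyse error → report
});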

Errors are classified into JavaScript runtime errors and resource loading errors. The SDK distinguishes them as follows:

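export const handleJsError = ev => {
  const target = ev.target;
  if (target?.localName) {
    // Resource loading error: the target is the failing element (img, script, link, ...)
  } else {
    // JS runtime error: the target is window, which has no localName
  }
};

// Handler for unhandledrejection events (body omitted in the source)
export const handlerRejection = ev => {};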

Resource Error Handling

Resource errors are identified by the target element's localName (e.g., img, script, link) and the failing src or href; these are then assembled into a report payload.
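As a rough illustration of the classification and payload assembly described above, the sketch below turns an error event into a plain object. The function name `buildErrorPayload` and the field names (`type`, `tag`, `url`, `fileName`, and so on) are assumptions made for this example, not bili-mirror's actual schema.

```javascript
// Hypothetical helper: field names are illustrative, not the SDK's real schema.
function buildErrorPayload(ev) {
  const target = ev && ev.target;
  const tag = target && target.localName; // e.g. "img", "script", "link"

  if (tag) {
    // Resource loading error: the event target is the failing element.
    const url = target.src || target.href || '';
    return { type: 'resource', tag, url, message: `Failed to load <${tag}>: ${url}` };
  }

  // JS runtime error: the target is window (no localName),
  // and the details live on the ErrorEvent itself.
  return {
    type: 'js',
    message: (ev && ev.message) || 'Unknown error',
    fileName: (ev && ev.filename) || '',
    line: (ev && ev.lineno) || 0,
    column: (ev && ev.colno) || 0,
  };
}
```

Passing the raw event from a capture-phase `error` listener into a helper like this keeps the listener itself small and makes the classification easy to test with mock events.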

Unique Error ID Generation

To avoid reporting the same error repeatedly within a session, the SDK derives a deduplication ID by Base64-encoding (window.btoa) the error message, falling back to the file name:

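export function getErrorId(val) {
  // btoa throws on characters outside Latin-1 (e.g. Chinese messages), so encode to UTF-8 bytes first
  const bytes = new TextEncoder().encode(String(val));
  return window.btoa(Array.from(bytes, b => String.fromCharCode(b)).join(''));
}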

// IDs already reported in this session (page lifetime)
const ERROR_ID = [];

const getIsReportId = error => {
  const id = getErrorId(error?.message || error?.fileName);
  if (ERROR_ID.includes(id)) {
    console.warn(`Duplicate error, not reported: ${error?.message}`);
    return false;
  }
  ERROR_ID.push(id);
  return true;
};
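An array-based lookup grows without bound and is scanned linearly on every error. One possible variation, sketched here under assumptions, keeps the seen keys in a `Set` with a size cap. The names `createErrorDeduper` and `maxSize` are hypothetical, and the key is used directly rather than Base64-encoded to keep the example short.

```javascript
// Hypothetical variant of the dedup check: a capped Set instead of an array.
function createErrorDeduper(maxSize = 200) {
  const seen = new Set();

  return function shouldReport(error) {
    const key = String((error && (error.message || error.fileName)) || '');
    if (seen.has(key)) return false; // already reported in this session

    if (seen.size >= maxSize) {
      // A Set iterates in insertion order, so the first value is the oldest key.
      seen.delete(seen.values().next().value);
    }
    seen.add(key);
    return true;
  };
}
```

The trade-off is that an evicted key can be reported again later, which is usually acceptable for a per-session duplicate guard.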

Filtering Configuration

Filtering logic merges base, page, and top‑level KV configurations, then applies whitelist/blacklist rules for resources, user‑agents, and custom messages before reporting.

// Merge order: base, then page, then top-level KV
let config = deepMerge(baseConfig, pageConfig);
config = deepMerge(config, topConfig);

// Inside the resource-error branch of the error handler:
const errorData_resource = handleResourceError(ev);
if (!errorData_resource?.message?.trim()) return;

// Drop events matched by the user-agent or resource filter lists
if (handlerFilterUa(options?.config?.white?.ua)) return;
if (handlerFilter(errorData_resource, options?.config?.white?.resource)) return;

// Drop errors already reported in this session
if (!getIsReportId(errorData_resource)) return;
// report
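The source does not show how the merge and filter helpers are implemented. The sketch below is one plausible reading, with all behaviour assumed: later configurations override earlier ones, nested objects merge recursively, arrays are replaced wholesale, and a filter rule is either a substring or a `RegExp`. The name `matchesAnyRule` is hypothetical and stands in for the kind of check `handlerFilter` might perform.

```javascript
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Assumed semantics: keys in `override` win; nested plain objects merge recursively.
function deepMerge(base, override) {
  const out = { ...base };
  for (const [key, value] of Object.entries(override || {})) {
    if (key === '__proto__') continue; // guard against prototype pollution
    out[key] = isPlainObject(value) && isPlainObject(out[key])
      ? deepMerge(out[key], value)
      : value;
  }
  return out;
}

// Assumed rule format: a string matches as a substring, a RegExp via test().
function matchesAnyRule(errorData, rules = []) {
  const message = (errorData && errorData.message) || '';
  return rules.some(rule =>
    rule instanceof RegExp ? rule.test(message) : message.includes(rule));
}
```

With these semantics a page-level configuration can replace a single filter list without having to restate the rest of the base configuration.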

White‑Screen Detection

White-screen issues in SPAs are detected with a key-point sampling strategy based on document.elementsFromPoint. The team chose to sample along the horizontal and vertical centre lines of the viewport, which gives a good balance of accuracy, complexity, and performance.

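// Sample 5 points on each centre line, at 10%, 30%, 50%, 70% and 90% of the viewport
for (let i = 1; i <= 9; i += 2) {
  const xElements = document.elementsFromPoint((window.innerWidth * i) / 10, window.innerHeight / 2);
  const yElements = document.elementsFromPoint(window.innerWidth / 2, (window.innerHeight * i) / 10);
  // e.g. inspect xElements[0] / yElements[0] (the topmost elements) to decide whether the point is empty
}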

When a white screen is first detected, a polling mechanism validates the condition before triggering a correction workflow.
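The detection and re-check steps can be sketched as follows. This is a minimal illustration, not the SDK's implementation: `getElementsAt` stands in for `document.elementsFromPoint` so the logic runs outside a browser, `isWrapper` is a hypothetical predicate for container nodes that count as empty, and the retry count and interval are made-up defaults.

```javascript
// Returns true when every sampled point shows only wrapper elements (or nothing).
function isWhiteScreen(getElementsAt, width, height, isWrapper) {
  let emptyPoints = 0;
  let totalPoints = 0;
  for (let i = 1; i <= 9; i += 2) {
    const samples = [
      getElementsAt((width * i) / 10, height / 2), // horizontal centre line
      getElementsAt(width / 2, (height * i) / 10), // vertical centre line
    ];
    for (const elements of samples) {
      totalPoints += 1;
      const top = elements[0]; // elementsFromPoint lists the topmost element first
      if (!top || isWrapper(top)) emptyPoints += 1;
    }
  }
  return emptyPoints === totalPoints;
}

// Re-runs `check` up to `retries` times; resolves true only if it never recovers.
function confirmWhiteScreen(check, { retries = 3, intervalMs = 1000 } = {}) {
  return new Promise(resolve => {
    let remaining = retries;
    const tick = () => {
      if (!check()) return resolve(false); // content appeared, not a white screen
      remaining -= 1;
      if (remaining <= 0) return resolve(true);
      setTimeout(tick, intervalMs);
    };
    tick();
  });
}

// Browser usage (assumed wrapper rule):
// const check = () => isWhiteScreen(
//   (x, y) => document.elementsFromPoint(x, y),
//   window.innerWidth, window.innerHeight,
//   el => ['html', 'body'].includes(el.localName) || el.id === 'app');
// confirmWhiteScreen(check).then(isWhite => { /* report if true */ });
```

Polling before reporting avoids false positives on pages that are simply slow to render their first content.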

Data Pipeline & APM Backend

Data flows through Kafka, is stored in ClickHouse, and is further processed for real‑time and offline analytics. The pipeline evolved through three stages:

Early stage: Kafka → ES (ops‑log) and Hive (BI) with limited topic expansion.

Mid stage: Added data‑governance layer (DWB) to reduce raw volume and moved offline aggregation upstream.

Current stage: Split governance into temporary (TMP) and final (DWD) tables, introduced OneService for real‑time calculations, and built a custom APM visualisation platform.

The APM platform aggregates error counts, calculates distribution metrics (browser, city, version), and computes growth ratios and earliest occurrence dates using SQL such as:

-- ${START_DATE} and ${THRESHOLD} are placeholders, like ${DWD_TABLE}
WITH t_search AS (
  SELECT * FROM ${DWD_TABLE} WHERE log_date >= '${START_DATE}'
),
t_num AS (
  SELECT msg, COUNT(*) AS num_report FROM t_search GROUP BY msg
),
t_agg_city AS (
  SELECT msg, ip_city AS _name, COUNT(*) AS _num,
         ROW_NUMBER() OVER (PARTITION BY msg ORDER BY COUNT(*) DESC) AS row_num
  FROM t_search GROUP BY msg, ip_city
)
SELECT t_num.msg, t_num.num_report, t_agg_city._name, t_agg_city._num,
       t_agg_city._num / t_num.num_report AS city_max_ratio
FROM t_num
JOIN t_agg_city ON t_agg_city.msg = t_num.msg AND t_agg_city.row_num = 1
WHERE t_agg_city._num / t_num.num_report >= ${THRESHOLD};
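To make the query's logic concrete, the sketch below reproduces the same aggregation in JavaScript over an in-memory array: count reports per message, find the city with the most reports for each message, and keep messages whose top city accounts for at least a given share. The row shape (`msg`, `ip_city`) follows the query; the function name `topCityRatio` and the output field names are assumptions for illustration.

```javascript
function topCityRatio(rows, threshold) {
  const totals = new Map();  // msg -> total reports (the t_num step)
  const perCity = new Map(); // msg -> Map(city -> count) (the t_agg_city step)

  for (const { msg, ip_city } of rows) {
    totals.set(msg, (totals.get(msg) || 0) + 1);
    if (!perCity.has(msg)) perCity.set(msg, new Map());
    const cities = perCity.get(msg);
    cities.set(ip_city, (cities.get(ip_city) || 0) + 1);
  }

  const result = [];
  for (const [msg, total] of totals) {
    // Equivalent of row_num = 1: the city with the highest count for this msg.
    let topCity = null;
    let topCount = 0;
    for (const [city, count] of perCity.get(msg)) {
      if (count > topCount) {
        topCity = city;
        topCount = count;
      }
    }
    const ratio = topCount / total;
    if (ratio >= threshold) {
      result.push({ msg, num_report: total, city: topCity, ratio });
    }
  }
  return result;
}
```

A high ratio suggests an error is concentrated in one city, which can point to a regional CDN or network problem rather than a code defect.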

Health Score Algorithm

The platform computes a health score (0‑100) for each project by normalising key indicators (LCP, white‑screen error count, etc.), weighting them based on business impact, and mapping the aggregated score to a smooth curve.
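The article does not give the actual indicators, weights, or curve, so the following is only a sketch of the general shape of such a calculation. Every number and name here is hypothetical: the indicator keys, the `good`/`bad` bounds, the weights, and the choice of a smoothstep function as the "smooth curve".

```javascript
// Hypothetical indicator table: keys, bounds and weights are made up for illustration.
const INDICATORS = [
  { key: 'lcpMs', good: 2500, bad: 6000, weight: 0.4 },
  { key: 'whiteScreenPerThousandPv', good: 0, bad: 5, weight: 0.4 },
  { key: 'jsErrorPerThousandPv', good: 0, bad: 20, weight: 0.2 },
];

// Maps a raw value to [0, 1], where 1 means "at or better than good".
function normalise(value, good, bad) {
  const t = (bad - value) / (bad - good);
  return Math.min(1, Math.max(0, t));
}

function healthScore(metrics, indicators = INDICATORS) {
  let weighted = 0;
  let totalWeight = 0;
  for (const { key, good, bad, weight } of indicators) {
    weighted += normalise(metrics[key], good, bad) * weight;
    totalWeight += weight;
  }
  const t = weighted / totalWeight;
  // Smoothstep: flattens the extremes so small changes near 0 or 100 move the score less.
  const smooth = t * t * (3 - 2 * t);
  return Math.round(smooth * 100);
}
```

Keeping the indicator table as data makes it straightforward to re-weight indicators as the algorithm is iterated.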

Lifecycle Plugins

Mirror supports lifecycle plugins with before and after hooks, enabling custom logic injection per event type. Example:

// errorFilterList: message substrings to drop, defined by the business code
class MirrorXxPlugin {
  // Resolves true to continue reporting, false to drop the event
  async mirrorHandleBefore(type, data) {
    if (type === 'error' || type === 'unhandledrejection') {
      const isMatch = errorFilterList.some(str => data?.message?.includes(str));
      return !isMatch;
    }
    // 'resource' and all other event types pass through unchanged
    return true;
  }
  mirrorHandleAfter() { return Promise.resolve(); }
}
const mirrorFilterPlugin = new MirrorXxPlugin();
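How the SDK invokes these hooks is not shown in the source. One way a host could drive them, assuming that a `false` result from any before hook vetoes the report, is sketched below. The functions `runBeforeHooks` and `reportWithPlugins` and the `send` callback are hypothetical names.

```javascript
// Runs each plugin's before hook in order; stops at the first veto.
async function runBeforeHooks(plugins, type, data) {
  for (const plugin of plugins) {
    if (typeof plugin.mirrorHandleBefore !== 'function') continue;
    const shouldContinue = await plugin.mirrorHandleBefore(type, data);
    if (shouldContinue === false) return false; // a plugin dropped the event
  }
  return true;
}

// Sends the event only if no plugin vetoed it, then runs the after hooks.
async function reportWithPlugins(plugins, type, data, send) {
  if (!(await runBeforeHooks(plugins, type, data))) return false;
  await send(type, data);
  for (const plugin of plugins) {
    if (typeof plugin.mirrorHandleAfter === 'function') {
      await plugin.mirrorHandleAfter(type, data);
    }
  }
  return true;
}
```

Running the hooks sequentially keeps plugin ordering deterministic, at the cost of waiting for each hook before the next one starts.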

Offline & Behaviour Logs

Behaviour logs record user actions in a local stack (default size 10); if the stack fills up without an error having occurred, it is cleared. Offline logs are persisted in IndexedDB with a 3 MB cap and a FIFO eviction policy. Both are reported through a unified channel, which helps correlate errors with the actions that preceded them and supports root-cause analysis.
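The two buffering policies can be sketched in memory as follows. This is an illustration under assumptions, not the SDK's code: a plain array stands in for IndexedDB, entry size is approximated by the length of the JSON string rather than true bytes, and the class names `BehaviourStack` and `FifoLogStore` are hypothetical.

```javascript
// Keeps the most recent user actions; discards them when full and no error occurred.
class BehaviourStack {
  constructor(maxSize = 10) {
    this.maxSize = maxSize;
    this.items = [];
  }
  push(action) {
    if (this.items.length >= this.maxSize) this.items = []; // full without an error: clear
    this.items.push(action);
  }
  // Called when an error occurs: hand over the trail and start again.
  flush() {
    const trail = this.items;
    this.items = [];
    return trail;
  }
}

// Size-capped store that evicts the oldest entries first (FIFO).
class FifoLogStore {
  constructor(maxBytes = 3 * 1024 * 1024) {
    this.maxBytes = maxBytes;
    this.entries = [];
    this.bytes = 0;
  }
  add(entry) {
    const size = JSON.stringify(entry).length; // approximation, not exact bytes
    this.entries.push({ entry, size });
    this.bytes += size;
    while (this.bytes > this.maxBytes && this.entries.length > 1) {
      this.bytes -= this.entries.shift().size; // drop the oldest entry
    }
  }
}
```

On an error, the reporter would call `flush()` and attach the returned trail to the error payload, so the two logs can be read together.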

One‑Click Alarm

The original log‑based alarm incurred high real‑time computation costs. The new one‑click alarm lets front‑end developers set simple thresholds (e.g., growth ratio, minimum count) without writing PQL, and integrates with the internal alarm webhook to send notifications via enterprise WeChat.
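A threshold rule of this kind might be evaluated along the following lines. The rule shape (`minCount`, `growthRatio`), the function name `shouldAlert`, and the handling of a zero baseline are all assumptions for illustration; the source does not describe the platform's actual rule format.

```javascript
// current / previous: error counts in the current and the preceding time window.
function shouldAlert(rule, current, previous) {
  // Ignore low-volume noise regardless of how fast it grows.
  if (current < rule.minCount) return false;

  // No baseline to compare against: treat any count above the minimum as a spike.
  if (previous === 0) return true;

  const growth = (current - previous) / previous;
  return growth >= rule.growthRatio;
}
```

For example, with `{ minCount: 50, growthRatio: 0.5 }`, a jump from 60 to 120 errors triggers an alert, while a jump from 10 to 40 does not because the count stays below the minimum.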

Future Directions

Continue SDK optimisation for Bilibili‑specific scenarios.

Iterate the health‑score algorithm.

Enhance data governance and accuracy.

Refine alarm precision and reduce noise.

Deepen performance module capabilities in APM.

Explore support for additional platforms such as mini‑programs.

The overall architecture diagram (see image below) illustrates the end‑to‑end flow from SDK capture to APM visualisation and alerting.
