Beidou Frontend Monitoring System: Architecture, Challenges, and Solutions

The article details the design, architecture, and operational challenges of the Beidou frontend monitoring platform at 58 Group, covering SDK management, behavior trace logging, front‑back link integration, performance optimizations, minute‑level alerting, and permission management.

58 Tech

The Beidou frontend monitoring system is 58 Group's foundational infrastructure for online quality monitoring across the full frontend technology stack, aggregating traffic, performance, JavaScript errors, API and resource anomalies, and providing a data platform for internal services.

Background: increasing business scenarios, traffic load, and system scale have driven the need for comprehensive monitoring, especially as H5 pages face fragmentation across apps, devices, frameworks, and backend services.

System architecture consists of five layers: SDK Layer (JSSDK and RNSDK with native data), Data Collect Service Layer (receives and cleans data with dynamic sampling), Storage Layer (uses Druid for pre‑aggregated data and Elasticsearch for detailed logs), Core Service Layer (alert analysis, sampling rate calculation, third‑party log integration, cluster‑deployed Node services), and Web Dashboard (permission, project, log query, and visualization).

SDK Fragmentation Management – Three solutions were evaluated. Solution 1 uses a dynamic loader script:

<script type="text/javascript" crossorigin="anonymous" src="https://j1.58cdn.com.cn/beidou-sdk/browser/bundle.lazyload.js"></script>

The loader resolves the SDK version at runtime, so updates need no business-side release, but it introduces late injection and extra network requests. Solution 2 installs the SDK as an npm dependency with a caret (^) version range; picking up a new version still requires a business-side release. Solution 3 embeds a fixed CDN script without versioning:

<script type="text/javascript" crossorigin="anonymous" src="https://j1.58cdn.com.cn/beidou-sdk/browser/bundle.min.js"></script>

Because the URL never changes, an update can be forced by invalidating the CDN cache. This approach was chosen.

Additional measures include pre‑designing incremental SDK features, gray‑release testing, and an internal IM group for release announcements.

Behavior Trace Logging – Supports web and app environments, collecting seven log types (performance, JS errors, custom logs, interactions, API, resources, hybrid calls). Integration strategies are “connect” (reading WMDA identifiers and merging logs) and “leverage” (using the existing Wlog component in the app factory). Logs are decrypted, transformed, and stored in Elasticsearch.
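The "connect" strategy can be pictured as a join on a shared identifier followed by a sort on time. The sketch below is illustrative only: the field names `traceId`, `type`, and `ts`, and the `page_view` log type, are assumptions and not Beidou's or WMDA's actual schema.

```javascript
// Hypothetical log shape: { traceId, type, ts }. Field names are assumptions
// made for illustration, not the real Beidou or WMDA schema.
function mergeBehaviorTrace(beidouLogs, wmdaLogs, traceId) {
  return [...beidouLogs, ...wmdaLogs]
    .filter((log) => log.traceId === traceId) // keep one user's session
    .sort((a, b) => a.ts - b.ts);             // order by event time
}

const beidouLogs = [
  { traceId: "u1", type: "js_error", ts: 300 },
  { traceId: "u2", type: "api", ts: 120 },
  { traceId: "u1", type: "performance", ts: 100 },
];
const wmdaLogs = [{ traceId: "u1", type: "page_view", ts: 200 }];

const timeline = mergeBehaviorTrace(beidouLogs, wmdaLogs, "u1");
console.log(timeline.map((l) => l.type).join(" -> "));
// performance -> page_view -> js_error
```

Merging at read time like this keeps the two log sources independent; only the shared identifier has to agree.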

Front‑Back Link – The JSSDK injects an sw8 header for each API request, which WTrace captures on the backend, enabling full‑stack performance and request tracing while employing sampling to limit load.
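As a rough illustration of the header injection, the sketch below builds an sw8 value using the eight-field layout of the SkyWalking cross-process propagation header (v3) and attaches it through a fetch wrapper. The helper names `buildSw8` and `withSw8`, the context values, and the choice to express sampling through the header's sample flag are all assumptions; the article does not describe how the JSSDK implements this.

```javascript
// Base64 helper for browsers (btoa) and Node (Buffer); ASCII input is assumed.
const b64 = (s) =>
  typeof btoa === "function" ? btoa(s) : Buffer.from(s, "utf8").toString("base64");

// sw8 layout (SkyWalking cross-process header, v3), eight fields joined by "-":
// sample, trace ID, segment ID, span ID, service, instance, endpoint, peer.
function buildSw8(ctx) {
  return [
    ctx.sampled ? "1" : "0",
    b64(ctx.traceId),
    b64(ctx.segmentId),
    String(ctx.spanId),
    b64(ctx.service),
    b64(ctx.instance),
    b64(ctx.endpoint),
    b64(ctx.peer),
  ].join("-");
}

// Wrap a fetch-like function so every request carries sw8. The sample flag
// tells the backend whether to record the trace (an assumed sampling scheme).
function withSw8(fetchImpl, baseCtx, sampleRate, random = Math.random) {
  return (url, init = {}) => {
    const ctx = { ...baseCtx, sampled: random() < sampleRate, peer: new URL(url).host };
    const headers = { ...(init.headers || {}), sw8: buildSw8(ctx) };
    return fetchImpl(url, { ...init, headers });
  };
}

// Usage with a fake fetch that simply returns the headers it was given.
const fakeFetch = (url, init) => init.headers;
const demoCtx = {
  traceId: "trace-1", segmentId: "segment-1", spanId: 0,
  service: "beidou-demo", instance: "browser", endpoint: "/list",
};
// A fixed "random" value keeps the example deterministic: 0.05 < 0.1, so sampled.
const tracedFetch = withSw8(fakeFetch, demoCtx, 0.1, () => 0.05);
const sentHeaders = tracedFetch("https://api.example.com/items");
console.log(sentHeaders.sw8.split("-").length); // 8
```

Standard Base64 never emits "-", so the eight fields can always be split apart again on the backend.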

Performance Optimizations for Aggregation Pages – Dynamic hourly sampling caps stored performance data at a configurable 1 million points per project. The rate is calculated as:

hourly_sampling_rate = 1,000,000 / total_project_performance_last_24h

Sampling is performed in the data‑collect service using Redis‑stored rates. This cut Druid daily storage from 540 million to 120 million rows and halved query latency.
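A minimal sketch of the sampling calculation and the per-log decision follows. A `Map` stands in for Redis, the function names are hypothetical, and capping the rate at 1 for low-traffic projects is an assumption the article does not state explicitly.

```javascript
const TARGET_POINTS = 1000000; // per-project cap taken from the article

// Rate for the coming hour, from the project's volume over the last 24 hours.
function hourlySamplingRate(totalLast24h, target = TARGET_POINTS) {
  if (totalLast24h <= target) return 1; // assumed cap: low-traffic projects keep everything
  return target / totalLast24h;
}

// A Map stands in for Redis here: projectId -> current sampling rate.
const rateStore = new Map();

function refreshRate(projectId, totalLast24h) {
  rateStore.set(projectId, hourlySamplingRate(totalLast24h));
}

// Called by the collect service for each incoming performance log.
function shouldStore(projectId, random = Math.random) {
  const rate = rateStore.has(projectId) ? rateStore.get(projectId) : 1;
  return random() < rate;
}

refreshRate("100", 5000000);
console.log(rateStore.get("100")); // 0.2
```

With 5 million points in the last 24 hours, roughly one log in five is kept, which brings the stored volume back toward the 1 million target.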

SQL query consolidation further improved speed. The original approach used separate queries for PV and for each error count; the optimized single query is:

SELECT
  TIME_FLOOR(__time, 'PT24H', NULL, 'Asia/Shanghai') AS "timestamp",
  COUNT(DISTINCT pid) AS pv,
  COUNT(content) / COUNT(DISTINCT pid) AS exceptionAvg,
  COUNT(resourceUrl) / COUNT(DISTINCT pid) AS resAvg,
  COUNT(apiUrl) / COUNT(DISTINCT pid) AS apiAvg
FROM hdp_ubu_tech_wei_beidou_data
WHERE __time >= '2020-11-12 00:00:00+08:00'
  AND __time <= '2020-11-12 23:59:59+08:00'
  AND projectId = '100'
GROUP BY 1

This reduces four queries to one and cuts the average Node API time from 1573 ms to 781 ms (≈50% improvement).

Minute‑Level Alerting – After evaluating Flink‑based streaming versus cron jobs, the team adopted a distributed scheduled‑task model using Redlock, where each metric runs an independent minute‑level task across the cluster. Alert logic reads recent monitoring data, fetches threshold configurations from MySQL, and compares values, with Druid serving as the fast aggregation source.
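The scheduling model can be sketched as follows: every worker in the cluster wakes up each minute, but only the one that wins the lock for a given metric and minute evaluates the rule. The code is a simplified illustration, not the team's implementation. `FakeRedisNode` is an in-memory stand-in for a Redis node, the key format is hypothetical, and the acquisition step keeps only Redlock's majority rule, omitting its random lock values and elapsed-time validity check.

```javascript
// In-memory stand-in for one Redis node supporting "SET key value NX PX ttl".
class FakeRedisNode {
  constructor() { this.store = new Map(); }
  setNxPx(key, value, ttlMs, now) {
    const cur = this.store.get(key);
    if (cur && cur.expiresAt > now) return false; // still held by someone
    this.store.set(key, { value, expiresAt: now + ttlMs });
    return true;
  }
  releaseIfOwner(key, value) {
    const cur = this.store.get(key);
    if (cur && cur.value === value) this.store.delete(key);
  }
}

// Simplified Redlock: the lock counts only if a majority of nodes grant it;
// otherwise any partial grants are rolled back.
function tryAcquire(nodes, key, owner, ttlMs, now = Date.now()) {
  const granted = nodes.filter((n) => n.setNxPx(key, owner, ttlMs, now)).length;
  if (granted > nodes.length / 2) return true;
  nodes.forEach((n) => n.releaseIfOwner(key, owner));
  return false;
}

// One task per metric per minute. `value` would come from Druid and
// `threshold` from MySQL; both are passed in directly here.
function runMinuteTask(nodes, worker, metric, minute, value, threshold) {
  const key = `alert:${metric}:${minute}`; // hypothetical key format
  if (!tryAcquire(nodes, key, worker, 60000)) return "skipped";
  return value > threshold ? "alert" : "ok";
}

const lockNodes = [new FakeRedisNode(), new FakeRedisNode(), new FakeRedisNode()];
const firstRun = runMinuteTask(lockNodes, "worker-a", "js_error_rate", 26750000, 0.08, 0.05);
const secondRun = runMinuteTask(lockNodes, "worker-b", "js_error_rate", 26750000, 0.08, 0.05);
console.log(firstRun, secondRun); // alert skipped
```

Keying the lock by metric and minute lets different metrics run in parallel on different workers while guaranteeing each check fires at most once per minute.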

Permission Management – Extends RBAC with an organizational hierarchy. The "static" model isolates admins and members by organization, which reduces registration overhead and enables organization‑wide data reports. The "dynamic" model synchronizes organizational changes semi‑automatically and allows projects to be transferred between organizations, similar to GitLab’s namespace transfer.
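One way to picture organization-scoped RBAC is a role granted on an organization that applies to every project in that organization's subtree. The sketch below is hypothetical throughout: the organization names, permission strings, and the rule that grants flow down the hierarchy are assumptions, since the article does not specify them.

```javascript
// Hypothetical organization tree (child -> parent); names are illustrative only.
const parentOf = new Map([
  ["growth-frontend", "user-growth"],
  ["user-growth", "58-group"],
]);

// True when `org` is `ancestor` itself or sits anywhere below it.
function isWithin(org, ancestor) {
  for (let cur = org; cur !== undefined; cur = parentOf.get(cur)) {
    if (cur === ancestor) return true;
  }
  return false;
}

const rolePermissions = {
  admin: ["project:read", "project:write", "member:manage"],
  member: ["project:read"],
};

// A grant is a role scoped to an organization; it is assumed to cover all
// projects inside that organization's subtree.
function can(user, permission, project) {
  return user.grants.some(
    (g) => isWithin(project.org, g.org) && rolePermissions[g.role].includes(permission)
  );
}

const alice = { grants: [{ org: "user-growth", role: "admin" }] };
const demoProject = { id: "100", org: "growth-frontend" };
console.log(can(alice, "project:write", demoProject)); // true

// Transferring the project to another organization changes who can reach it.
demoProject.org = "58-group";
console.log(can(alice, "project:write", demoProject)); // false
```

Because access is derived from the project's current organization, a transfer needs no per-user permission rewrite.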

In summary, the Beidou team continuously addresses scaling, fragmentation, and maintainability challenges, plans further feature refinements, and invites community collaboration. The article concludes with a recruitment notice for 58 Group’s user‑growth frontend team.
