How Prism Transformed Front‑End Monitoring at Scale: Architecture, Challenges & Insights
This article describes the design of Prism, a self‑built front‑end monitoring platform, along with the challenges the team met and how they were solved. Prism collects data through SDKs on multiple client types, processes it with Kafka, Flink, and ClickHouse, visualizes the resulting metrics, and feeds the company's A/B testing platform. The article closes with planned enhancements for broader adoption across the company.
1. Product Overview
Prism is a front‑end monitoring system developed by the Experience Technology front‑end performance team, offering performance evaluation, quality assessment, error alerts, and custom event tracking for more than 100 company projects.
2. Background
Studies have shown that page latency directly impacts revenue, so as products scale, decisions need to be backed by data. Without quantitative data, decisions tend to be biased; comprehensive A/B testing and small‑traffic mechanisms (exposing a change to a small share of traffic first) therefore become essential.
3. Exploration Process
3.1 Challenges
Building a full‑stack front‑end monitoring platform means developing SDKs for multiple client types, unifying data formats, learning technologies outside the front‑end stack, processing data under high concurrency, and keeping the clusters stable.
3.2 Solutions
3.2.1 Data Collection
Prism uses an intrusive front‑end SDK, that is, one integrated directly into application code, to collect data across many dimensions: web (dwell time, request errors, page errors), app (device, network, version, OS, request errors), and mini‑program data.
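To make the idea of a multi‑platform SDK more concrete, here is a minimal sketch of how such an SDK might represent and queue events. Everything in it is an assumption for illustration: the `MonitorEvent` field names, the `EventBuffer` class, and the batching behavior are hypothetical and are not taken from Prism's actual SDK.

```typescript
// Hypothetical event shape shared by all platforms. Field names are assumptions.
type Platform = "web" | "app" | "miniprogram";
type EventKind = "page_error" | "request_error" | "dwell_time" | "custom";

interface MonitorEvent {
  projectId: string;
  platform: Platform;
  kind: EventKind;
  timestamp: number; // epoch milliseconds
  payload: Record<string, string | number | boolean>;
}

// Queues events and hands them to `send` in batches, so that the host
// application does not issue one network request per event (assumed design).
class EventBuffer {
  private queue: MonitorEvent[] = [];
  private readonly maxBatch: number;
  private readonly send: (batch: MonitorEvent[]) => void;

  constructor(maxBatch: number, send: (batch: MonitorEvent[]) => void) {
    this.maxBatch = maxBatch;
    this.send = send;
  }

  // Record one event; flush automatically once the batch is full.
  track(event: MonitorEvent): void {
    this.queue.push(event);
    if (this.queue.length >= this.maxBatch) {
      this.flush();
    }
  }

  // Send whatever is queued. Does nothing when the queue is empty.
  flush(): void {
    if (this.queue.length === 0) {
      return;
    }
    const batch = this.queue;
    this.queue = [];
    this.send(batch);
  }
}
```

In a real SDK the `send` callback would post the batch to the ingestion service, and `flush` would typically also be called when the page or app goes to the background. Those details are left out here because the source does not describe them.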
3.2.2 Data Ingestion
Collected data follows a unified format. It is received by a multi‑node Node.js service behind a cloud load balancer (CLB), which filters out dirty data and writes the remaining records to Kafka topics for downstream processing.
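The dirty‑data filtering step could look roughly like the sketch below. It is only an illustration under stated assumptions: the field names, the allowed platform values, and the five‑minute clock‑skew limit are all hypothetical, and the Kafka write is omitted so the example stays self‑contained.

```typescript
// Hypothetical validated record. Field names are assumptions, not Prism's schema.
interface CleanEvent {
  projectId: string;
  platform: "web" | "app" | "miniprogram";
  kind: string;
  timestamp: number; // epoch milliseconds
}

const PLATFORMS = ["web", "app", "miniprogram"] as const;

// Assumed tolerance for client clocks that run ahead of the server.
const MAX_FUTURE_SKEW_MS = 5 * 60 * 1000;

// Returns a validated event, or null when the input counts as dirty data.
function parseEvent(raw: unknown, now: number): CleanEvent | null {
  if (typeof raw !== "object" || raw === null) {
    return null;
  }
  const record = raw as Record<string, unknown>;
  const projectId = record.projectId;
  const platform = record.platform;
  const kind = record.kind;
  const timestamp = record.timestamp;

  if (typeof projectId !== "string" || projectId.length === 0) return null;
  if (typeof kind !== "string" || kind.length === 0) return null;
  if (typeof platform !== "string") return null;
  const knownPlatform = PLATFORMS.find((p) => p === platform);
  if (knownPlatform === undefined) return null;
  if (typeof timestamp !== "number" || !Number.isFinite(timestamp)) return null;
  if (timestamp <= 0 || timestamp > now + MAX_FUTURE_SKEW_MS) return null;

  return { projectId: projectId, platform: knownPlatform, kind: kind, timestamp: timestamp };
}

// Keeps only the valid events from a received batch.
function filterDirty(batch: unknown[], now: number): CleanEvent[] {
  const clean: CleanEvent[] = [];
  for (const raw of batch) {
    const event = parseEvent(raw, now);
    if (event !== null) {
      clean.push(event);
    }
  }
  return clean;
}
```

Validating at the ingestion edge keeps malformed records out of the Kafka topics, so downstream jobs can assume a consistent shape.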
3.2.3 Data Cleaning
The pipeline is built on Kafka, Flink, and ClickHouse. It initially used Spark; as data volume grew and cross‑team coordination became necessary, the team replaced Spark with Flink jobs written in Scala, which delivered better performance.
3.2.4 Backend Service & Visualization
Node.js services provide more than 40 metrics and more than 20 OpenAPI endpoints, covering daily active users, interface (API) performance, custom events, and error details. The data is visualized on the Prism platform as charts for the various products.
3.2.5 Alert Service
Alerts are sent via Enterprise WeChat bots using configurable webhook URLs and custom rules (max affected users, error count, time thresholds). Clicking a notification shows error details for rapid debugging.
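As a rough illustration of how such rules might be evaluated, the sketch below checks an error count and an affected‑user count inside a time window and then builds a text message for a bot webhook. The rule fields, function names, and message wording are hypothetical. The `msgtype`/`text.content` body follows the text‑message shape that Enterprise WeChat group bots accept, but it should be checked against the official documentation before use. No HTTP request is made here.

```typescript
// Hypothetical rule shape based on the three rule types named above.
interface AlertRule {
  windowMs: number;         // time threshold: how far back to look
  maxErrorCount: number;    // alert when errors in the window exceed this
  maxAffectedUsers: number; // alert when distinct users in the window exceed this
}

interface ErrorRecord {
  userId: string;
  message: string;
  timestamp: number; // epoch milliseconds
}

interface AlertDecision {
  fire: boolean;
  errorCount: number;
  affectedUsers: number;
}

// Counts errors and distinct users inside the look-back window and
// fires when either configured limit is exceeded.
function evaluateRule(rule: AlertRule, errors: ErrorRecord[], now: number): AlertDecision {
  const recent = errors.filter(
    (e) => e.timestamp > now - rule.windowMs && e.timestamp <= now,
  );
  const users = new Set(recent.map((e) => e.userId));
  const fire =
    recent.length > rule.maxErrorCount || users.size > rule.maxAffectedUsers;
  return { fire: fire, errorCount: recent.length, affectedUsers: users.size };
}

interface WeChatTextMessage {
  msgtype: string;
  text: { content: string };
}

// Builds the JSON body for a bot text message. Sending it to the configured
// webhook URL is left to the caller.
function buildWeChatText(projectId: string, decision: AlertDecision): WeChatTextMessage {
  const content =
    "[Prism] " + projectId + ": " + decision.errorCount +
    " errors affecting " + decision.affectedUsers + " users in the current window";
  return { msgtype: "text", text: { content: content } };
}
```

Separating rule evaluation from message delivery keeps the rules easy to unit test and lets the webhook URL remain plain configuration.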
3.2.6 Overall Architecture & Maintenance
The system integrates Alertmanager, Prometheus, Grafana, Node.js, and WeChat bots. Metrics are scraped by Prometheus, visualized in Grafana, and alerts trigger automated restart scripts.
4. Data Output Capabilities
4.1 Integration with A/B Testing (Picasso)
Picasso, the company’s A/B testing platform, relies on Prism’s data for experiment analysis. Prism’s metrics support multiple marketing experiments, handling massive data spikes during peak periods.
4.2 Cross‑Department Data Collaboration
4.2.1 Data Sharing via OpenAPI
Prism provides OpenAPI endpoints for external consumption, enabling other platforms (e.g., the simulation platform) to retrieve interface call details for testing and regression.
4.2.2 Custom Data Promotion
More than 20 products have integrated custom events, notably the “优选优咪” operations and warehouse apps, using Prism data to monitor employee behavior and guide product iteration.
5. Future Plans
Prism now covers about 90% of front‑end projects, but its UI and interactions still need refinement. The team plans to migrate backend services to the company’s data‑warehouse platform, enhance data depth, and continue learning from industry‑leading monitoring tools.
Xingsheng Youxuan Technology Community
Xingsheng Youxuan Technology Official Account
