How a Financial Firm Built a Scalable Edge‑Stored APM System for Microservices
This article describes how a securities company tackled the challenges of distributed‑system observability by designing and deploying a self‑developed application performance monitoring platform that supports flexible integration, dynamic metric collection, edge storage, and cross‑center synchronization, delivering measurable improvements in monitoring coverage, alert effectiveness, and bandwidth usage.
Background
As microservice frameworks were gradually adopted, the company's distributed architecture faced new observability challenges. Complex business logic and a large number of siloed third‑party (“chimney”) systems made performance monitoring increasingly difficult, prompting the development of a richer, more flexible monitoring solution.
Key Monitoring Indicators
The core metrics for a robust monitoring system remain:
Coverage: breadth of monitored targets (applications, systems, databases, middleware, hardware, network) and depth of metrics (system, performance, logs, business).
Alert Effectiveness: reduction of noise, proper routing, and auditability of alerts.
Observability: ability to infer internal system state from logs, metrics, and tracing data.
Monitoring Coverage
Coverage is evaluated by linking CMDB inventories with monitoring metric catalogs and by checking log‑collection status for legacy systems. This approach ensures both target‑level and metric‑level completeness.
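The target‑level and metric‑level checks described above can be sketched as a set comparison. This is a minimal illustration, assuming the CMDB inventory and the metric catalog can each be exported as plain dictionaries; the function name, the argument names, and the sample asset types are hypothetical and do not come from the company's tooling.

```python
def coverage_report(cmdb_assets, monitored, required_metrics):
    """Compare a CMDB inventory against a monitoring metric catalog.

    cmdb_assets:      asset name -> asset type (hypothetical CMDB export)
    monitored:        asset name -> set of metric names currently collected
    required_metrics: asset type -> set of metric names expected for that type
    """
    # Target-level gap: assets the CMDB knows about but monitoring does not.
    unmonitored = sorted(a for a in cmdb_assets if a not in monitored)

    # Metric-level gap: monitored assets that lack an expected metric.
    missing_metrics = {}
    for asset, asset_type in cmdb_assets.items():
        if asset not in monitored:
            continue
        gap = required_metrics.get(asset_type, set()) - monitored[asset]
        if gap:
            missing_metrics[asset] = sorted(gap)

    total = len(cmdb_assets)
    target_coverage = (total - len(unmonitored)) / total if total else 1.0
    return {
        "unmonitored": unmonitored,
        "missing_metrics": missing_metrics,
        "target_coverage": target_coverage,
    }
```

Running such a report on a schedule would surface both unmonitored assets and monitored assets with incomplete metrics in one pass.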
Alert Effectiveness
After achieving coverage, alerts must be de‑noised, correctly routed, and audited. Management mechanisms and collaboration tools are required to guarantee that each alert is noticed and acted upon.
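The three requirements (de‑noising, routing, auditing) can be illustrated with a small gateway. This is only a sketch under stated assumptions: duplicates are suppressed with a fixed time window per (source, rule) pair, routing is a static lookup table, and the audit trail is an in‑memory list. The class name, rule names, and team names are hypothetical.

```python
class AlertGateway:
    """Suppress repeated alerts, route the rest, and record every decision."""

    def __init__(self, routes, default_team, window_seconds=300):
        self.routes = routes              # rule name -> owning team
        self.default_team = default_team  # fallback when no route matches
        self.window = window_seconds      # suppression window for duplicates
        self._last_sent = {}              # (source, rule) -> time of last delivery
        self.audit_log = []               # (time, source, rule, decision, team)

    def handle(self, source, rule, timestamp):
        """Return the team that receives the alert, or None if suppressed."""
        key = (source, rule)
        last = self._last_sent.get(key)
        if last is not None and timestamp - last < self.window:
            self.audit_log.append((timestamp, source, rule, "suppressed", None))
            return None
        team = self.routes.get(rule, self.default_team)
        self._last_sent[key] = timestamp
        self.audit_log.append((timestamp, source, rule, "delivered", team))
        return team
```

Because suppressed alerts are logged as well as delivered ones, the audit trail can show how much noise was removed and who was notified for the remainder.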
Observability
Observability goes beyond specific metrics; it emphasizes analysis of generated data (logs, metrics, tracing) to infer system state, which is essential for troubleshooting large‑scale cloud‑native distributed systems.
Financial‑Sector Specific Requirements
Multiple technology stacks and numerous third‑party black‑box systems require a unified monitoring approach.
Data masking (desensitization) must be balanced against the data sharing that development and operations teams need in order to collaborate.
Monitoring solutions must not jeopardize the stability of transaction‑critical systems.
Construction Approach
The monitoring architecture is divided into three parts: monitoring access management, monitoring services, and the monitoring platform. Both reliable platform components and well‑defined processes are essential.
1. Monitoring Access Management
During development, clear standards are established for logging, exception handling, health checks, and event reporting, and monitoring evaluation is embedded into architecture reviews.
Baseline Configuration
A baseline defines default log collection and metric gathering for common systems and middleware, enabling automated end‑to‑end deployment.
Evaluation Stages
Beyond the baseline, dedicated evaluation assesses high‑availability failover, critical transaction monitoring, and observability, requiring dedicated personnel and processes.
2. Monitoring Services
These services package alerts, work‑order integration, and observability data for developers and operators.
Integrating alerts with work orders standardizes incident handling and provides metrics on response times.
Observability support offers controlled access to logs, metrics, analysis, events, and interaction data.
3. Monitoring Platform
The platform must support multi‑language, multi‑center deployments and integrate with asset management, delivery tools, ticketing, and effectiveness analysis systems.
Technical Solution
1. Product Positioning
The company already operates a hybrid monitoring ecosystem (Zabbix, Prometheus, APM, service governance). The new APM system complements Prometheus with custom event analysis, edge storage, and flexible data pipelines.
2. System Design
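The APM is a Java‑centric distributed system designed for high throughput and massive storage. It uses edge storage based on Bitcask for immutable event details: because the details stay on the node where they were produced, less data crosses the network, which reduces bandwidth consumption and enhances privacy.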
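The Bitcask model behind the edge store is an append‑only log file paired with an in‑memory key directory (keydir) that maps each key to the position of its latest value. The sketch below shows only that core idea. It is not the company's implementation: the class name and record layout are assumptions, and it omits the CRC, timestamp, hint files, and log merging that a real Bitcask store provides.

```python
import os
import struct
import tempfile


class MiniBitcask:
    """Append-only value log plus an in-memory key directory (keydir)."""

    _HEADER = struct.Struct(">II")  # key length, value length (assumed layout)

    def __init__(self, path):
        self._file = open(path, "a+b")  # reads allowed, writes always append
        self._keydir = {}               # key -> (value offset, value length)
        self._rebuild()

    def _rebuild(self):
        """Scan the log once so the keydir points at the newest value per key."""
        self._file.seek(0)
        offset = 0
        while True:
            header = self._file.read(self._HEADER.size)
            if len(header) < self._HEADER.size:
                break
            klen, vlen = self._HEADER.unpack(header)
            key = self._file.read(klen)
            self._file.seek(vlen, os.SEEK_CUR)  # skip the value itself
            self._keydir[key] = (offset + self._HEADER.size + klen, vlen)
            offset += self._HEADER.size + klen + vlen

    def put(self, key: bytes, value: bytes):
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(self._HEADER.pack(len(key), len(value)) + key + value)
        self._file.flush()
        self._keydir[key] = (offset + self._HEADER.size + len(key), len(value))

    def get(self, key: bytes):
        entry = self._keydir.get(key)
        if entry is None:
            return None
        offset, length = entry
        self._file.seek(offset)
        return self._file.read(length)

    def close(self):
        self._file.close()


# Usage: a reopened store rebuilds its keydir from the log file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.log")
    store = MiniBitcask(path)
    store.put(b"trace-1", b"first detail")
    store.put(b"trace-1", b"latest detail")
    store.close()
    reopened = MiniBitcask(path)
    recovered = reopened.get(b"trace-1")
    reopened.close()
```

Writes are sequential appends and each read needs a single seek, which is why this layout suits write‑heavy, immutable event details on modest hardware.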
Event Analysis & Storage
Events flow through a configurable Kafka‑based pipeline with six built‑in processors. Depending on the event type, one or more processors handle each message in a defined order before it is persisted to one of three storage backends. One of these is the Bitcask‑based edge store, which achieves about 10,000 QPS on a 4‑core, 8 GB machine.
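The idea of an ordered, per‑event‑type processor chain can be sketched as follows. The article does not name the six built‑in processors or the three backends, so the three processors here (`drop_debug`, `add_duration`, `mask_account`) and the list‑based sinks are hypothetical stand‑ins, and plain function calls replace the Kafka transport.

```python
from typing import Callable, Dict, List, Optional

Event = dict
Processor = Callable[[Event], Optional[Event]]  # returning None drops the event


def drop_debug(event: Event) -> Optional[Event]:
    """Filter out debug-level events (hypothetical processor)."""
    return None if event.get("level") == "debug" else event


def add_duration(event: Event) -> Event:
    """Derive a duration field from start and end times (hypothetical)."""
    enriched = dict(event)
    enriched["duration_ms"] = enriched["end_ms"] - enriched["start_ms"]
    return enriched


def mask_account(event: Event) -> Event:
    """Keep only the last four characters of an account (hypothetical)."""
    masked = dict(event)
    if "account" in masked:
        masked["account"] = "***" + str(masked["account"])[-4:]
    return masked


class Pipeline:
    def __init__(self, chains: Dict[str, List[Processor]]):
        self.chains = chains               # event type -> ordered processors
        self.sinks: Dict[str, list] = {}   # event type -> stand-in backend

    def handle(self, event: Event) -> bool:
        """Run the chain for the event's type; return True if it was stored."""
        event_type = event["type"]
        for processor in self.chains.get(event_type, []):
            event = processor(event)
            if event is None:
                return False
        self.sinks.setdefault(event_type, []).append(event)
        return True
```

Keeping the chain in configuration rather than code means a new event type only needs a new entry in `chains`, which matches the flexibility the pipeline is meant to provide.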
Data Query Service
Queries are defined as Apache Velocity templates stored in ZooKeeper and exposed via RPC and HTTP APIs, allowing downstream systems to retrieve metric data flexibly.
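Template‑driven querying can be sketched with Python's standard `string.Template`, whose `$name` placeholders resemble Velocity references. Everything concrete here is an assumption made for illustration: the in‑memory `TEMPLATES` dictionary stands in for the ZooKeeper‑backed registry, and the `slow_calls` template, its `call_stats` table, and its parameters are invented.

```python
from string import Template

# Stand-in for the template registry (kept in ZooKeeper in the article).
TEMPLATES = {
    "slow_calls": Template(
        "SELECT service, avg_ms FROM call_stats "
        "WHERE service = '$service' AND avg_ms > $threshold_ms"
    ),
}


def render_query(name, **params):
    """Fill a named query template with caller-supplied parameters."""
    template = TEMPLATES[name]
    # Reject values that could alter the structure of the rendered query.
    for key, value in params.items():
        if not str(value).replace("-", "").replace("_", "").isalnum():
            raise ValueError(f"unsafe value for {key!r}: {value!r}")
    # substitute() raises KeyError if a placeholder has no matching parameter.
    return template.substitute(params)
```

With this arrangement a new query is published by adding a template to the registry, and callers only pass a template name and parameters through the RPC or HTTP layer.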
Cross‑Center Synchronization
To reduce inter‑data‑center bandwidth, the system implements a cross‑center call‑chain storage and query mechanism that minimizes synchronization traffic.
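The article does not describe the mechanism in detail, so the following is one plausible arrangement rather than the company's design: each data center keeps span details locally, only a small index of which centers hold spans for a trace is shared, and a query fetches details on demand from just those centers. The class, function, and center names are hypothetical.

```python
class Center:
    """One data center: span details stay local, only the index is shared."""

    def __init__(self, name, shared_index):
        self.name = name
        self._spans = {}            # trace id -> list of span dicts (local only)
        self._index = shared_index  # trace id -> set of center names (synced)

    def record(self, trace_id, span):
        self._spans.setdefault(trace_id, []).append(span)
        # Only this small marker would need to cross data centers.
        self._index.setdefault(trace_id, set()).add(self.name)

    def local_spans(self, trace_id):
        return list(self._spans.get(trace_id, []))


def query_trace(trace_id, centers, shared_index):
    """Assemble a call chain by asking only the centers that hold its spans."""
    spans = []
    for name in sorted(shared_index.get(trace_id, ())):
        spans.extend(centers[name].local_spans(trace_id))
    return sorted(spans, key=lambda s: s["start_ms"])
```

Under this assumption the steady‑state synchronization cost grows with the number of traces rather than with the volume of span detail, and full details cross centers only when someone actually queries a trace.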
Results and Outlook
Since its 2018 launch, the self‑developed APM has become the standard performance‑monitoring solution within the company, complementing Prometheus and Zabbix. It has reduced network bandwidth consumption, supported multi‑center deployments, and enabled seamless integration of both self‑developed and third‑party systems. Future work will focus on real‑time fault prediction, automated analysis, and further edge‑storage innovations.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
