
How a Financial Firm Built a Scalable Edge‑Stored APM System for Microservices

This article describes how a securities company addressed distributed‑system observability by building its own application performance monitoring (APM) platform. The platform supports flexible integration, dynamic metric collection, edge storage, and cross‑center synchronization, and it improved monitoring coverage, alert effectiveness, and bandwidth usage.

dbaplus Community

Jan 12, 2024

Background

As micro‑service frameworks were gradually adopted, the company's distributed architecture faced new observability challenges. Complex business logic and a large number of siloed third‑party ("chimney") systems made performance monitoring increasingly difficult, which prompted the development of a richer, more flexible monitoring solution.

Key Monitoring Indicators

Three criteria remain central to judging a monitoring system:

Coverage: the breadth of monitored targets (applications, systems, databases, middleware, hardware, network) and the depth of metrics (system, performance, logs, business).

Alert Effectiveness: noise reduction, correct routing, and auditability of alerts.

Observability: the ability to infer internal system state from logs, metrics, and tracing data.

Monitoring Coverage

Coverage is evaluated by linking CMDB inventories with monitoring metric catalogs and by checking log‑collection status for legacy systems. This approach ensures both target‑level and metric‑level completeness.
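As a rough illustration of that linkage, the sketch below joins a CMDB inventory with a metric catalog to report both target-level and metric-level gaps. The `Asset` type, the field names, and the shape of the catalog are assumptions made for this example; the article does not describe the actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """One CMDB entry (hypothetical shape)."""
    asset_id: str
    kind: str  # e.g. "application", "database", "middleware"


def coverage_report(cmdb_assets, required_metrics, collected_metrics):
    """Compare what the CMDB says exists with what monitoring actually collects.

    cmdb_assets: list of Asset taken from the CMDB inventory
    required_metrics: dict mapping asset kind -> set of metric names required
    collected_metrics: dict mapping asset_id -> set of metric names collected
    """
    unmonitored = []   # assets with no collected metrics at all (target-level gap)
    missing = {}       # asset_id -> required metrics not collected (metric-level gap)
    for asset in cmdb_assets:
        collected = collected_metrics.get(asset.asset_id)
        if not collected:
            unmonitored.append(asset.asset_id)
            continue
        gap = required_metrics.get(asset.kind, set()) - collected
        if gap:
            missing[asset.asset_id] = gap
    total = len(cmdb_assets)
    target_coverage = (total - len(unmonitored)) / total if total else 1.0
    return {
        "target_coverage": target_coverage,
        "unmonitored": unmonitored,
        "missing_metrics": missing,
    }
```

A report like this can be regenerated whenever the CMDB changes, so that newly registered assets without monitoring show up as gaps rather than going unnoticed.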

Alert Effectiveness

After achieving coverage, alerts must be de‑noised, correctly routed, and audited. Management mechanisms and collaboration tools are required to guarantee that each alert is noticed and acted upon.
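The three steps can be sketched in a few lines. The example below is a minimal sketch, not the company's implementation: it assumes a fixed suppression window for de-noising, a static routing table, and an in-memory audit list. The class name `AlertRouter`, the team names, and the 300-second window are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRouter:
    routes: dict                 # system name -> owning team (hypothetical table)
    window_seconds: int = 300    # repeats inside this window are suppressed
    default_team: str = "noc"    # fallback when no owner is registered
    _last_sent: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def handle(self, system, rule, timestamp):
        """Return the team to notify, or None if the alert is suppressed."""
        key = (system, rule)
        last = self._last_sent.get(key)
        suppressed = last is not None and timestamp - last < self.window_seconds
        team = self.routes.get(system, self.default_team)
        if not suppressed:
            self._last_sent[key] = timestamp
        # Every alert is recorded, including suppressed ones, so it can be audited.
        self.audit_log.append(
            {"system": system, "rule": rule, "timestamp": timestamp,
             "team": team, "suppressed": suppressed}
        )
        return None if suppressed else team
```

Recording suppressed alerts as well as delivered ones keeps the audit trail complete, which makes it possible to review later whether the suppression window hid anything important.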

Observability

Observability goes beyond specific metrics; it emphasizes analysis of generated data (logs, metrics, tracing) to infer system state, which is essential for troubleshooting large‑scale cloud‑native distributed systems.

Financial‑Sector Specific Requirements

Multiple technology stacks and numerous third‑party black‑box systems require a unified monitoring approach.

Data desensitization must be balanced against the need for developers and operators to collaborate on the same monitoring data.

Monitoring solutions must not jeopardize the stability of transaction‑critical systems.
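To make the desensitization requirement concrete, here is a minimal sketch of masking long digit runs (such as account or card numbers) in a log line before it is shared. The rule used here, "mask any run of eight or more digits except its last four", is an assumption chosen for illustration and is not taken from the article.

```python
import re

# Hypothetical rule: treat any run of 8+ digits as sensitive.
_LONG_NUMBER = re.compile(r"\d{8,}")


def _mask(match):
    digits = match.group(0)
    # Keep the last four digits so operators can still correlate records.
    return "*" * (len(digits) - 4) + digits[-4:]


def desensitize(line):
    """Return the log line with long digit runs masked."""
    return _LONG_NUMBER.sub(_mask, line)
```

Keeping a short suffix visible is one way to balance privacy with troubleshooting: developers and operators can still match a masked record to a ticket without seeing the full identifier.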

Construction Approach

The monitoring architecture is divided into three parts: monitoring access management, monitoring services, and the monitoring platform. Both reliable platform components and well‑defined processes are essential.

1. Monitoring Access Management

During development, clear standards are established for logging, exception handling, health checks, and event reporting, and monitoring evaluation is embedded into architecture reviews.

Baseline Configuration

A baseline defines default log collection and metric gathering for common systems and middleware, enabling automated end‑to‑end deployment.
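One way to picture such a baseline is a lookup table keyed by component type, which a deployment tool expands into a collection plan. The sketch below is illustrative only: the component names, log paths, and metric names are hypothetical, and components without a baseline entry are flagged for manual evaluation.

```python
# Hypothetical baseline: component type -> default logs and metrics.
BASELINE = {
    "linux": {"logs": ["/var/log/messages"],
              "metrics": ["cpu", "memory", "disk"]},
    "mysql": {"logs": ["/var/log/mysql/error.log"],
              "metrics": ["connections", "slow_queries"]},
    "kafka": {"logs": ["/opt/kafka/logs/server.log"],
              "metrics": ["consumer_lag", "under_replicated_partitions"]},
}


def build_collection_plan(components, baseline=BASELINE):
    """Merge baseline defaults for every component found on a host."""
    plan = {"logs": [], "metrics": [], "needs_review": []}
    for component in components:
        defaults = baseline.get(component)
        if defaults is None:
            # No baseline exists, so a person has to evaluate this component.
            plan["needs_review"].append(component)
            continue
        for section in ("logs", "metrics"):
            for item in defaults[section]:
                if item not in plan[section]:
                    plan[section].append(item)
    return plan
```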

Evaluation Stages

Beyond the baseline, a separate evaluation covers high‑availability failover, critical‑transaction monitoring, and observability. It requires dedicated personnel and processes.

2. Monitoring Services

These services package alerts, work‑order integration, and observability data for developers and operators.

Alert‑workorder integration standardizes incident handling and provides metrics on response times.

Observability support offers controlled access to logs, metrics, analysis, events, and interaction data.
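The response-time metrics that alert and work-order integration makes possible can be computed directly from ticket timestamps. The following is a minimal sketch under assumed field names (`alerted_at`, `acknowledged_at`, both in seconds); the article does not specify which metrics the company tracks.

```python
from statistics import median


def response_metrics(work_orders, sla_seconds):
    """Summarize how quickly alerts were acknowledged.

    work_orders: list of dicts with "alerted_at" and "acknowledged_at"
                 (the latter is None while the order is still unacknowledged).
    """
    delays = [
        w["acknowledged_at"] - w["alerted_at"]
        for w in work_orders
        if w.get("acknowledged_at") is not None
    ]
    unacknowledged = sum(1 for w in work_orders if w.get("acknowledged_at") is None)
    within_sla = sum(1 for d in delays if d <= sla_seconds)
    return {
        "median_ack_seconds": median(delays) if delays else None,
        # Unacknowledged orders count against the SLA ratio.
        "within_sla_ratio": within_sla / len(work_orders) if work_orders else None,
        "unacknowledged": unacknowledged,
    }
```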

3. Monitoring Platform

The platform must support multi‑language, multi‑center deployments and integrate with asset management, delivery tools, ticketing, and effectiveness analysis systems.

[Figure: Monitoring platform functional breakdown]

Technical Solution

1. Product Positioning

The company already operates a hybrid monitoring ecosystem (Zabbix, Prometheus, APM, service governance). The new APM system complements Prometheus with custom event analysis, edge storage, and flexible data pipelines.

2. System Design

The APM is a Java‑centric distributed system designed for high throughput and massive storage. It uses edge storage (BitCask) for immutable event details, reducing bandwidth consumption and enhancing privacy.
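For readers unfamiliar with BitCask, its core idea is an append-only data file plus an in-memory index (the "keydir") that maps each key to the position of its latest value. The sketch below illustrates that idea in Python for brevity; it is not the company's Java implementation, and it omits the checksums, timestamps, hint files, and merge process of a full BitCask store. The class name `MiniBitcask` and the record layout are assumptions.

```python
import os
import struct

# Record layout (assumed for this sketch): key length, value length, key, value.
_HEADER = struct.Struct(">II")


class MiniBitcask:
    def __init__(self, path):
        # "a+b": every write is appended; reads may seek anywhere.
        self._file = open(path, "a+b")
        self._keydir = {}  # key -> (value offset, value length)
        self._rebuild_keydir()

    def _rebuild_keydir(self):
        """Scan the log once at startup; later records override earlier ones."""
        self._file.seek(0)
        offset = 0
        while True:
            header = self._file.read(_HEADER.size)
            if len(header) < _HEADER.size:
                break
            key_len, value_len = _HEADER.unpack(header)
            key = self._file.read(key_len)
            value_offset = offset + _HEADER.size + key_len
            self._keydir[key] = (value_offset, value_len)
            self._file.seek(value_len, os.SEEK_CUR)  # skip the value bytes
            offset = value_offset + value_len

    def put(self, key: bytes, value: bytes):
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(_HEADER.pack(len(key), len(value)) + key + value)
        self._file.flush()
        self._keydir[key] = (offset + _HEADER.size + len(key), len(value))

    def get(self, key: bytes):
        entry = self._keydir.get(key)
        if entry is None:
            return None
        value_offset, value_len = entry
        self._file.seek(value_offset)  # one seek, one read per lookup
        return self._file.read(value_len)

    def close(self):
        self._file.close()
```

Because writes are sequential appends and each read needs a single seek, this layout suits write-heavy, immutable event details stored close to where they are produced.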

Event Analysis & Storage

Events flow through a configurable Kafka‑based pipeline with six built‑in processors. Depending on the event type, one or more processors handle the message in a defined order, and the result is persisted to one of three storage backends. One of these is the BitCask‑based edge store, which achieves roughly 10,000 QPS on a 4‑core, 8 GB machine.
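The routing logic can be sketched independently of Kafka: each event type maps to an ordered list of processors and a target backend. The example below is a simplified, in-memory illustration. The article does not name the six built-in processors or the three backends, so `tag_slow`, `strip_payload`, `metrics_store`, and `edge_store` are hypothetical stand-ins, as is the 500 ms threshold.

```python
def tag_slow(event):
    """Hypothetical processor: flag events slower than 500 ms."""
    return {**event, "slow": event.get("duration_ms", 0) > 500}


def strip_payload(event):
    """Hypothetical processor: drop the raw payload before central storage."""
    return {k: v for k, v in event.items() if k != "payload"}


# Event type -> (ordered processors, backend name). All names are assumptions.
PIPELINES = {
    "rpc_call": ([tag_slow, strip_payload], "metrics_store"),
    "trace_detail": ([tag_slow], "edge_store"),
}


def process(event, pipelines, backends):
    """Run the event through its processors in order, then persist it."""
    processors, backend_name = pipelines[event["type"]]
    for processor in processors:
        event = processor(event)
    backends[backend_name].append(event)
    return event
```

Keeping the pipeline definition as data rather than code means a new event type can be supported by adding a table entry, which is one plausible reading of "configurable" here.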

Data Query Service

Queries are defined as Velocity templates stored in Zookeeper and exposed via RPC and HTTP APIs, allowing downstream systems to retrieve metric data flexibly.
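The pattern of named, parameterized query templates can be illustrated with Python's standard `string.Template` as a stand-in for Velocity, and a plain dictionary as a stand-in for Zookeeper. The template name, the SQL-like query text, and the parameter names below are all assumptions made for this sketch.

```python
from string import Template

# Stand-in for templates stored in Zookeeper (hypothetical content).
TEMPLATE_STORE = {
    "avg_latency": Template(
        "SELECT avg(duration_ms) FROM events "
        "WHERE service = '$service' AND ts >= $start AND ts < $end"
    ),
}


def render_query(name, params, store=TEMPLATE_STORE):
    """Look up a named template and fill in its parameters.

    Raises KeyError if the template is unknown or a parameter is missing.
    """
    template = store.get(name)
    if template is None:
        raise KeyError(f"unknown query template: {name}")
    return template.substitute(params)
```

In a real service the parameters would need validation before substitution, since plain string interpolation offers no protection against malformed or malicious input.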

Cross‑Center Synchronization

To reduce inter‑data‑center bandwidth, the system implements a cross‑center call‑chain storage and query mechanism that minimizes synchronization traffic.
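The article does not detail the mechanism, but one common way to achieve this is to keep span details in the data center that produced them and share only a small index of which centers hold data for each trace. The sketch below illustrates that idea under this assumption; the `Center` class, the shared index, and the span fields are hypothetical.

```python
class Center:
    """One data center that stores its own span details locally."""

    def __init__(self, name, shared_index):
        self.name = name
        self._spans = {}            # trace_id -> list of span dicts (stays local)
        self._index = shared_index  # trace_id -> set of center names (synchronized)

    def record(self, trace_id, span):
        self._spans.setdefault(trace_id, []).append(span)
        # Only this small marker crosses data-center boundaries.
        self._index.setdefault(trace_id, set()).add(self.name)

    def local_spans(self, trace_id):
        return list(self._spans.get(trace_id, []))


def query_trace(trace_id, centers, shared_index):
    """Assemble a full call chain by asking only the centers that hold part of it."""
    spans = []
    for name in sorted(shared_index.get(trace_id, ())):
        spans.extend(centers[name].local_spans(trace_id))
    return sorted(spans, key=lambda s: s["start_ms"])
```

With this layout, span details cross the inter-center link only when someone actually queries a trace, rather than being replicated for every request.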

Results and Outlook

Since its 2018 launch, the self‑developed APM has become the standard performance‑monitoring solution within the company, complementing Prometheus and Zabbix. It has reduced network bandwidth consumption, supported multi‑center deployments, and enabled seamless integration of both self‑developed and third‑party systems. Future work will focus on real‑time fault prediction, automated analysis, and further edge‑storage innovations.
