
How to Build a Scalable Unified Monitoring System for Distributed Trading Platforms

This article details the design and implementation of a unified, real‑time monitoring solution for a multi‑node distributed trading system, covering data modeling, low‑intrusion data flow, Kafka‑based architecture, C++ consumer components, performance testing, and future optimization directions.

dbaplus Community

Introduction

Haitong Securities' new‑generation distributed trading system consists of many nodes and components, which creates monitoring challenges: nodes are spread across WANs, metrics are heterogeneous, and log volumes are massive. To achieve unified monitoring with minimal intrusion, the team designed a Kafka‑based publish/subscribe pipeline that collects metric logs from each process and aggregates them centrally.

Monitoring Goals

The system targets three properties: unified access for all nodes and components, real‑time observation with second‑level latency, and quantitative numeric metrics that support statistical analysis for high‑availability and low‑latency improvements.

Data Model

Each node may host n component classes, represented by m processes. Metrics are identified by a three‑tuple (group_id, idx_key, idx_subkey) where group_id denotes the component class, idx_key the metric name, and idx_subkey (often 0) refines metrics for complex components such as order‑routing channels. This unified model enables all components to publish a consistent metric list.
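As a rough illustration of the three‑tuple, the sketch below models it as a hashable key for an in‑process table. The type names (MetricKey, MetricKeyHash, MetricTable), the field types, and the string value type are assumptions made for this example; the article does not specify how the key is represented.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical key type for the (group_id, idx_key, idx_subkey) three-tuple.
// Field types are assumptions; the article does not state them.
struct MetricKey {
    std::uint32_t group_id;    // component class
    std::string   idx_key;     // metric name
    std::uint32_t idx_subkey;  // refinement, 0 when unused

    bool operator==(const MetricKey& o) const {
        return group_id == o.group_id && idx_key == o.idx_key &&
               idx_subkey == o.idx_subkey;
    }
};

// Combines the three field hashes into one value (boost-style hash mixing).
struct MetricKeyHash {
    std::size_t operator()(const MetricKey& k) const {
        std::size_t h = std::hash<std::uint32_t>{}(k.group_id);
        auto mix = [&h](std::size_t v) {
            h ^= v + static_cast<std::size_t>(0x9e3779b97f4a7c15ULL) +
                 (h << 6) + (h >> 2);
        };
        mix(std::hash<std::string>{}(k.idx_key));
        mix(std::hash<std::uint32_t>{}(k.idx_subkey));
        return h;
    }
};

// Latest value per metric; the value is kept as a string for simplicity.
using MetricTable = std::unordered_map<MetricKey, std::string, MetricKeyHash>;
```

With this shape, two order‑routing channels of the same component can report the same idx_key under different idx_subkey values without overwriting each other.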

Data Flow

To keep intrusion minimal, each process writes its metrics to an independent log file rather than mixing them with application logs. Log collectors incrementally scan these files and publish the records to a Kafka cluster. Within a LAN the logs are consumed directly. Across WANs only the processed results, not the raw logs, are synchronized to a central monitoring database, which substantially reduces bandwidth consumption.
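The incremental scan can be pictured with the minimal sketch below. It is not the collector used in the article, which is not described in detail; IncrementalLogReader and ReadNew are hypothetical names. The reader remembers the byte offset of the last complete line and, on each call, returns only lines appended since then, leaving a half‑written trailing line for the next scan.

```cpp
#include <fstream>
#include <ios>
#include <string>
#include <utility>
#include <vector>

// Hypothetical incremental reader for one metric log file.
// Assumption: the file is append-only; rotation and truncation are not handled.
class IncrementalLogReader {
public:
    explicit IncrementalLogReader(std::string path) : path_(std::move(path)) {}

    // Returns the complete lines appended since the previous call.
    std::vector<std::string> ReadNew() {
        std::vector<std::string> lines;
        std::ifstream in(path_, std::ios::binary);
        if (!in) return lines;  // file not created yet
        in.seekg(offset_);
        std::string line;
        while (std::getline(in, line)) {
            // EOF reached before '\n': the writer has not finished this line.
            if (in.eof()) break;
            offset_ += static_cast<std::streamoff>(line.size()) + 1;  // +1 for '\n'
            lines.push_back(line);
        }
        return lines;
    }

private:
    std::string path_;
    std::streamoff offset_ = 0;  // byte offset just past the last complete line
};
```

A collector built this way would call ReadNew() periodically and publish each returned line to Kafka; persisting the offset across restarts is left out of the sketch.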


Figure: Monitoring data flow diagram.

Architecture

In each LAN a log collector runs on every server, incrementally scanning metric files and publishing to Kafka. Kafka serves as the distributed message bus, and a custom C++ consumer processes the streams.


Figure: Publish/subscribe system architecture.

Key Technologies

librdkafka

The consumer uses the open‑source C library librdkafka (https://github.com/edenhill/librdkafka) to communicate with Kafka. It handles bootstrap broker configuration, topic subscription, metadata retrieval, and message fetching via rd_kafka_consumer_poll. The library also provides rebalance callbacks, though the current design runs a single consumer instance per node.

In‑memory Library

A proprietary shared‑memory library supplies fast in‑memory tables with hash and red‑black‑tree indexes, supporting transactional semantics and asynchronous persistence. It stores metric results, alarm rules, and other predictable data, while unpredictable data (e.g., alarm notifications) are persisted via thread‑safe queues.
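The shared‑memory library itself is proprietary and not shown in the article. The sketch below only illustrates the idea of keeping two indexes over the same rows, using standard containers as stand‑ins: std::unordered_map for hash lookups and std::map, which is commonly implemented as a red‑black tree, for ordered scans. MiniTable and its methods are hypothetical, and transactions, shared memory, and persistence are deliberately omitted.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

struct Row {
    std::string name;
    double value;
};

// Hypothetical single-process table with a hash index and an ordered index.
class MiniTable {
public:
    // Inserts a new row or updates the value of an existing one.
    void Upsert(const std::string& name, double value) {
        auto it = hash_index_.find(name);
        if (it != hash_index_.end()) {
            rows_[it->second].value = value;
            return;
        }
        rows_.push_back(Row{name, value});
        hash_index_[name] = rows_.size() - 1;
        ordered_index_[name] = rows_.size() - 1;
    }

    // Point lookup through the hash index; nullptr when absent.
    const Row* Find(const std::string& name) const {
        auto it = hash_index_.find(name);
        return it == hash_index_.end() ? nullptr : &rows_[it->second];
    }

    // Names in [lo, hi) in sorted order, through the ordered index.
    std::vector<std::string> Range(const std::string& lo,
                                   const std::string& hi) const {
        std::vector<std::string> out;
        for (auto it = ordered_index_.lower_bound(lo);
             it != ordered_index_.end() && it->first < hi; ++it) {
            out.push_back(it->first);
        }
        return out;
    }

private:
    std::vector<Row> rows_;
    std::unordered_map<std::string, std::size_t> hash_index_;
    std::map<std::string, std::size_t> ordered_index_;
};
```

The hash index suits exact lookups such as fetching one metric's latest value, while the ordered index suits range scans such as listing all metrics under a common prefix.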

Application Design

Message Decoder

Kafka payloads are JSON‑encoded. Decoders parse the JSON according to the source (e.g., log collector) and extract metric fields.

Metric Parser

The parser converts decoded JSON into (group_id, idx_key, idx_subkey) key‑value pairs. A base class CIParser defines a Next() interface; concrete parsers inherit it to handle component‑specific formats. Unit tests focus on the Next() method of each subclass.
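A minimal sketch of this parser hierarchy follows. Only the names CIParser and Next() come from the article; the signature of Next(), the Metric struct, and the KeyValueParser subclass are assumptions for illustration. The example also assumes the decoder has already turned the JSON payload into a flat list of name/value fields.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One parsed metric: the three-tuple key plus its value (types assumed).
struct Metric {
    std::uint32_t group_id;
    std::string   idx_key;
    std::uint32_t idx_subkey;
    std::string   value;
};

// Assumed shape of the base class; the article names only CIParser and Next().
class CIParser {
public:
    virtual ~CIParser() = default;
    // Writes the next metric into `out`; returns false when input is exhausted.
    virtual bool Next(Metric& out) = 0;
};

// Hypothetical concrete parser for a component whose decoded payload is a
// flat list of (metric name, value) fields, all with idx_subkey 0.
class KeyValueParser : public CIParser {
public:
    using Fields = std::vector<std::pair<std::string, std::string>>;

    KeyValueParser(std::uint32_t group_id, Fields fields)
        : group_id_(group_id), fields_(std::move(fields)) {}

    bool Next(Metric& out) override {
        if (pos_ >= fields_.size()) return false;
        const auto& f = fields_[pos_++];
        out = Metric{group_id_, f.first, 0, f.second};
        return true;
    }

private:
    std::uint32_t group_id_;
    Fields fields_;
    std::size_t pos_ = 0;
};
```

Because callers depend only on Next(), a unit test can feed a subclass a fixed input and check the sequence of metrics it yields, which matches the testing focus described above.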

Alarm Engine

Parsed metrics are evaluated against in‑memory alarm rules. Supported rule types include:

No rule.

Comparison rule (numeric or string).

Timeout rule (detect stale metrics).

Table‑field equality rule.

Expression‑based comparison rule.

The engine itself is stateless; alarm persistence is handled separately to allow repeated alerts.
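To make the stateless evaluation concrete, here is a small sketch covering three of the rule types listed above: no rule, numeric comparison, and timeout. The names (RuleType, AlarmRule, MetricSample, ShouldAlarm) and the rule representation are hypothetical; the article does not describe how rules are stored or evaluated.

```cpp
#include <cstdint>

// Subset of the rule types from the list above (names assumed).
enum class RuleType { kNone, kGreaterThan, kLessThan, kTimeout };

struct AlarmRule {
    RuleType     type;
    double       threshold;    // used by the comparison rules
    std::int64_t timeout_sec;  // used by the timeout rule
};

struct MetricSample {
    double       value;
    std::int64_t updated_at_sec;  // time of the last update, in seconds
};

// Stateless evaluation: the result depends only on the arguments, so the
// same rule and sample can be re-evaluated to produce repeated alerts.
inline bool ShouldAlarm(const AlarmRule& rule, const MetricSample& sample,
                        std::int64_t now_sec) {
    switch (rule.type) {
        case RuleType::kNone:
            return false;
        case RuleType::kGreaterThan:
            return sample.value > rule.threshold;
        case RuleType::kLessThan:
            return sample.value < rule.threshold;
        case RuleType::kTimeout:
            // The metric is stale if it has not been updated within the window.
            return now_sec - sample.updated_at_sec > rule.timeout_sec;
    }
    return false;
}
```

Keeping the function free of internal state means deduplication and persistence of alerts can be layered on separately, as the design above describes.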

Performance Evaluation

librdkafka Performance

Two representative components were stress‑tested without tuning librdkafka (v1.5.0). Throughput and latency were sufficient for the monitoring workload.

Figure: librdkafka performance chart.

Business Processing Performance

The custom consumer's processing time, measured from receipt of a Kafka message, was tested with single‑threaded access to the in‑memory table. Most components achieved sub‑millisecond latency; the variation between components stems from differences in metric format complexity.

Figure: Business processing latency.

Optimization Directions

Replace synchronous logging with an asynchronous logger (completed, achieving 0.7 s for 2 M records across 10 threads).

Introduce multi‑threaded metric processing to exploit parallelism.

Refine the memory‑library lock from process‑level to finer‑grained locks for better concurrency.

Design a unified alarm model to reduce rule duplication and prioritize alerts.

Tune librdkafka, which exposes more than 100 configuration parameters, for optimal throughput.

Conclusion

While Kafka‑based monitoring is not novel, this implementation demonstrates that a lightweight C++ consumer stack, combined with a well‑designed metric model and an efficient in‑memory library, can provide unified, low‑intrusion monitoring for a large‑scale distributed trading system. The same pipeline can be extended to static market data ingestion and inter‑system data exchange.

