
Building a Robust Monitoring System for Securities Firms with Open‑Source Tools

This article explains why securities firms must adopt comprehensive, centralized monitoring, outlines regulatory and SLA drivers, identifies common monitoring shortcomings, and provides a step‑by‑step guide using open‑source solutions like Zabbix and Grafana to design, implement, evaluate, and continuously improve monitoring management.

dbaplus Community

Background

Rapid growth in the securities and financial markets has led firms to invest heavily in IT. Operations teams must keep information systems running safely and stably, detect problems early, locate faults quickly, and meet regulatory requirements such as the Securities and Futures Industry Information System Operations Management Specification, which mandates monitoring of data centers, networks, servers, storage, databases, and core trading applications.

Common Pain Points

Fragmented monitoring systems for different components cause alarm sprawl and high maintenance effort.

Regulatory‑required metrics are often incomplete (e.g., database tablespace, connection counts, storage battery status).

Missing business‑level indicators lead to situations where infrastructure appears healthy while the business is down.

Few visualizations follow end‑to‑end business flow, slowing root‑cause analysis.

Lack of full‑lifecycle management of monitoring items results in stale or duplicated metrics.

Inadequate quarterly evaluation of logs and alerts reduces alarm‑tuning effectiveness.

Guiding Principles

Centralized monitoring: aggregate all critical metrics across hardware, network, security devices, servers, storage, and both production and disaster‑recovery sites.

Layered object hierarchy to organize monitoring scope.

Standardized monitoring templates for each technology stack.

Closed‑loop process covering configuration, alarm handling, evaluation, and continuous improvement.

Monitoring Object Hierarchy

The hierarchy consists of four layers:

Business Data Layer: agents run SQL queries or Python scripts against application databases to collect business‑level metrics.

Application Service Layer: monitors application processes, ports, and configuration files via custom commands.

System Platform Layer: captures OS‑level metrics (CPU, memory, disk) via SNMP or agent‑based collection.

Infrastructure Communication Layer: monitors physical devices (servers, switches, UPS, HVAC) primarily via SNMP.
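A Business Data Layer check can be as small as one query wrapped in a script. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the application database, and the `orders` table, its columns, and the `CANCELLED` status value are hypothetical names, not taken from any real trading system.

```python
import sqlite3


def count_cancelled_orders(conn: sqlite3.Connection, trade_date: str) -> int:
    """Return the number of cancelled orders for one trading day."""
    row = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE status = ? AND trade_date = ?",
        ("CANCELLED", trade_date),
    ).fetchone()
    return int(row[0])


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, trade_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders (status, trade_date) VALUES (?, ?)",
        [
            ("FILLED", "2024-01-02"),
            ("CANCELLED", "2024-01-02"),
            ("CANCELLED", "2024-01-02"),
            ("CANCELLED", "2024-01-03"),
        ],
    )
    return conn


if __name__ == "__main__":
    # The monitoring agent would read this single number from stdout.
    print(count_cancelled_orders(demo_connection(), "2024-01-02"))
```

Printing one number per run keeps the script easy to plug into an agent‑side custom item; against a production database the connection would normally use a read‑only account.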

Monitoring object hierarchy diagram

Standardized Monitoring Templates

Templates are defined for Windows, Linux, MySQL, SQL Server, and Oracle based on best practices. When new services are launched, bulk‑add the relevant objects and metrics to accelerate onboarding.
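One way to picture template‑driven onboarding is as a lookup from technology stack to a fixed metric list, expanded for every new host. The sketch below is a toy model under that assumption; the metric names are illustrative placeholders rather than real Zabbix item keys, and `bulk_onboard` is a hypothetical helper, not part of any monitoring product.

```python
from dataclasses import dataclass

# Hypothetical template contents; the metric names are placeholders.
TEMPLATES: dict[str, list[str]] = {
    "Linux": ["cpu.util", "mem.used_pct", "disk.used_pct"],
    "MySQL": ["db.connections", "db.slow_queries"],
    "Oracle": ["db.tablespace_used_pct", "db.sessions"],
}


@dataclass(frozen=True)
class MonitoredItem:
    host: str
    template: str
    metric: str


def bulk_onboard(hosts: dict[str, list[str]]) -> list[MonitoredItem]:
    """Expand each host's technology stacks into concrete monitored items."""
    items: list[MonitoredItem] = []
    for host, stacks in hosts.items():
        for stack in stacks:
            if stack not in TEMPLATES:
                # Fail loudly so a new stack gets a template before go-live.
                raise KeyError(f"no template defined for {stack!r}")
            for metric in TEMPLATES[stack]:
                items.append(MonitoredItem(host, stack, metric))
    return items


if __name__ == "__main__":
    for item in bulk_onboard({"trade-db-01": ["Linux", "MySQL"]}):
        print(item)
```

Rejecting an unknown stack, instead of silently skipping it, is a deliberate choice here: a host that goes live without a template is exactly the kind of coverage gap the templates are meant to prevent.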

Monitoring Management Process

Setup & Change Management

Use a formal change‑request workflow for adding, removing, or modifying monitoring items, with documented approvals for traceability.

Alarm Handling Workflow

The workflow has four stages: detection and recording, diagnosis, remediation, and closure. Assign each alarm to an owner, check whether it is a false positive, follow the emergency runbook for high‑severity alerts, and close the ticket only after the fix has been verified.
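The four stages can be modeled as a small state machine that refuses to skip steps or to close an unverified alarm. This is a minimal sketch, not a description of any ticketing tool; the class and field names are hypothetical, and allowing a confirmed false positive to close directly from diagnosis is an assumption added for illustration.

```python
from enum import Enum


class Stage(Enum):
    DETECTED = "detection and recording"
    DIAGNOSIS = "diagnosis"
    REMEDIATION = "remediation"
    CLOSED = "closure"


# Allowed transitions. Closing straight from diagnosis (for a confirmed
# false positive) is an assumption of this sketch.
ALLOWED = {
    Stage.DETECTED: {Stage.DIAGNOSIS},
    Stage.DIAGNOSIS: {Stage.REMEDIATION, Stage.CLOSED},
    Stage.REMEDIATION: {Stage.CLOSED},
    Stage.CLOSED: set(),
}


class Alarm:
    def __init__(self, name: str, owner: str) -> None:
        self.name = name
        self.owner = owner          # every alarm is assigned to an owner
        self.stage = Stage.DETECTED
        self.verified = False       # set once the fix or false positive is confirmed

    def advance(self, target: Stage) -> None:
        """Move to the next stage, enforcing order and verification."""
        if target not in ALLOWED[self.stage]:
            raise ValueError(f"cannot go from {self.stage.name} to {target.name}")
        if target is Stage.CLOSED and not self.verified:
            raise ValueError("verify the outcome before closing")
        self.stage = target


if __name__ == "__main__":
    alarm = Alarm("order gateway port down", owner="ops-oncall")
    alarm.advance(Stage.DIAGNOSIS)
    alarm.advance(Stage.REMEDIATION)
    alarm.verified = True
    alarm.advance(Stage.CLOSED)
    print(alarm.stage.value)
```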

Evaluation & Continuous Improvement

Review capacity, availability, and threshold settings every quarter. Trend analysis of CPU, memory, network bandwidth, and business‑level metrics (e.g., transaction volume) informs expansion planning.
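A simple form of trend analysis is a straight‑line fit over periodic samples, extrapolated to a capacity limit. The sketch below assumes equally spaced samples (for example, one peak value per quarter) and a linear trend, which is a simplification; real utilization is often seasonal, and the sample numbers are made up.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for equally spaced samples."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(values: Sequence[float], limit: float) -> Optional[float]:
    """Periods until the fitted line reaches `limit`; None if not growing."""
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    current = intercept + slope * (len(values) - 1)
    if current >= limit:
        return 0.0
    return (limit - current) / slope


if __name__ == "__main__":
    quarterly_cpu_peak = [50.0, 55.0, 60.0, 65.0]  # made-up percentages
    print(periods_until(quarterly_cpu_peak, limit=80.0))
```

With these made‑up samples the fitted line gains 5 points per quarter, so the 80 % mark is about three quarters away, which is the kind of lead time an expansion plan needs.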

System Operations

Daily health checks of the monitoring platform.

Automated daily backups of configuration and data.

Periodic audit of user accounts.

Retain monitoring logs for at least one year.

Open‑Source Monitoring Stack

Architecture

Combine Zabbix (distributed data collection, processing, and alerting) with Grafana (visualization) to build a centralized, extensible monitoring solution.

Metric Categories (Google SRE)

Error metrics – e.g., failed database connections, missing middleware processes.

Latency metrics – e.g., transaction processing time > 1 s, Elasticsearch query latency > 5 s.

Traffic metrics – e.g., orders per second, network interface throughput.

Saturation metrics – e.g., CPU utilization > 80 %, disk I/O nearing capacity.
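The four categories can be attached to alert rules so that every breach is reported under its signal type. The sketch below reuses the example thresholds from the list above; the metric names are hypothetical placeholders, and treating every rule as "value greater than threshold" is a simplification.

```python
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    ERRORS = "errors"
    LATENCY = "latency"
    TRAFFIC = "traffic"
    SATURATION = "saturation"


@dataclass(frozen=True)
class Rule:
    metric: str       # hypothetical metric name
    signal: Signal
    threshold: float  # a breach means value > threshold


RULES = [
    Rule("db.failed_connections", Signal.ERRORS, 0.0),
    Rule("trade.processing_seconds", Signal.LATENCY, 1.0),
    Rule("es.query_seconds", Signal.LATENCY, 5.0),
    Rule("cpu.util_pct", Signal.SATURATION, 80.0),
]


def breaches(samples: dict[str, float]) -> dict[Signal, list[str]]:
    """Group the metrics that exceed their threshold by signal category."""
    result: dict[Signal, list[str]] = {}
    for rule in RULES:
        value = samples.get(rule.metric)
        if value is not None and value > rule.threshold:
            result.setdefault(rule.signal, []).append(rule.metric)
    return result


if __name__ == "__main__":
    current = {
        "trade.processing_seconds": 1.4,
        "es.query_seconds": 2.0,
        "cpu.util_pct": 91.0,
    }
    for signal, metrics in breaches(current).items():
        print(signal.value, metrics)
```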

Alarm Severity

Define three levels: High (urgent), Warning (important but not urgent), Info (informational). Set thresholds based on historical averages, peak values, and business impact. Adjust check intervals to meet SLA requirements (e.g., 1‑minute checks for 99.99 % availability).
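The link between an availability target and the check interval is plain arithmetic: 99.99 % availability leaves roughly 52.6 minutes of downtime per 365‑day year, or about 4.3 minutes per 30‑day month. The sketch below works that out and compares it with the worst‑case detection delay; the function names and the "consecutive failures before alerting" parameter are assumptions for illustration.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Allowed downtime in minutes for a target availability over a period."""
    return period_days * 24 * 60 * (1 - availability_pct / 100)


def worst_case_detection_minutes(check_interval_minutes: float,
                                 consecutive_failures: int = 1) -> float:
    """Longest delay before an alert fires if a fault starts just after a check."""
    return check_interval_minutes * consecutive_failures


if __name__ == "__main__":
    monthly = downtime_budget_minutes(99.99, period_days=30)
    for interval in (1, 5):
        delay = worst_case_detection_minutes(interval)
        share = delay / monthly
        print(f"{interval}-minute checks: up to {delay:.0f} min undetected, "
              f"{share:.0%} of the monthly budget")
```

Under these assumptions a 5‑minute interval can consume more than the whole monthly budget before anyone is alerted, while a 1‑minute interval leaves most of it for diagnosis and recovery.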

Implementation Methods

Agent monitoring: install the Zabbix agent on each host and use built‑in items or custom scripts (e.g., a Python query for daily order cancellations).

Trapper monitoring: use the zabbix_sender utility to push JSON data to the server without an agent (e.g., a safe‑run days counter).

SNMP monitoring: poll metrics from network devices, UPS units, and storage arrays.

Custom macros for DR: use macros as switches that enable or disable monitoring of the production and disaster‑recovery environments.
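To make the trapper method concrete, the sketch below computes a safe‑run days value and frames it the way the Zabbix sender protocol does, to the best of my understanding: a `ZBXD` header with a flags byte, a little‑endian length, and a JSON body with `"request": "sender data"`. The host name, item key, and dates are hypothetical, and nothing is sent over the network; in practice the sender utility handles this framing.

```python
import json
import struct
from datetime import date


def safe_run_days(last_incident: date, today: date) -> int:
    """Number of days since the last production incident."""
    return (today - last_incident).days


def build_sender_packet(host: str, key: str, value: object) -> bytes:
    """Frame one value as a 'sender data' request (header + JSON body)."""
    body = json.dumps(
        {
            "request": "sender data",
            "data": [{"host": host, "key": key, "value": str(value)}],
        }
    ).encode("utf-8")
    # "ZBXD", flags byte 0x01, 4-byte little-endian length, 4 reserved bytes.
    return b"ZBXD\x01" + struct.pack("<II", len(body), 0) + body


if __name__ == "__main__":
    days = safe_run_days(date(2023, 1, 1), date(2024, 1, 2))
    # "ops-dashboard" and "ops.safe_run_days" are made-up names.
    packet = build_sender_packet("ops-dashboard", "ops.safe_run_days", days)
    print(len(packet), packet[:5])
```

The item on the server side would need to be of the trapper type with a matching key, since the server only accepts pushed values for items it already knows about.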

Visualization

Use Zabbix Maps to draw topology diagrams and Grafana Status Panel plugins to present module health. Example: a financing‑margin‑trade system map where an orange panel indicates an abnormal component, enabling rapid pinpointing.

Monitoring topology example

Alarm Configuration Details

Set thresholds according to historical data, peak loads, and SLA impact. For high‑availability services (e.g., 99.99 % uptime) use a 1‑minute check interval; for lower‑criticality services a longer interval may suffice. Adjust thresholds for traffic, latency, and saturation metrics based on average and maximum values, as well as business growth trends.
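One possible way to turn history into thresholds is sketched below. The formula is an assumption made for illustration, not a rule from the regulation or from Zabbix: the High threshold sits at the historical peak, the Warning threshold sits partway between the average and the peak, and both are scaled by an expected growth factor.

```python
from statistics import mean
from typing import Sequence


def derive_thresholds(history: Sequence[float],
                      growth_factor: float = 1.0,
                      warning_ratio: float = 0.5) -> dict[str, float]:
    """Derive Warning and High thresholds from historical samples.

    Assumed rule: High = scaled peak, Warning = scaled average plus
    `warning_ratio` of the gap between average and peak.
    """
    if not history:
        raise ValueError("history must not be empty")
    avg = mean(history) * growth_factor
    peak = max(history) * growth_factor
    return {"warning": avg + (peak - avg) * warning_ratio, "high": peak}


def classify(value: float, thresholds: dict[str, float]) -> str:
    """Map a sample to one of the three severity levels."""
    if value > thresholds["high"]:
        return "High"
    if value > thresholds["warning"]:
        return "Warning"
    return "Info"


if __name__ == "__main__":
    orders_per_second = [40.0, 50.0, 60.0, 90.0]  # made-up history
    limits = derive_thresholds(orders_per_second, growth_factor=1.1)
    print(limits, classify(95.0, limits))
```

Whatever rule is chosen, the quarterly review is the natural place to re‑derive the numbers from fresh history and compare them with the thresholds currently configured.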

Example Command for EMC Storage Battery Monitoring

naviseccli -h [IP address] getcrus | grep Present | wc -l
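
The same check can be wrapped in a script so that the count reaches the monitoring system as a single number. The sketch below mirrors the pipeline above (`grep Present | wc -l`) in Python; it assumes `naviseccli` is installed on the monitoring host, and the IP address passed in stands for the `[IP address]` placeholder. What count to expect for a given array is not stated in the source and has to be established per device.

```python
import subprocess


def count_present(output: str) -> int:
    """Count lines containing 'Present', like `grep Present | wc -l`."""
    return sum(1 for line in output.splitlines() if "Present" in line)


def present_count_for_array(ip: str, timeout_seconds: int = 30) -> int:
    """Run `naviseccli -h <ip> getcrus` and count the 'Present' lines."""
    result = subprocess.run(
        ["naviseccli", "-h", ip, "getcrus"],
        capture_output=True,
        text=True,
        check=True,               # raise if the CLI exits with an error
        timeout=timeout_seconds,  # do not let a hung CLI block the agent
    )
    return count_present(result.stdout)
```

Splitting the parsing from the CLI call keeps the counting logic testable without access to a storage array.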
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
