Building a Robust Monitoring System for Securities Firms with Open‑Source Tools
This article explains why securities firms must adopt comprehensive, centralized monitoring, outlines regulatory and SLA drivers, identifies common monitoring shortcomings, and provides a step‑by‑step guide using open‑source solutions like Zabbix and Grafana to design, implement, evaluate, and continuously improve monitoring management.
Background
Rapid growth of the securities and financial markets has led firms to invest heavily in IT. Operations teams must ensure safe, stable operation of information systems, detect problems early, locate faults quickly, and meet regulatory requirements such as the Securities and Futures Industry Information System Operations Management Specification, which mandates monitoring of data centers, networks, servers, storage, databases, and core trading applications.
Common Pain Points
Fragmented monitoring systems for different components cause alarm sprawl and high maintenance effort.
Regulatory‑required metrics are often incomplete (e.g., database tablespace, connection counts, storage battery status).
Missing business‑level indicators lead to situations where infrastructure appears healthy while the business is down.
Few visualizations follow end‑to‑end business flow, slowing root‑cause analysis.
Lack of full‑lifecycle management of monitoring items results in stale or duplicated metrics.
Inadequate quarterly evaluation of logs and alerts reduces alarm‑tuning effectiveness.
Guiding Principles
Centralized monitoring: aggregate all critical metrics across hardware, network, security devices, servers, storage, and both production and disaster‑recovery sites.
Layered object hierarchy to organize monitoring scope.
Standardized monitoring templates for each technology stack.
Closed‑loop process covering configuration, alarm handling, evaluation, and continuous improvement.
Monitoring Object Hierarchy
The hierarchy consists of four layers:
Business Data Layer: agents run SQL queries or Python scripts against application databases to collect business‑level metrics.
Application Service Layer: monitors application processes, ports, and configuration files via custom commands.
System Platform Layer: captures OS‑level metrics (CPU, memory, disk) via SNMP or agent‑based collection.
Infrastructure Communication Layer: monitors physical devices (servers, switches, UPS, HVAC) primarily via SNMP.
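As an illustration of the Business Data Layer, the sketch below shows roughly what a collection script could look like. It is not the article's actual script: the `orders` table, its `trade_date` and `status` columns, and the in‑memory SQLite database standing in for the application database are all assumptions made so the example runs on its own.

```python
import sqlite3


def count_cancelled_orders(conn: sqlite3.Connection, trade_date: str) -> int:
    """Return the number of cancelled orders recorded for one trading day."""
    row = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE trade_date = ? AND status = 'CANCELLED'",
        (trade_date,),
    ).fetchone()
    return row[0]


def build_demo_db() -> sqlite3.Connection:
    """Create a throwaway database with a hypothetical schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, trade_date TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders (trade_date, status) VALUES (?, ?)",
        [
            ("2024-01-02", "FILLED"),
            ("2024-01-02", "CANCELLED"),
            ("2024-01-02", "CANCELLED"),
            ("2024-01-03", "CANCELLED"),
        ],
    )
    return conn


if __name__ == "__main__":
    # A monitoring agent would read this single number from standard output.
    print(count_cancelled_orders(build_demo_db(), "2024-01-02"))
```

In a real deployment the connection would point at the application database through its own driver, and the query would follow that system's schema.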
Standardized Monitoring Templates
Templates are defined for Windows, Linux, MySQL, SQL Server, and Oracle based on best practices. When new services are launched, bulk‑add the relevant objects and metrics to accelerate onboarding.
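One way to picture template‑driven onboarding is as a lookup from technology stack to a fixed list of metrics. The sketch below is only a model of that idea, not Zabbix's template mechanism: the metric names and the host name are hypothetical placeholders rather than real item keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    name: str
    items: tuple[str, ...]  # hypothetical metric names, not real item keys


TEMPLATES = {
    "linux": Template("Linux", ("cpu.util", "mem.available", "disk.used_pct")),
    "mysql": Template("MySQL", ("db.connections", "db.slow_queries")),
    "oracle": Template("Oracle", ("db.tablespace_used_pct", "db.sessions")),
}


def onboard(hosts: dict[str, list[str]]) -> dict[str, list[str]]:
    """Expand {host: [stack, ...]} into {host: [metric, ...]} using the templates."""
    plan: dict[str, list[str]] = {}
    for host, stacks in hosts.items():
        metrics: list[str] = []
        for stack in stacks:
            # An unknown stack raises KeyError, so a missing template is noticed early.
            metrics.extend(TEMPLATES[stack].items)
        plan[host] = metrics
    return plan


if __name__ == "__main__":
    print(onboard({"trade-db-01": ["linux", "mysql"]}))
```

Keeping the templates in one place means a new service only needs to declare which stacks it runs on.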
Monitoring Management Process
Setup & Change Management
Use a formal change‑request workflow for adding, removing, or modifying monitoring items, with documented approvals for traceability.
Alarm Handling Workflow
The workflow has four stages: detection and recording, diagnosis, remediation, and closure. Assign each alarm to an owner, check whether it is a false positive, follow emergency runbooks for high‑severity alerts, and close tickets only after the fix has been verified.
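The four stages can be modelled as a small state machine that refuses to close an alarm before verification. This is a minimal sketch under assumed names (`Alarm`, `Stage`, the `verified` flag); a real ticketing system would carry far more state.

```python
from enum import Enum


class Stage(Enum):
    DETECTED = "detection and recording"
    DIAGNOSIS = "diagnosis"
    REMEDIATION = "remediation"
    CLOSED = "closure"


# Allowed forward transitions, in the order the workflow describes.
_NEXT = {
    Stage.DETECTED: Stage.DIAGNOSIS,
    Stage.DIAGNOSIS: Stage.REMEDIATION,
    Stage.REMEDIATION: Stage.CLOSED,
}


class Alarm:
    def __init__(self, summary: str, owner: str) -> None:
        self.summary = summary
        self.owner = owner
        self.stage = Stage.DETECTED
        self.verified = False  # set to True once the fix has been confirmed

    def advance(self) -> Stage:
        """Move to the next stage; closing requires a verified fix."""
        if self.stage is Stage.CLOSED:
            raise ValueError("alarm is already closed")
        nxt = _NEXT[self.stage]
        if nxt is Stage.CLOSED and not self.verified:
            raise ValueError("verify the fix before closing")
        self.stage = nxt
        return self.stage
```

The guard in `advance` encodes the "close only after verification" rule, so a closure that skips it fails loudly instead of passing silently.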
Evaluation & Continuous Improvement
Conduct quarterly reviews of capacity, availability, and threshold settings. Trend analysis of CPU, memory, network bandwidth, and business‑level metrics (e.g., transaction volume) informs expansion planning.
System Operations
Daily health checks of the monitoring platform.
Automated daily backups of configuration and data.
Periodic audit of user accounts.
Retain monitoring logs for at least one year.
Open‑Source Monitoring Stack
Architecture
Combine Zabbix (distributed data collection, processing, and alerting) with Grafana (visualization) to build a centralized, extensible monitoring solution.
Metric Categories (Google SRE)
Error metrics – e.g., failed database connections, missing middleware processes.
Latency metrics – e.g., transaction processing time > 1 s, Elasticsearch query latency > 5 s.
Traffic metrics – e.g., orders per second, network interface throughput.
Saturation metrics – e.g., CPU utilization > 80 %, disk I/O nearing capacity.
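The thresholds quoted in the examples above can be written down as data and checked against a sample of current values. This is an illustrative sketch only: the metric names are hypothetical, and the traffic examples carry no threshold in the list, so no rule is defined for them here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    signal: str       # error, latency, traffic, or saturation
    metric: str       # hypothetical metric name
    threshold: float  # a breach is any value strictly above this


RULES = [
    Rule("error", "db_failed_connections", 0),
    Rule("latency", "txn_processing_seconds", 1.0),
    Rule("latency", "es_query_seconds", 5.0),
    Rule("saturation", "cpu_util_pct", 80.0),
]


def breaches(sample: dict[str, float]) -> list[str]:
    """Return 'signal:metric' for every rule whose threshold the sample exceeds."""
    return [
        f"{rule.signal}:{rule.metric}"
        for rule in RULES
        if rule.metric in sample and sample[rule.metric] > rule.threshold
    ]


if __name__ == "__main__":
    print(breaches({"txn_processing_seconds": 1.4, "cpu_util_pct": 75.0}))
```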
Alarm Severity
Define three levels: High (urgent), Warning (important but not urgent), Info (informational). Set thresholds based on historical averages, peak values, and business impact. Adjust check intervals to meet SLA requirements (e.g., 1‑minute checks for 99.99 % availability).
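The link between an availability target and the check interval becomes clearer with a little arithmetic. The sketch below computes the downtime an availability target allows over a period and the share of that budget a single check interval would use before an outage is even detected; the 30‑day period is an assumption, since the article does not state a measurement window.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed over the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def detection_share(
    check_interval_min: float, availability_pct: float, period_days: int = 30
) -> float:
    """Fraction of the downtime budget consumed by one worst-case detection delay."""
    return check_interval_min / downtime_budget_minutes(availability_pct, period_days)


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.99)
    print(f"budget: {budget:.2f} min per 30 days")
    print(f"1-minute checks use up to {detection_share(1, 99.99):.0%} of it")
    print(f"5-minute checks use up to {detection_share(5, 99.99):.0%} of it")
```

At 99.99 % over 30 days the budget is about 4.32 minutes, so a 5‑minute interval could exhaust it on detection alone, while a 1‑minute interval leaves time for diagnosis and recovery.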
Implementation Methods
Agent monitoring: install the Zabbix agent on each host; use built‑in items or custom scripts (e.g., a Python query for daily order cancellations).
Trapper monitoring: use the Zabbix sender utility (zabbix_sender) to push JSON data without an agent (e.g., a safe‑run days counter).
SNMP monitoring: poll metrics from network devices, UPS, and storage arrays.
Custom macros for DR: use macros as switches to enable or disable monitoring of the production and disaster‑recovery environments.
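For the trapper method, a script typically computes a value and hands it to `zabbix_sender`. The sketch below computes a safe‑run days counter and builds the command line using the utility's `-z` (server), `-s` (host), `-k` (key), and `-o` (value) options. The server address, host name, item key, and incident date are hypothetical, and the receiving item is assumed to be configured as a trapper item.

```python
from datetime import date


def safe_run_days(last_incident: date, today: date) -> int:
    """Whole days elapsed since the last recorded incident."""
    return (today - last_incident).days


def sender_command(server: str, host: str, key: str, value: object) -> list[str]:
    """Build the zabbix_sender argument list for one value."""
    return ["zabbix_sender", "-z", server, "-s", host, "-k", key, "-o", str(value)]


if __name__ == "__main__":
    days = safe_run_days(date(2024, 1, 1), date(2024, 1, 31))
    # Hypothetical server, host, and key; substitute the values used on site.
    cmd = sender_command("zabbix.example.internal", "trade-core-01", "safe.run.days", days)
    print(" ".join(cmd))
```

On a machine where `zabbix_sender` is installed, the list can be passed to `subprocess.run(cmd, check=True)` from a scheduled job.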
Visualization
Use Zabbix Maps to draw topology diagrams and Grafana Status Panel plugins to present module health. Example: a map of the margin financing and securities lending system in which an orange panel marks an abnormal component, so the faulty module can be pinpointed quickly.
Alarm Configuration Details
Set thresholds according to historical data, peak loads, and SLA impact. For high‑availability services (e.g., 99.99 % uptime) use a 1‑minute check interval; for lower‑criticality services a longer interval may suffice. Adjust thresholds for traffic, latency, and saturation metrics based on average and maximum values, as well as business growth trends.
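Deriving thresholds from historical averages and peaks could look like the sketch below. The specific policy (Warning at the midpoint between average and peak, High at the peak plus 10 % headroom) is an assumption chosen for illustration, not a rule from the article; each team would tune it to business impact.

```python
from statistics import mean


def derive_thresholds(history: list[float], headroom: float = 0.10) -> dict[str, float]:
    """Derive Warning and High thresholds from historical samples.

    Assumed policy: Warning sits halfway between the average and the peak,
    High sits above the historical peak by the given headroom.
    """
    if not history:
        raise ValueError("history must not be empty")
    avg = mean(history)
    peak = max(history)
    return {
        "average": avg,
        "peak": peak,
        "warning": (avg + peak) / 2,
        "high": peak * (1 + headroom),
    }


def classify(value: float, thresholds: dict[str, float]) -> str:
    """Map a current value onto the three severity levels."""
    if value >= thresholds["high"]:
        return "High"
    if value >= thresholds["warning"]:
        return "Warning"
    return "Info"


if __name__ == "__main__":
    # Hypothetical orders-per-second samples from past trading days.
    t = derive_thresholds([40, 50, 60, 70, 80])
    print(t, classify(75, t))
```

Recomputing the thresholds at each quarterly review lets them follow business growth instead of staying at their initial values.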
Example Command for EMC Storage Battery Monitoring
naviseccli -h [IP address] getcrus | grep Present | wc -l
dbaplus Community
