Unified Metrics, Tracing, and Logging: A Financial Firm’s Path to Microservice Observability
Facing the challenges of distributed microservice architectures, a financial services company implemented a unified observability platform that combines metrics, tracing, and logging via OpenTelemetry and custom agents, enabling real‑time visualization, anomaly detection, and performance analysis across seven core business middle‑platforms.
Introduction
Microservice architectures are increasingly adopted across industries for their lightweight, agile, and maintainable characteristics. However, the distributed nature of microservices creates observability challenges for developers, testers, operators, and business analysts. Traditional monitoring methods no longer suffice.
Background
In June 2019, Oriental Securities released the gRPC‑Nebula service‑governance framework and announced a "big middle‑platform" strategy. To support rapid business innovation, the company reorganized its wealth‑management domain into seven core middle‑platforms (account, product, sales, asset, transaction summary, market data, and information), all built on gRPC‑Nebula and accessed via a service‑governance platform.
Observability Challenges
Developers must trace end‑to‑end request topologies across a web‑like service call graph.
Testers need to reconstruct request flows from logs across multiple nodes.
Operators must pinpoint faulty nodes and measure latency per service and interface.
Business analysts require consolidated data from multiple platforms for accurate reporting.
Key Concepts
Observability
Originating from control theory, observability measures how well internal system states can be inferred from external outputs. In distributed systems, both component‑level outputs (logging, metrics) and inter‑component flows (tracing) are required.
Three Pillars of Observability
Metrics Data : Counters, gauges, histograms, summaries.
Logging Data : Fine‑grained events, variables, request/response records.
Tracing Data : Distributed request lifecycles represented by trace IDs and spans.
These pillars complement each other: metrics trigger alerts, tracing locates the problematic module, and logs reveal the root cause.
OpenTelemetry
OpenTelemetry, announced in 2019 as the merger of OpenTracing and OpenCensus, provides a standardized data model, SDKs, and exporters for traces, metrics, and logs. The project itself is vendor‑neutral, but it is commonly paired with Prometheus for metrics storage and Jaeger for tracing; at the time of writing, logging standardization was still evolving.
Proposed Observability Solution
The solution, named the Oriental Securities Observability Platform, integrates logging, metrics, and tracing into a single pipeline.
Technical Architecture
The platform consists of three layers (see Image 3):
Data‑collection Agent : Captures logs and trace data in real time, assigns a common traceID, and publishes to Kafka topics Logging and Tracing.
Data‑Processing Module : Consumes Kafka streams, stores raw logs and traces in Elasticsearch, aggregates statistics, and writes results to MySQL.
Data‑Visualization Module : Presents correlated logs, traces, and metrics through dashboards.
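The three layers can be illustrated with a minimal in-memory sketch. It is not the platform's code: queue.Queue stands in for the Kafka topics Logging and Tracing, a list stands in for Elasticsearch, a dictionary stands in for the MySQL statistics table, and all function and field names are assumptions made for illustration.

```python
import json
import queue
from collections import defaultdict

# In-memory stand-ins (assumptions): queue.Queue for the Kafka topics,
# a list for the Elasticsearch index, a dict for the MySQL statistics table.
topics = {"Logging": queue.Queue(), "Tracing": queue.Queue()}
raw_store = []
stats = defaultdict(lambda: {"calls": 0, "total_ms": 0})


def agent_publish(trace_id, service, method, duration_ms, message):
    """Collection layer: tag the log and the span with one traceID, publish both."""
    topics["Logging"].put(json.dumps(
        {"traceId": trace_id, "service": service, "message": message}))
    topics["Tracing"].put(json.dumps(
        {"traceId": trace_id, "service": service, "method": method,
         "durationMs": duration_ms}))


def process_all():
    """Processing layer: keep raw records, then aggregate span statistics."""
    for name, topic in topics.items():
        while not topic.empty():
            record = json.loads(topic.get())
            raw_store.append({"topic": name, **record})
            if name == "Tracing":
                row = stats[(record["service"], record["method"])]
                row["calls"] += 1
                row["total_ms"] += record["durationMs"]


agent_publish("t-1", "account", "GetAccount", 12, "lookup ok")
agent_publish("t-2", "account", "GetAccount", 30, "lookup ok")
process_all()
```

A visualization layer would then read raw_store for per-request detail and stats for aggregated views, joining the two on traceID, service, and method.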
Key Techniques
TraceID Generation & MDC : UUID‑based traceIDs make collisions practically negligible. The traceID is stored in the Mapped Diagnostic Context (MDC) so that all logs and spans generated in the same thread hierarchy share the same identifier.
Log Format & Collection : Logs follow a unified pattern timestamp [LEVEL]: message, with timestamps in yyyy‑MM‑dd HH:mm:ss SSS format and JSON‑encoded message bodies. Custom LogbackAppender and Log4j2Appender with filters and converters forward logs to Kafka.
Span Model & Propagation : Each request is assigned a traceID; every span carries a spanID and a parentSpanID (pSpanID). Trace context is propagated in gRPC metadata, which travels as HTTP/2 headers, enabling end‑to‑end span reconstruction across services.
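The three techniques above can be sketched together in a few lines. This is a language-neutral illustration in Python, not the platform's Java implementation: contextvars stands in for the logging framework's MDC, a plain dictionary stands in for gRPC metadata, and the metadata keys x-trace-id and x-parent-span-id are hypothetical names, since the article does not state the real ones.

```python
import contextvars
import json
import uuid
from datetime import datetime

# contextvars stands in for the logging framework's MDC (assumption).
_trace_id = contextvars.ContextVar("trace_id", default=None)
_span_id = contextvars.ContextVar("span_id", default=None)

# Hypothetical metadata keys; the article does not name the real ones.
TRACE_KEY, PARENT_KEY = "x-trace-id", "x-parent-span-id"


def start_span(incoming_metadata=None):
    """Begin a span: reuse the caller's traceID if present, else create one."""
    metadata = incoming_metadata or {}
    trace_id = metadata.get(TRACE_KEY) or uuid.uuid4().hex
    span = {"traceId": trace_id,
            "spanId": uuid.uuid4().hex[:16],
            "pSpanId": metadata.get(PARENT_KEY)}
    _trace_id.set(trace_id)
    _span_id.set(span["spanId"])
    return span


def outgoing_metadata():
    """Metadata to attach to a downstream call made from the current span."""
    return {TRACE_KEY: _trace_id.get(), PARENT_KEY: _span_id.get()}


def format_log(level, fields, now=None):
    """Render 'timestamp [LEVEL]: message' with a JSON message body."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d %H:%M:%S ") + f"{now.microsecond // 1000:03d}"
    body = json.dumps({"traceId": _trace_id.get(), **fields}, sort_keys=True)
    return f"{stamp} [{level}]: {body}"


# Service A handles a fresh request, then calls service B.
root = start_span()
headers = outgoing_metadata()
child = start_span(headers)  # what service B would do on receipt
line = format_log("INFO", {"event": "order.accepted"},
                  now=datetime(2020, 1, 2, 3, 4, 5, 67000))
```

Because the log formatter reads the traceID from the context rather than from its arguments, application code does not need to pass the identifier around explicitly, which is the main reason for keeping it in an MDC-like store.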
Metrics Model
Metrics are divided into system and business categories, and each is tracked both for the current day and historically. Business‑specific metrics for wealth‑sales (see Image 4) are stored in MySQL for historical analysis and visualized via Grafana.
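As an illustration of how daily metrics might be derived, the sketch below rolls span records up into one row per day, service, and method. The record fields and the choice of call count, error count, and average latency are assumptions; the article does not list the actual columns stored in MySQL.

```python
from collections import defaultdict
from datetime import date

# Hypothetical span records; field names are assumptions.
spans = [
    {"day": date(2020, 1, 2), "service": "sales", "method": "PlaceOrder",
     "ms": 40, "ok": True},
    {"day": date(2020, 1, 2), "service": "sales", "method": "PlaceOrder",
     "ms": 60, "ok": False},
    {"day": date(2020, 1, 3), "service": "sales", "method": "PlaceOrder",
     "ms": 20, "ok": True},
]


def daily_rollup(records):
    """Aggregate spans into one row per (day, service, method)."""
    acc = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0})
    for r in records:
        row = acc[(r["day"], r["service"], r["method"])]
        row["calls"] += 1
        row["errors"] += 0 if r["ok"] else 1
        row["total_ms"] += r["ms"]
    return {key: {"calls": v["calls"], "errors": v["errors"],
                  "avg_ms": v["total_ms"] / v["calls"]}
            for key, v in acc.items()}


rows = daily_rollup(spans)
```

Keeping the day in the key means the same table serves both the current-day view and historical queries over a date range.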
Implementation Effects
Distributed Call‑Chain Visualization
The platform renders request‑level call‑chain trees with service name, method, latency, status code, and request/response payloads. Visualization reduced test‑execution tracing time by 90%.
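A call-chain tree of this kind can be rebuilt from flat span records by grouping spans under their pSpanID. The following is a minimal sketch with hypothetical span data and field names; payloads are omitted for brevity.

```python
from collections import defaultdict

# Hypothetical flat span records, in arbitrary arrival order.
spans = [
    {"spanId": "b", "pSpanId": "a", "service": "product",
     "method": "GetProduct", "ms": 15, "status": "OK"},
    {"spanId": "a", "pSpanId": None, "service": "sales",
     "method": "PlaceOrder", "ms": 48, "status": "OK"},
    {"spanId": "c", "pSpanId": "a", "service": "account",
     "method": "Debit", "ms": 22, "status": "ERROR"},
]


def render_tree(records):
    """Return indented lines, one per span, parents before children."""
    children = defaultdict(list)
    for s in records:
        children[s["pSpanId"]].append(s)
    lines = []

    def walk(parent_id, depth):
        for s in children[parent_id]:
            lines.append(f"{'  ' * depth}{s['service']}.{s['method']} "
                         f"{s['ms']}ms {s['status']}")
            walk(s["spanId"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


tree = render_tree(spans)
```

Spans can arrive in any order because the tree is assembled only after all records for the trace have been grouped by parent.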
Anomaly Detection & Diagnosis
When an error event triggers an alarm, the platform locates the offending log via traceID, displays the problematic span, and shows its input/output parameters, cutting diagnosis time by roughly 90%.
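The lookup described here amounts to using the traceID of the alarming log line as a join key across logs and spans. A minimal sketch, assuming hypothetical in-memory records in place of the Elasticsearch queries the platform would issue:

```python
# Hypothetical stored records; field names are assumptions for illustration.
logs = [
    {"traceId": "t-1", "level": "INFO", "message": "order received"},
    {"traceId": "t-2", "level": "INFO", "message": "order received"},
    {"traceId": "t-2", "level": "ERROR", "message": "insufficient balance"},
]
spans = [
    {"traceId": "t-2", "spanId": "a", "service": "sales", "status": "OK",
     "request": {"orderId": 7}, "response": {"accepted": False}},
    {"traceId": "t-2", "spanId": "b", "service": "account", "status": "ERROR",
     "request": {"amount": 500}, "response": {"error": "insufficient balance"}},
]


def diagnose(trace_id):
    """Collect the error logs and failed spans that belong to one trace."""
    error_logs = [entry for entry in logs
                  if entry["traceId"] == trace_id and entry["level"] == "ERROR"]
    failed = [s for s in spans
              if s["traceId"] == trace_id and s["status"] != "OK"]
    return {"logs": error_logs, "spans": failed}


# An alarm fired on the ERROR log line; its traceID is the lookup key.
alarm_trace_id = next(e["traceId"] for e in logs if e["level"] == "ERROR")
report = diagnose(alarm_trace_id)
```

The report carries the failed span's request and response, so the input that caused the error is visible without searching each node's log files.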
Metric‑Trace Correlation
Daily service‑level call‑volume metrics link to the list of requests for a specific interface, which in turn opens the corresponding call‑chain tree. This enables pinpointing high‑latency spans for performance tuning and correlating error logs with metric spikes.
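One way to pinpoint a high-latency span inside a call chain is to rank spans by self time, that is, a span's duration minus the durations of its direct children. This ranking is an assumption of the sketch, not something the article specifies, and it presumes child calls run sequentially; the span data is hypothetical.

```python
from collections import defaultdict

# Hypothetical spans of one trace: a calls b, b calls c.
spans = [
    {"spanId": "a", "pSpanId": None, "method": "PlaceOrder", "ms": 100},
    {"spanId": "b", "pSpanId": "a", "method": "CheckRisk", "ms": 70},
    {"spanId": "c", "pSpanId": "b", "method": "GetQuote", "ms": 10},
]


def slowest_by_self_time(records):
    """Return the span with the largest duration not explained by its children."""
    child_total = defaultdict(int)
    for s in records:
        if s["pSpanId"] is not None:
            child_total[s["pSpanId"]] += s["ms"]
    return max(records, key=lambda s: s["ms"] - child_total[s["spanId"]])


slowest = slowest_by_self_time(spans)
```

Here the root span has the largest total duration, but most of that time is spent waiting on CheckRisk, which is why self time rather than raw duration is used for ranking.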
Real‑Time Dashboard
Grafana dashboards display both system and business metrics for the wealth‑sales domain, providing an at‑a‑glance view of service health and business performance.
Conclusion
The presented solution addresses the observability gaps inherent in distributed microservice architectures by fusing metrics, tracing, and logging. Because the SDK is integrated into the gRPC‑Nebula framework itself, the approach requires little change to application code. It provides developers with topology insights, testers with faster execution tracing, operators with precise fault isolation, and business users with customizable reports. The approach is broadly applicable to other enterprises facing similar observability challenges.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
