
How to Transform Operations Monitoring with Big Data Thinking

This article explains how to apply big‑data concepts and platforms to operations monitoring, covering data sources, metric extraction from logs, architectural design with Flume, Spark Streaming and HBase, implementation steps, and the resulting benefits for scalability and rapid metric development.

dbaplus Community

Current Operations Monitoring Landscape

Many organizations monitor only basic server health (CPU, memory, and similar resources), while business‑level monitoring is fragmented and often built from ad‑hoc scripts or disparate third‑party tools.

Key Data Sources and Metric Categories

All relevant data ultimately originate from logs, whether text or binary. From logs you can derive four groups of metrics:

Business metrics – e.g., transactions per second, orders created per minute.

Application metrics – error counts, latency percentiles (95th, max), request traces.

System‑resource metrics – CPU, memory, swap, disk usage, load average.

Network metrics – packet loss, ping latency, traffic volume, TCP connection counts.
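As a minimal sketch of how the application metrics listed above might be derived from log lines, the following Python parses a hypothetical key=value log layout and computes the error count, the 95th‑percentile latency, and the maximum latency. The line format, the LogEvent type, and the function names are assumptions made for illustration; they do not come from the original article.

```python
import math
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical log line layout, assumed for illustration only:
#   <ISO timestamp> service=<name> url=<path> status=<code> latency_ms=<int>
LINE_RE = re.compile(
    r"^(?P<ts>\S+) service=(?P<service>\S+) url=(?P<url>\S+) "
    r"status=(?P<status>\d{3}) latency_ms=(?P<latency_ms>\d+)$"
)


@dataclass(frozen=True)
class LogEvent:
    ts: str
    service: str
    url: str
    status: int
    latency_ms: int


def parse_line(line: str) -> Optional[LogEvent]:
    """Return a LogEvent, or None when the line does not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEvent(m["ts"], m["service"], m["url"],
                    int(m["status"]), int(m["latency_ms"]))


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def application_metrics(events: List[LogEvent]) -> dict:
    """Compute request count, 5xx error count, p95 and max latency."""
    if not events:
        return {"requests": 0, "errors_5xx": 0,
                "latency_p95_ms": None, "latency_max_ms": None}
    latencies = [e.latency_ms for e in events]
    return {
        "requests": len(events),
        "errors_5xx": sum(1 for e in events if 500 <= e.status <= 599),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_max_ms": max(latencies),
    }
```

Lines that do not match the layout are returned as None rather than raising, so a caller can count or discard them separately.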

Big‑Data‑Driven Monitoring Architecture

A platform can be built from existing big‑data components:

Log collection agents (e.g., Flume or custom agents).

Real‑time processing with Spark Streaming (or Storm).

Aggregated metric storage in a scalable NoSQL store such as HBase (or Elasticsearch for ELK‑style dashboards).

Visualization layer (custom dashboards or Kibana/ELK).

The architecture is illustrated below:

Architecture diagram
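The separation of concerns in this architecture can be sketched in plain Python, with one function or class standing in for each component. This is only an in‑process illustration: collect stands in for a collection agent such as Flume, process for a stream processor such as Spark Streaming, and InMemoryStore for a metric store such as HBase. None of these names are real APIs of those systems, and the key=value line format is assumed.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def collect(lines: Iterable[str]) -> Iterator[str]:
    """Agent stage (stand-in): forward non-empty lines, stripped."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def process(lines: Iterable[str]) -> Counter:
    """Processing stage (stand-in): count events per (service, status)."""
    counts: Counter = Counter()
    for line in lines:
        # Assumed format: whitespace-separated key=value pairs.
        fields = dict(part.split("=", 1) for part in line.split() if "=" in part)
        if "service" in fields and "status" in fields:
            counts[(fields["service"], fields["status"])] += 1
    return counts


class InMemoryStore:
    """Storage stage (stand-in): accumulates counters by string key."""

    def __init__(self) -> None:
        self.rows: Dict[str, int] = {}

    def put(self, key: str, value: int) -> None:
        self.rows[key] = self.rows.get(key, 0) + value


def run_pipeline(lines: Iterable[str], store: InMemoryStore) -> None:
    """Wire the three stages together: collect -> process -> store."""
    for (service, status), n in process(collect(lines)).items():
        store.put(f"{service}|{status}", n)
```

Because each stage only depends on the shape of the data it receives, any one of them could be swapped for a real component without touching the others.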

Implementation Example

The author applied the stack to monitor three services: recommendation, search, and a unified query engine. The concrete monitoring outputs include:

Status‑code dashboard – URL‑level ranking of HTTP 5xx responses (top 100 URLs) with associated service.

Response‑time dashboard – URL‑level ranking of average latency over a recent window (e.g., 5 minutes).

Trace system – a unique request UUID enables end‑to‑end request‑chain visibility similar to Google Dapper or Taobao EagleEye. This requires instrumenting RPC/HTTP calls.
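The two ranking dashboards above boil down to simple aggregations. The sketch below shows one way to compute them over already‑parsed events, assuming each event is a tuple of (timestamp, service, url, status, latency_ms). The tuple layout and the function names are illustrative assumptions, not the author's implementation.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Assumed event shape: (timestamp, service, url, status, latency_ms)
Event = Tuple[datetime, str, str, int, float]


def top_5xx_urls(events: Iterable[Event],
                 limit: int = 100) -> List[Tuple[Tuple[str, str], int]]:
    """Rank (service, url) pairs by their number of HTTP 5xx responses."""
    counts: Counter = Counter()
    for _ts, service, url, status, _latency in events:
        if 500 <= status <= 599:
            counts[(service, url)] += 1
    return counts.most_common(limit)


def slowest_urls(events: Iterable[Event], now: datetime,
                 window: timedelta = timedelta(minutes=5),
                 limit: int = 100) -> List[Tuple[str, float]]:
    """Rank URLs by average latency over the window ending at `now`."""
    cutoff = now - window
    totals = defaultdict(lambda: [0.0, 0])  # url -> [latency sum, count]
    for ts, _service, url, _status, latency_ms in events:
        if cutoff < ts <= now:
            totals[url][0] += latency_ms
            totals[url][1] += 1
    averages = {url: total / n for url, (total, n) in totals.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:limit]
```

In a streaming deployment the same aggregations would run incrementally per batch rather than over a full list, but the ranking logic is the same.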

When a service stops emitting logs, the platform can infer that the service is down without explicit health‑check scripts.
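One way to implement this inference is to track the timestamp of the most recent log event per service and flag any service that has been silent for longer than a threshold. The sketch below assumes a two‑minute threshold and hypothetical function names; both are illustrative choices.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def record_log(last_seen: Dict[str, datetime], service: str,
               ts: datetime) -> None:
    """Remember the newest log timestamp seen for a service."""
    previous = last_seen.get(service)
    if previous is None or ts > previous:
        last_seen[service] = ts


def silent_services(last_seen: Dict[str, datetime], now: datetime,
                    max_silence: timedelta = timedelta(minutes=2)) -> List[str]:
    """Return services whose newest log is older than max_silence."""
    return sorted(
        service for service, ts in last_seen.items()
        if now - ts > max_silence
    )
```

Silence is only a hint: a low‑traffic service or a stalled collection agent looks the same as a dead service, so the threshold would need tuning per service.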

Development Tasks

One‑time dashboard – build a visualization that reads aggregated metrics from HBase (or Elasticsearch) and presents them.

Long‑term streaming job – implement a Spark Streaming (or Storm) application that parses incoming logs, computes the defined metrics, and writes results to HBase.
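The core of the streaming job can be sketched independently of any framework as a function applied to each micro‑batch. In the sketch below a plain dict stands in for an HBase table, and the row‑key layout, field names, and helper functions are assumptions for illustration; a real job would use the Spark Streaming (or Storm) and HBase client APIs instead.

```python
from typing import Dict, Iterable, Mapping


def row_key(service: str, minute: int, url: str) -> str:
    """Build a row key; the zero-padded minute keeps keys sorted by time."""
    return f"{service}|{minute:010d}|{url}"


def process_batch(batch: Iterable[Mapping], store: Dict[str, dict]) -> None:
    """Fold one micro-batch of parsed events into per-minute counters.

    Each event is assumed to carry: service, url, ts (epoch seconds),
    status, latency_ms.
    """
    for event in batch:
        minute = int(event["ts"]) // 60 * 60  # start of the minute bucket
        key = row_key(event["service"], minute, event["url"])
        row = store.setdefault(
            key, {"requests": 0, "errors_5xx": 0, "latency_sum_ms": 0}
        )
        row["requests"] += 1
        row["latency_sum_ms"] += event["latency_ms"]
        if 500 <= event["status"] <= 599:
            row["errors_5xx"] += 1


def average_latency_ms(row: Mapping) -> float:
    """Derive the average at read time from the stored sum and count."""
    return row["latency_sum_ms"] / row["requests"]
```

Storing the latency sum and the request count, rather than a precomputed average, lets successive batches be merged by simple addition and leaves the division to the dashboard.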

Practical Adoption Steps

Enumerate all log‑producing sources across engineering and business domains.

Define the concrete metrics to be extracted from each log source.

Select and provision the big‑data components (Flume → Spark Streaming → HBase/Elasticsearch) and wire them together as modular building blocks.
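The first two steps can be captured in a small machine‑readable inventory, which makes it easy to spot log sources that have no metrics defined yet. The sketch below is one possible shape; the LogSource fields, the owner names, and the file paths are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogSource:
    name: str          # service or system producing the log
    owner: str         # team responsible for the log format
    path: str          # where the collection agent reads the log
    metrics: List[str] = field(default_factory=list)  # metrics to extract


def sources_without_metrics(inventory: List[LogSource]) -> List[str]:
    """List the sources that still need metric definitions."""
    return [source.name for source in inventory if not source.metrics]


# Hypothetical inventory entries, for illustration only.
inventory = [
    LogSource("search", "search-team", "/var/log/search/access.log",
              ["errors_5xx", "latency_p95_ms"]),
    LogSource("recommendation", "rec-team", "/var/log/rec/access.log"),
]
```

Running sources_without_metrics(inventory) on this example reports that the recommendation source still lacks metric definitions.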

Cautions and Extensions

Enforce a unified log format across services to avoid costly compatibility layers.

Persist raw or intermediate processed data to enable ad‑hoc analysis with SparkSQL; new metrics can be added by deploying an additional Spark Streaming job within hours.

For complex product lines, allocate effort to standardize log schemas while gradually adapting the processing pipeline.
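A unified log format is easier to enforce when services share a formatter and the pipeline validates what it receives. The sketch below uses Python's standard logging module to emit one JSON object per line and checks incoming lines for a set of required fields. The field names (ts, service, level, message, trace_id) are an assumed schema, not one prescribed by the article.

```python
import json
import logging

# Assumed minimal schema every service must emit.
REQUIRED_FIELDS = ("ts", "service", "level", "message")


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
            # Optional request UUID, set by the caller via `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload, sort_keys=True)


def is_valid(line: str) -> bool:
    """Check that a line is a JSON object with all required fields."""
    try:
        doc = json.loads(line)
    except ValueError:
        return False
    return isinstance(doc, dict) and all(k in doc for k in REQUIRED_FIELDS)
```

A service would attach the formatter to its handler with handler.setFormatter(JsonFormatter("search")) and pass the request UUID through extra={"trace_id": ...} on each logging call.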

Implementation diagram