Big Data 18 min read

How eBay Scales Real‑Time Monitoring with Flink: Metadata‑Driven Streaming

This article explains how eBay’s Sherlock.IO monitoring platform processes billions of logs, events, and metrics daily using Flink Streaming jobs, detailing a metadata‑driven architecture, shared job strategies, Heartbeat‑based monitoring, job isolation, back‑pressure handling, and real‑world use cases such as Event Alerting, Eventzon, and Netmon.

dbaplus Community
dbaplus Community
dbaplus Community
How eBay Scales Real‑Time Monitoring with Flink: Metadata‑Driven Streaming

Monitoring System Flink Overview

eBay’s Sherlock.IO processes hundreds of billions of logs, events and metrics daily. The monitoring team operates eight Flink clusters; the largest contains over a thousand TaskManagers and runs hundreds of jobs, some of which have been stable for more than six months.

Metadata‑Driven Architecture

A metadata micro‑service exposes a RESTful API that describes a Flink job as a JSON DAG. The service separates three concepts:

Resource : physical assets required by a namespace (Flink cluster, Kafka brokers, Elasticsearch clusters, etc.).

Capability : a reusable DAG definition together with the Java class for each operator.

Policy : a binding of a Capability to concrete runtime parameters (Kafka topic, Elasticsearch index, operator skips, Jexl filter expressions).

The micro‑service adaptor reads this metadata and creates Flink jobs via the Flink Streaming API, shielding users from low‑level API calls. Migration to another stream engine only requires a new adaptor.

Metadata micro‑service framework
Metadata micro‑service framework

Capability

A Capability defines the DAG and the class for each operator. Example: the eventProcess Capability reads from Kafka, writes to Elasticsearch, has parallelism 5, and consists of a simple Source → Sink DAG.

eventProcess Capability
eventProcess Capability
Generated Flink job
Generated Flink job

Policy

Each namespace defines one or more Policies that bind a Capability to concrete parameters such as the Kafka topic, Elasticsearch index, and optional operator skips. Policies can embed Jexl filter expressions to drop unwanted records, improving throughput. A Zookeeper‑driven periodic update mechanism applies Policy changes without restarting the job.

Policy example
Policy example

Resource

Resource objects map a namespace to the physical clusters it should read from or write to, enabling a single job to resolve the correct Kafka and Elasticsearch endpoints at runtime.

Shared Jobs

Identical DAGs can be reused by assigning the same Capability to multiple Policies. For SQL‑based Capabilities, a single Flink job can host many Policies by allocating multiple slots (e.g., 20 slots, one per Policy). This reduces JobManager overhead and allows a single read from a Kafka topic to serve multiple namespaces.

Flink Job Optimization and Monitoring

Heartbeat

A custom Heartbeat source is injected into every job. The Heartbeat traverses the entire DAG, tagging each operator with a timestamp and finally emitting a metric to Sherlock.IO. The metric contains generation time, ingress time, and per‑operator timestamps, enabling:

Latency measurement per operator.

Data‑loss detection (missing or delayed Heartbeats).

Back‑pressure identification.

Heartbeat flow
Heartbeat flow

Availability

Three unavailability conditions are defined:

Job restart caused by OOM or runtime errors.

Job termination due to infrastructure failure or exceeding the restart limit.

Back‑pressure that halts data processing.

Heartbeat is emitted every 10 seconds (six times per minute). Availability for a pipeline is calculated as the ratio of received Heartbeats to the expected 8 640 per day.

Availability formula
Availability formula

Job Isolation

To prevent resource contention, each TaskManager is configured with a single slot and dedicated CPU/heap limits:

taskmanager.numberOfTaskSlots: 1
cpu_period: 100000
cpu_quota: 50000   # 50% of a CPU core
taskmanager.heap.mb: 4096

This guarantees that each Flink job runs in isolation, avoiding interference from other jobs on the same JVM.

Job isolation diagram
Job isolation diagram

Back‑pressure Detection

When back‑pressure occurs, the Heartbeat stalls upstream. To pinpoint the offending operator, the system periodically dumps each operator’s stack trace. The stack trace reveals the bottleneck operator.

Back‑pressure stack trace
Back‑pressure stack trace

Additional Monitoring Tools

Flink History Server – queries completed‑job metrics such as restart count and runtime.

Daemon thread – compares stored metadata with running jobs every minute; mismatches trigger alerts.

Real‑World Use Cases

Event Alerting

The EventAlertingCapability processes policies that define threshold rules. Example rule: when the application field equals "r1rover", the host name starts with "r1rover", and the metric value exceeds 90, an alert is emitted to a Kafka topic for downstream handling.

Alert rule
Alert rule
Alert output
Alert output

Eventzon

Eventzon aggregates events from many eBay services. A pipeline composed of multiple Capabilities (Filter, Deduplicate, Enrich, etc.) cleans and enriches the data before generating alerts delivered via email, Slack or PagerDuty.

Netmon

Netmon monitors network device logs. An EnrichCapability adds device metadata (IP, data‑center, rack) to each log entry. Alerts for error logs are persisted to Elasticsearch and displayed on the Netmon dashboard. Resolved alerts are automatically marked; manual resolution is also supported.

Netmon alert stack trace
Netmon alert stack trace

Conclusion

Flink Streaming satisfies eBay’s low‑latency requirements for complex real‑time monitoring. Ongoing work focuses on reducing false alerts caused by job restarts and integrating advanced AI algorithms to improve alert precision and system stability.

References

https://ci.apache.org/projects/flink/flink-docs-release-1.7/concepts/runtime.html#task-slots-and-resources

https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html

https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/historyserver.html

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataReal-time ProcessingFlinkStreamingmetadata service
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.