How eBay Scales Real‑Time Monitoring with Flink: Metadata‑Driven Streaming
This article explains how eBay’s Sherlock.IO monitoring platform processes billions of logs, events, and metrics daily using Flink Streaming jobs, detailing a metadata‑driven architecture, shared job strategies, Heartbeat‑based monitoring, job isolation, back‑pressure handling, and real‑world use cases such as Event Alerting, Eventzon, and Netmon.
Monitoring System Flink Overview
eBay’s Sherlock.IO processes hundreds of billions of logs, events and metrics daily. The monitoring team operates eight Flink clusters; the largest contains over a thousand TaskManagers and runs hundreds of jobs, some of which have been stable for more than six months.
Metadata‑Driven Architecture
A metadata micro‑service exposes a RESTful API that describes a Flink job as a JSON DAG. The service separates three concepts:
Resource : physical assets required by a namespace (Flink cluster, Kafka brokers, Elasticsearch clusters, etc.).
Capability : a reusable DAG definition together with the Java class for each operator.
Policy : a binding of a Capability to concrete runtime parameters (Kafka topic, Elasticsearch index, operator skips, Jexl filter expressions).
The micro‑service adaptor reads this metadata and creates Flink jobs via the Flink Streaming API, shielding users from low‑level API calls. Migration to another stream engine only requires a new adaptor.
Capability
A Capability defines the DAG and the class for each operator. Example: the eventProcess Capability reads from Kafka, writes to Elasticsearch, has parallelism 5, and consists of a simple Source → Sink DAG.
Policy
Each namespace defines one or more Policies that bind a Capability to concrete parameters such as the Kafka topic, Elasticsearch index, and optional operator skips. Policies can embed Jexl filter expressions to drop unwanted records, improving throughput. A Zookeeper‑driven periodic update mechanism applies Policy changes without restarting the job.
Resource
Resource objects map a namespace to the physical clusters it should read from or write to, enabling a single job to resolve the correct Kafka and Elasticsearch endpoints at runtime.
Shared Jobs
Identical DAGs can be reused by assigning the same Capability to multiple Policies. For SQL‑based Capabilities, a single Flink job can host many Policies by allocating multiple slots (e.g., 20 slots, one per Policy). This reduces JobManager overhead and allows a single read from a Kafka topic to serve multiple namespaces.
Flink Job Optimization and Monitoring
Heartbeat
A custom Heartbeat source is injected into every job. The Heartbeat traverses the entire DAG, tagging each operator with a timestamp and finally emitting a metric to Sherlock.IO. The metric contains generation time, ingress time, and per‑operator timestamps, enabling:
Latency measurement per operator.
Data‑loss detection (missing or delayed Heartbeats).
Back‑pressure identification.
Availability
Three unavailability conditions are defined:
Job restart caused by OOM or runtime errors.
Job termination due to infrastructure failure or exceeding the restart limit.
Back‑pressure that halts data processing.
Heartbeat is emitted every 10 seconds (six times per minute). Availability for a pipeline is calculated as the ratio of received Heartbeats to the expected 8 640 per day.
Job Isolation
To prevent resource contention, each TaskManager is configured with a single slot and dedicated CPU/heap limits:
taskmanager.numberOfTaskSlots: 1
cpu_period: 100000
cpu_quota: 50000 # 50% of a CPU core
taskmanager.heap.mb: 4096This guarantees that each Flink job runs in isolation, avoiding interference from other jobs on the same JVM.
Back‑pressure Detection
When back‑pressure occurs, the Heartbeat stalls upstream. To pinpoint the offending operator, the system periodically dumps each operator’s stack trace. The stack trace reveals the bottleneck operator.
Additional Monitoring Tools
Flink History Server – queries completed‑job metrics such as restart count and runtime.
Daemon thread – compares stored metadata with running jobs every minute; mismatches trigger alerts.
Real‑World Use Cases
Event Alerting
The EventAlertingCapability processes policies that define threshold rules. Example rule: when the application field equals "r1rover", the host name starts with "r1rover", and the metric value exceeds 90, an alert is emitted to a Kafka topic for downstream handling.
Eventzon
Eventzon aggregates events from many eBay services. A pipeline composed of multiple Capabilities (Filter, Deduplicate, Enrich, etc.) cleans and enriches the data before generating alerts delivered via email, Slack or PagerDuty.
Netmon
Netmon monitors network device logs. An EnrichCapability adds device metadata (IP, data‑center, rack) to each log entry. Alerts for error logs are persisted to Elasticsearch and displayed on the Netmon dashboard. Resolved alerts are automatically marked; manual resolution is also supported.
Conclusion
Flink Streaming satisfies eBay’s low‑latency requirements for complex real‑time monitoring. Ongoing work focuses on reducing false alerts caused by job restarts and integrating advanced AI algorithms to improve alert precision and system stability.
References
https://ci.apache.org/projects/flink/flink-docs-release-1.7/concepts/runtime.html#task-slots-and-resources
https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/metrics.html
https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/historyserver.html
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
