How Apache Eagle Secures Hadoop: Real‑Time Big Data Threat Detection
Apache Eagle is an open‑source, distributed, real‑time security monitoring platform for Hadoop ecosystems. It combines a highly scalable stream‑processing policy engine, machine‑learning‑based user profiling, and flexible alerting to protect massive data assets across eBay's production clusters.
Announcement
eBay announced the open‑source release of Apache Eagle, which became an Apache Incubator project on 26 October 2015.
Project website: http://goeagle.io

Eagle at eBay
Eagle's data‑behavior monitoring is deployed on a Hadoop cluster of more than 2,500 nodes, protecting hundreds of petabytes of data, with plans to expand to more than ten clusters totaling over 10,000 nodes.
In production, basic security policies have been configured for HDFS, Hive, and other services, with additional policies to be added throughout the year.
Policies cover access patterns, frequent data sets, predefined query types, Hive tables and columns, HBase tables, and user‑profile‑based rules, as well as preventing data loss, unauthorized copying, and sensitive data exposure.
Project Background
With the growth of big data, eBay processes petabytes of data across more than 10,000 Hadoop nodes, supporting billions of users. Managing and monitoring such scale introduces severe security challenges.
eBay’s security measures include access control, isolation, data classification, encryption, and real‑time behavior monitoring.
Existing products could not meet the need for massive real‑time data‑flow monitoring, prompting eBay to build Eagle from scratch and open it to the community.
Eagle Features
Real‑time: alerts are generated with sub‑second latency.
Scalable: handles billions of data accesses daily across multi‑petabyte clusters.
Easy to use: sandbox environment enables setup in minutes with ready‑made examples.
User profiling: machine‑learning algorithms build behavior models for anomaly detection.
Open source: released under the Apache License and built on many open‑source big‑data projects.
Eagle Overview
1. Architecture – Data Collection and Storage
Eagle provides extensible APIs to ingest data from sources such as Kafka (for HDFS audit logs) and YARN (for Hive query logs), ensuring scalability and fault tolerance.
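Before events reach the policy engine, raw log lines must be turned into structured fields. As a minimal sketch (the class and method names here are hypothetical, not Eagle's actual code), parsing the key/value payload of a standard HDFS audit log line might look like this:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: split one HDFS audit log line into key/value
// fields, roughly what an ingestion transformer must do before events
// reach the policy engine.
public class AuditLogParser {

    // Parses the key=value section of a standard HDFS audit log line.
    public static Map<String, String> parse(String line) {
        Map<String, String> fields = new HashMap<>();
        // The audit payload starts at the "allowed=" field.
        int start = line.indexOf("allowed=");
        if (start < 0) {
            return fields;
        }
        for (String token : line.substring(start).split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "2015-11-20 10:15:30,123 INFO FSNamesystem.audit: "
                + "allowed=true ugi=alice (auth:SIMPLE) ip=/10.0.0.1 "
                + "cmd=open src=/tmp/private dst=null perm=null";
        Map<String, String> fields = parse(line);
        System.out.println(fields.get("cmd") + " " + fields.get("src"));
    }
}
```

Each parsed field (user, command, source path) then maps onto an attribute of the event schema that the stream metadata API declares.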
2. Data Processing
The Stream Processing API abstracts the underlying engine; by default it supports Apache Storm but can be extended to Flink, Samza, etc. Developers define DAGs for transformation, filtering, and joining without binding to a specific platform.
StormExecutionEnvironment env = ExecutionEnvironmentFactory.getStorm(config);
StreamProducer producer = env
    .newSource(new KafkaSourcedSpoutProvider().getSpout(config))
    .renameOutputFields(1)
    .flatMap(new AuditLogTransformer())
    .groupBy(Arrays.asList(0))
    .flatMap(new UserProfileAggregatorExecutor())
    .alertWithConsumer("userActivity", "userProfileExecutor");
env.execute();
The alerting framework consists of stream metadata API, policy engine service API, and partitioner API, enabling distributed policy execution.
Stream Metadata API: defines event schemas and runtime parsing.
Policy Engine Service API: plug‑in support for engines such as WSO2 Siddhi CEP and machine‑learning evaluators.
Policy Partitioner API: distributes policies across nodes for parallel execution.
public interface PolicyEvaluatorServiceProvider {
    String getPolicyType();
    Class getPolicyEvaluator();
    List getBindingModules();
}

public interface PolicyEvaluator {
    void evaluate(ValuesArray input) throws Exception;
    void onPolicyUpdate(AlertDefinitionAPIEntity newAlertDef);
    void onPolicyDelete();
}
public interface PolicyPartitioner extends Serializable {
    int partition(int numTotalPartitions, String policyType, String policyId);
}
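As a sketch of how a partitioner might be implemented (the class name and hashing scheme are illustrative assumptions, not Eagle's actual default), one can hash the policy identifier so that evaluation of a given policy always lands on the same node:

```java
import java.io.Serializable;

// The PolicyPartitioner contract from the alerting framework, reproduced
// here so the sketch is self-contained.
interface PolicyPartitioner extends Serializable {
    int partition(int numTotalPartitions, String policyType, String policyId);
}

// Hypothetical sketch: route each policy to a fixed partition by hashing
// its type and id, giving a stable assignment for parallel execution.
public class HashPolicyPartitioner implements PolicyPartitioner {
    @Override
    public int partition(int numTotalPartitions, String policyType, String policyId) {
        // Mask the sign bit first: Math.abs(Integer.MIN_VALUE) is still negative.
        int hash = (policyType + ":" + policyId).hashCode() & Integer.MAX_VALUE;
        return hash % numTotalPartitions;
    }

    public static void main(String[] args) {
        PolicyPartitioner p = new HashPolicyPartitioner();
        System.out.println(p.partition(4, "siddhiCEPEngine", "sensitiveColumnPolicy"));
    }
}
```

A stable assignment matters here: if the same policy bounced between nodes, windowed state (such as per-user counts) would be split and alerts could be missed.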
3. Machine‑Learning Module
Eagle builds user profiles from historical Hadoop usage using algorithms such as Eigen‑Value Decomposition and Density Estimation, enabling real‑time anomaly detection without predefined thresholds.
Two algorithms are provided:
Density Estimation: models probability density functions for each user’s behavior and flags low‑probability events as anomalies.
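To make the density-estimation idea concrete, here is a minimal one-dimensional sketch: fit a Gaussian to a user's historical values for one feature (say, files read per hour) and flag new observations whose density falls below a threshold. The class name and threshold are illustrative assumptions, not Eagle's actual implementation.

```java
// Hypothetical sketch of density-estimation-based anomaly detection:
// model one feature of a user's behavior as a Gaussian fitted to history,
// then flag low-probability observations.
public class DensityEstimator {
    private final double mean;
    private final double variance;

    public DensityEstimator(double[] history) {
        double sum = 0;
        for (double v : history) sum += v;
        mean = sum / history.length;
        double sq = 0;
        for (double v : history) sq += (v - mean) * (v - mean);
        // Small floor keeps the density finite for constant histories.
        variance = Math.max(sq / history.length, 1e-9);
    }

    // Gaussian probability density of a new observation.
    public double density(double x) {
        double d = x - mean;
        return Math.exp(-d * d / (2 * variance)) / Math.sqrt(2 * Math.PI * variance);
    }

    public boolean isAnomalous(double x, double threshold) {
        return density(x) < threshold;
    }

    public static void main(String[] args) {
        DensityEstimator d = new DensityEstimator(new double[]{9, 10, 11, 10, 10});
        System.out.println(d.isAnomalous(1000, 0.01)); // far from history → true
    }
}
```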
Eigen‑Value Decomposition: reduces dimensionality to identify normal sub‑spaces; deviations are detected via Euclidean distance.
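The subspace idea can likewise be sketched in a few lines. Assuming the principal components have already been learned offline (they are hard-coded in this illustration), an event's feature vector is projected onto the normal subspace and the Euclidean residual serves as the anomaly score; all names here are hypothetical:

```java
// Hypothetical sketch of subspace-based detection: project a feature
// vector onto the normal subspace spanned by precomputed principal
// components, and score by the Euclidean distance to that projection.
public class SubspaceDetector {
    private final double[][] components; // orthonormal rows, learned offline

    public SubspaceDetector(double[][] components) {
        this.components = components;
    }

    // Euclidean distance from x to its projection on the normal subspace.
    public double residual(double[] x) {
        double[] proj = new double[x.length];
        for (double[] c : components) {
            double dot = 0;
            for (int i = 0; i < x.length; i++) dot += c[i] * x[i];
            for (int i = 0; i < x.length; i++) proj[i] += dot * c[i];
        }
        double sq = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - proj[i];
            sq += d * d;
        }
        return Math.sqrt(sq);
    }

    public static void main(String[] args) {
        // Normal behavior varies along the first axis only.
        SubspaceDetector s = new SubspaceDetector(new double[][]{{1, 0, 0}});
        System.out.println(s.residual(new double[]{0, 3, 4})); // → 5.0
    }
}
```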
Training jobs run on Spark and are scheduled automatically, periodically retraining models on a month of activity aggregated at minute granularity.
Service Components
Policy Manager: offers a user‑friendly UI and REST API for creating and managing policies, browsing HDFS/Hive resources, and viewing alerts.
Example policies:
Single‑event policy detecting access to sensitive Hive columns.
from hiveAccessLogStream[sensitivityType=='PHONE_NUMBER'] select * insert into outputStream;
Windowed policy detecting more than five accesses to /tmp/private within ten minutes.
hdfsAuditLogEventStream[(src=='/tmp/private')]#window.externalTime(timestamp, 10 min)
select user, count(timestamp) as aggValue
group by user
having aggValue >= 5
insert into outputStream;
Query Service
Eagle provides a SQL‑like REST API for large‑scale data queries, supporting filtering, aggregation, sorting, and pagination. HBase is the default storage, but JDBC‑compatible databases are also supported.
query=AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef}&pageSize=100000
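For illustration, a client could assemble such a query string programmatically. In this sketch the service path is an assumption and only the query syntax mirrors the example above; check the Eagle REST documentation for the actual endpoint:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical sketch: build the query string for Eagle's SQL-like REST
// API. The "/eagle-service/rest/list" path is an assumption for
// illustration; the query grammar follows the example in the article.
public class EagleQueryBuilder {
    public static String build(String service, String condition,
                               String projection, int pageSize) {
        try {
            String query = service + "[" + condition + "]{" + projection + "}";
            return "/eagle-service/rest/list?query="
                    + URLEncoder.encode(query, "UTF-8")
                    + "&pageSize=" + pageSize;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("AlertDefinitionService",
                "@dataSource=\"hiveQueryLog\"", "@policyDef", 100000));
    }
}
```

URL-encoding the query is the important detail: the bracket and brace characters in the query grammar are not safe to send raw in a URL.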
Future Plans
Eagle is being extended to monitor node health, Hadoop application performance, and core services, with automation for node repair and resource utilization optimization.
Upcoming features include expanded machine‑learning support for Hive and HBase, APIs for integration with external monitoring tools (e.g., Ganglia, Nagios), and additional modules for HBase monitoring, Hadoop job performance, and node health.
Efficient Ops
This public account is maintained by Xiaotianguo and friends. It regularly publishes widely read original technical articles with a focus on operations transformation, accompanying readers throughout their operations careers.