
How eBay Cut Hadoop Job Runtime by 60%: Real‑World CAL Log Optimization

This article explains how eBay's CAL team reduced Hadoop MapReduce job execution time and resource consumption by over 60% through targeted GC tuning, data‑skew mitigation, and algorithmic improvements, boosting job success rates to nearly 100% while handling petabyte‑scale log data.

dbaplus Community

Background

eBay's Central Application Logging (CAL) system collects petabyte-scale logs from many applications and generates reports via Hadoop MapReduce jobs. The reports provide API latency percentiles, service call graphs, and database operation metrics. Growing log volume and shared cluster usage led to long runtimes, high resource consumption, and a job success rate of only 92.5%.

Why Optimize

Before optimization, CAL jobs consumed about 50% of the Hadoop cluster's resources. For nine hours each day, however, they were limited to only 19% of cluster resources, which led to long queues. The job success rate was 92.5%, and execution times varied widely.

Optimization Goals

The team focused on two dimensions: reducing execution time and lowering resource usage.

1. Execution Time

Job duration is dominated by the slowest Mapper and Reducer tasks. The team modeled execution time (T) as a function of Mapper/Reducer task counts and durations, and identified three levers:

Reducing garbage-collection (GC) overhead

Avoiding data skew in Mappers and Reducers

Improving algorithms
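To see why the slowest task matters, the sketch below estimates job time under a deliberately simple assumption: all tasks in a phase start together, and the reduce phase begins only after the last Mapper finishes. This is an illustrative model, not the formula the CAL team used; the class name, method names, and durations are hypothetical.

```java
import java.util.Arrays;

class JobTimeSketch {
    // Longest task in a phase; 0 for an empty phase.
    static long slowest(long[] taskSeconds) {
        return Arrays.stream(taskSeconds).max().orElse(0L);
    }

    // Assumed model: job time = slowest Mapper + slowest Reducer (single wave per phase).
    static long estimateJobSeconds(long[] mapperSeconds, long[] reducerSeconds) {
        return slowest(mapperSeconds) + slowest(reducerSeconds);
    }

    public static void main(String[] args) {
        long[] balancedMaps = {60, 60, 60, 60};  // 240 s of total map work
        long[] skewedMaps = {30, 30, 30, 150};   // same 240 s, but one straggler
        long[] reducers = {120, 120};
        System.out.println(estimateJobSeconds(balancedMaps, reducers)); // 180
        System.out.println(estimateJobSeconds(skewedMaps, reducers));   // 270
    }
}
```

Both Mapper sets do the same total work, yet the skewed set takes 90 seconds longer under this model, which is why skew and GC stalls on a single task can dominate the whole job.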

2. Resource Usage

Memory usage (R) correlates with Mapper/Reducer container sizes and task counts. Reducing the number of tasks, shrinking container memory, and shortening job duration were targeted.
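One rough way to reason about these three targets is a memory-time product: tasks times container size times task duration. The sketch below uses that product as an assumed stand-in for R; the article does not give the team's actual formula, and every number here is hypothetical.

```java
class ResourceModelSketch {
    // Memory-time of one phase in MB-seconds: tasks x container size x average task duration.
    static long phaseMbSeconds(int tasks, int containerMb, long avgTaskSeconds) {
        return (long) tasks * containerMb * avgTaskSeconds;
    }

    // Assumed job-level figure: map phase plus reduce phase.
    static long jobMbSeconds(int mappers, int mapMb, long mapSeconds,
                             int reducers, int reduceMb, long reduceSeconds) {
        return phaseMbSeconds(mappers, mapMb, mapSeconds)
             + phaseMbSeconds(reducers, reduceMb, reduceSeconds);
    }

    public static void main(String[] args) {
        // Hypothetical before/after settings: fewer Mappers, smaller containers, shorter tasks.
        long before = jobMbSeconds(2000, 4096, 120, 200, 8192, 600);
        long after = jobMbSeconds(1000, 2048, 90, 200, 4096, 300);
        System.out.printf("before=%d MB-s, after=%d MB-s%n", before, after);
    }
}
```

Because the three factors multiply, halving the task count or the container size each halves the phase's memory-time in this model.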

Solution Details

GC Tuning

Excessive GC caused "GC overhead limit exceeded" errors and Out-of-Memory failures. In the Mapper, the entire CAL record tree was previously kept in memory; the team changed this to retain only the required metrics. In the Reducer, they switched the composite key from timestamp+metric to metric+timestamp, so that all values for one metric arrive together. This reduced in-memory data from 3N to 3 entries per metric and dramatically cut GC time.
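The effect of the key order can be illustrated without Hadoop. The sketch below sorts the same samples by each composite-key order and counts how many metrics must have aggregation state open at once. The article does not say what the three retained entries are, so this only models the number of open metrics; the `Sample` shape and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class KeyOrderSketch {
    // Hypothetical record shape; the article does not give the CAL schema.
    static final class Sample {
        final String metric;
        final long timestamp;
        Sample(String metric, long timestamp) {
            this.metric = metric;
            this.timestamp = timestamp;
        }
    }

    static final Comparator<Sample> TIMESTAMP_THEN_METRIC =
        Comparator.comparingLong((Sample s) -> s.timestamp).thenComparing(s -> s.metric);
    static final Comparator<Sample> METRIC_THEN_TIMESTAMP =
        Comparator.comparing((Sample s) -> s.metric).thenComparingLong(s -> s.timestamp);

    // Builds one sample per (metric, timestamp) pair.
    static List<Sample> grid(int metrics, int timestamps) {
        List<Sample> out = new ArrayList<>();
        for (int m = 0; m < metrics; m++) {
            for (int t = 0; t < timestamps; t++) {
                out.add(new Sample("metric-" + m, t));
            }
        }
        return out;
    }

    // Peak number of metrics with open state when input arrives in the given key order.
    // State is released as soon as a metric's last sample has been seen.
    static int peakOpenMetrics(List<Sample> input, Comparator<Sample> keyOrder) {
        List<Sample> sorted = new ArrayList<>(input);
        sorted.sort(keyOrder);
        Map<String, Integer> remaining = new HashMap<>();
        for (Sample s : sorted) {
            remaining.merge(s.metric, 1, Integer::sum);
        }
        Set<String> open = new HashSet<>();
        int peak = 0;
        for (Sample s : sorted) {
            open.add(s.metric);
            peak = Math.max(peak, open.size());
            if (remaining.merge(s.metric, -1, Integer::sum) == 0) {
                open.remove(s.metric); // last sample seen: emit result, release state
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        List<Sample> input = grid(3, 4);
        System.out.println(peakOpenMetrics(input, TIMESTAMP_THEN_METRIC)); // 3
        System.out.println(peakOpenMetrics(input, METRIC_THEN_TIMESTAMP)); // 1
    }
}
```

With timestamp first, every metric stays open until the final timestamp; with metric first, each metric can be finished and discarded before the next begins, so fewer live objects survive to be scanned by the collector.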

Data‑Skew Mitigation

Mapper skew was addressed by using CombineFileInputFormat to merge small files into 256 MB splits, halving the number of Mapper tasks. Reducer skew was reduced by changing the partitioning strategy from hashing only the report name to hashing both report name and metric name.
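Both ideas can be sketched in plain Java. The first method greedily packs small files into splits of at most a given size, a simplification of what Hadoop's CombineFileInputFormat does (the real class also weighs node and rack locality). The second pair of methods contrasts hashing the report name alone with hashing report and metric together, using the same sign-bit masking as Hadoop's default HashPartitioner. The class, method names, and sample values are hypothetical and not taken from the CAL code.

```java
import java.util.Arrays;
import java.util.Objects;

class SkewSketch {
    // Greedy packing: start a new split when the next file would overflow the current one.
    static int countCombinedSplits(long[] fileSizes, long maxSplitBytes) {
        int splits = 0;
        long current = 0;
        for (long size : fileSizes) {
            if (current > 0 && current + size > maxSplitBytes) {
                splits++;
                current = 0;
            }
            current += size;
        }
        return current > 0 ? splits + 1 : splits;
    }

    // All metrics of one report land on the same Reducer.
    static int partitionByReport(String report, int numReducers) {
        return (report.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Metrics of one report spread across Reducers.
    static int partitionByReportAndMetric(String report, String metric, int numReducers) {
        return (Objects.hash(report, metric) & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] smallFiles = new long[8];
        Arrays.fill(smallFiles, 64 * mb); // eight 64 MB files: eight Mappers if uncombined
        System.out.println(countCombinedSplits(smallFiles, 256 * mb)); // 2

        for (int i = 0; i < 3; i++) {
            String metric = "metric-" + i;
            System.out.println(metric
                + " report-only=" + partitionByReport("latency", 10)
                + " report+metric=" + partitionByReportAndMetric("latency", metric, 10));
        }
    }
}
```

In a real job the partition logic would live in a subclass of `org.apache.hadoop.mapreduce.Partitioner`, overriding `getPartition(key, value, numPartitions)`.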

Algorithmic Improvements

The team observed that job time is proportional to input record count. By caching parsed SQL logs and avoiding repeated parsing of referenced logs, they reduced the runtime of a large job (dataset B) from 8 minutes to 4 minutes. They also adjusted job parameters based on log type (SQL vs. event logs).
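The caching idea can be sketched as a small bounded cache in front of the parser. The article does not describe the team's cache structure, so keying by raw SQL text, the LRU bound, and the `normalize` stand-in parser are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class ParsedSqlCache {
    private final Function<String, String> parser;
    private final Map<String, String> cache;
    private int parseCalls = 0;

    ParsedSqlCache(int maxEntries, Function<String, String> parser) {
        this.parser = parser;
        // Access-ordered LinkedHashMap evicts the least recently used entry,
        // keeping the cache bounded so it does not add GC pressure of its own.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the parsed form, invoking the parser only on a cache miss.
    String parse(String rawSql) {
        String parsed = cache.get(rawSql);
        if (parsed == null) {
            parseCalls++;
            parsed = parser.apply(rawSql);
            cache.put(rawSql, parsed);
        }
        return parsed;
    }

    int parseCalls() {
        return parseCalls;
    }

    // Hypothetical stand-in for an expensive SQL parser.
    static String normalize(String sql) {
        return sql.trim().replaceAll("\\s+", " ").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ParsedSqlCache cache = new ParsedSqlCache(1000, ParsedSqlCache::normalize);
        for (int i = 0; i < 5; i++) {
            cache.parse("select  *  from orders");
        }
        System.out.println(cache.parseCalls()); // 1
    }
}
```

When many log records reference the same SQL statement, the parser runs once per distinct statement rather than once per record.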

Results

After the three-pronged optimization, 95th-percentile job execution time dropped from 90 minutes to about 40 minutes. Resource usage fell from 50% to 19% of the cluster, equivalent to freeing over 200 Hadoop nodes, and the success rate rose from 92.5% to roughly 99.9%.

Conclusion

Effective Hadoop optimization requires careful GC management, data‑skew handling, and algorithmic tuning. Monitoring key KPIs such as success rate, resource usage, and job duration is essential for continuous improvement in large‑scale log processing pipelines.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
