Optimizing Hadoop and Hive Jobs with Filters, Projections, and Predicate Pushdown
The article explains how applying filters, projections, and predicate pushdown in Hadoop and Hive reduces data volume, speeds up MapReduce jobs, and improves performance, while also covering join limitations and providing a Java Mapper example for practical implementation.
In traditional OLAP systems, applying filters and projections during joins greatly improves performance. The same principle applies to Hadoop: filtering out unneeded records and projecting only the needed fields reduces the amount of data processed, which lowers network, disk, CPU, and memory load and can significantly speed up MapReduce jobs.
The example Mapper in this article drops users younger than 30 and, for the users that remain, emits only the name and state, illustrating how to apply filters and projections as close to the data source as possible.
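When joining datasets, the field needed for filtering may be absent from one side of the join, so that side cannot be filtered directly. In that case a Bloom filter, typically built from the join keys of the side that can be filtered, offers an alternative way to discard records that cannot match.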
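The Bloom filter idea can be sketched in plain Java. This is a minimal illustration, not Hadoop's own Bloom filter class: the names `SimpleBloomFilter` and `keepPossibleMatches`, the tab-separated record layout with the join key first, and the hash scheme are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Illustrative Bloom filter for pre-filtering join keys (hypothetical helper, not a Hadoop API). */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    SimpleBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // FNV-1a hash, used as the second base hash.
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Record a join key from the side that was already filtered. */
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    /** false means the key is definitely absent; true means it may be present. */
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** Keep only records whose join key (first tab-separated field) may have a match. */
    static List<String> keepPossibleMatches(List<String> records, SimpleBloomFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            String key = record.split("\t", 2)[0];
            if (filter.mightContain(key)) {
                kept.add(record);
            }
        }
        return kept;
    }
}
```

In a join, the filter would be filled with the keys of the users that passed the age filter and then consulted while reading the other dataset. Because a Bloom filter can report false positives but never false negatives, the join itself still has to verify each surviving record.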
Predicate pushdown is a logical optimization that pushes filter predicates down to the data source. With a format such as Parquet, the reader can then skip entire row groups or files whose statistics rule out a match; with relational databases and other external sources, it reduces the amount of data transferred.
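The row-group skipping described here can be sketched in plain Java. This is a conceptual model only and does not use the Parquet API: `RowGroup`, `ScanResult`, and `scanAgeAtLeast` are hypothetical names, and each row group is assumed to carry min/max statistics for a single `age` column.

```java
import java.util.ArrayList;
import java.util.List;

class RowGroupSkippingSketch {
    /** Hypothetical row group: a batch of age values plus min/max statistics. */
    static final class RowGroup {
        final int[] ages;
        final int minAge;
        final int maxAge;

        RowGroup(int... ages) {
            this.ages = ages;
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int age : ages) {
                min = Math.min(min, age);
                max = Math.max(max, age);
            }
            this.minAge = min;
            this.maxAge = max;
        }
    }

    /** Matching values plus a count of the row groups that were actually read. */
    static final class ScanResult {
        final List<Integer> matches = new ArrayList<>();
        int rowGroupsRead = 0;
    }

    /** Evaluate the pushed-down predicate "age >= threshold" against every row group. */
    static ScanResult scanAgeAtLeast(List<RowGroup> groups, int threshold) {
        ScanResult result = new ScanResult();
        for (RowGroup group : groups) {
            // The statistics prove no row can match, so the whole group is skipped unread.
            if (group.maxAge < threshold) {
                continue;
            }
            result.rowGroupsRead++;
            for (int age : group.ages) {
                if (age >= threshold) {
                    result.matches.add(age);
                }
            }
        }
        return result;
    }
}
```

The saving comes from the skipped groups: their rows are never decoded, so the predicate costs only a comparison against the stored statistics.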
The figures in the original article illustrate how projection and predicate pushdown reduce data size and improve job performance, especially with columnar formats such as Parquet. Note that Avro is a row-oriented format rather than a columnar one, so it benefits less from this kind of column pruning.
Additional notes: Hive supports only equi-joins in the ON clause of an inner join. A non-equi join can instead be expressed as a cross join with the non-equality condition placed in the WHERE clause. Hive treats a join as a cross join when it uses the CROSS JOIN keyword, has no ON clause, or has an ON clause that is always true (e.g., 1=1).
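The semantics of a non-equi join written as a cross join plus a WHERE filter can be sketched as a nested loop in plain Java. The `orders` and `tiers` tables, their columns, and the class names are hypothetical and serve only to make the pattern concrete.

```java
import java.util.ArrayList;
import java.util.List;

class NonEquiJoinSketch {
    // Conceptually equivalent HiveQL (hypothetical tables and columns):
    //   SELECT o.id, t.name
    //   FROM orders o CROSS JOIN tiers t
    //   WHERE o.amount >= t.min_amount AND o.amount < t.max_amount;

    static final class Order {
        final String id;
        final int amount;

        Order(String id, int amount) {
            this.id = id;
            this.amount = amount;
        }
    }

    static final class Tier {
        final String name;
        final int minAmount;
        final int maxAmount;

        Tier(String name, int minAmount, int maxAmount) {
            this.name = name;
            this.minAmount = minAmount;
            this.maxAmount = maxAmount;
        }
    }

    /** Pair every order with every tier, then keep the pairs that satisfy the range condition. */
    static List<String> join(List<Order> orders, List<Tier> tiers) {
        List<String> out = new ArrayList<>();
        for (Order order : orders) {
            for (Tier tier : tiers) { // cross join: every combination is considered
                // WHERE clause: the non-equi condition filters the combinations.
                if (order.amount >= tier.minAmount && order.amount < tier.maxAmount) {
                    out.add(order.id + ":" + tier.name);
                }
            }
        }
        return out;
    }
}
```

Because every combination of rows is examined before the filter applies, this pattern is expensive on large inputs, which is one more reason to filter and project both sides before joining.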
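Example Mapper that applies the filter and the projection:

    // Requires java.io.IOException, org.apache.hadoop.io.LongWritable,
    // org.apache.hadoop.io.Text, and org.apache.hadoop.mapreduce.Mapper.
    // Nested inside the job class; User is a helper class whose definition is not shown.
    public static class JoinMap extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            User user = User.fromText(value);
            // Filter: keep only users aged 30 or older.
            if (user.getAge() >= 30) {
                // Projection: emit only the name and state fields.
                context.write(new Text(user.getName()), new Text(user.getState()));
            }
        }
    }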
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.