Optimizing Hadoop and Hive Jobs with Filters, Projections, and Predicate Pushdown

The article explains how applying filters, projections, and predicate pushdown in Hadoop and Hive reduces data volume, speeds up MapReduce jobs, and improves performance, while also covering join limitations and providing a Java Mapper example for practical implementation.

Big Data Technology & Architecture

Apr 9, 2020

In traditional OLAP systems, applying filters and projections before a join greatly improves performance. The same principle applies to Hadoop: the less data a MapReduce job has to read, shuffle, and write, the lower its network, disk, CPU, and memory load, and the faster it runs.

The Mapper example in this article keeps only users aged 30 or older and emits just their name and state. It illustrates how to apply a filter (the age check) and a projection (two fields out of the full record) as close to the data source as possible.

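When joining two datasets, the field you want to filter on may be absent from one of them. In that case a Bloom filter can be used as an alternative: it is built from the join keys of the side that can be filtered, and the other side then drops records whose key is definitely not in the filter.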
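The Bloom filter idea can be sketched in plain Java. This is a minimal illustration, not the article's code: the class name `BloomFilterSketch`, the bit-array size, the number of hashes, and the double-hashing scheme are all assumptions chosen for brevity. Hadoop ships its own implementation in `org.apache.hadoop.util.bloom.BloomFilter`, which is likely the better choice in a real job.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Minimal Bloom filter sketch; sizing and hashing are illustrative assumptions. */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** FNV-1a over the key bytes, used as the second base hash. */
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    /** Derives the i-th bit position from two base hashes (double hashing). */
    private int position(String key, int i) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    /** False means "definitely absent"; true means "possibly present". */
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Side A has the age field: add the names of users aged 30 or older.
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("anne");
        filter.add("bob");

        // Side B (for example, log records keyed by user name) has no age field,
        // so it is pre-filtered by join key instead.
        List<String> logKeys = Arrays.asList("anne", "carol", "bob", "dave");
        List<String> kept = new ArrayList<>();
        for (String key : logKeys) {
            if (filter.mightContain(key)) {
                kept.add(key);
            }
        }
        // "anne" and "bob" are always kept; other keys pass only on a false positive.
        System.out.println(kept);
    }
}
```

Because a Bloom filter can return false positives but never false negatives, the join itself still has to check keys exactly; the filter only reduces how many records reach it.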

Predicate pushdown is a logical optimization that pushes filter predicates down to the data source. With a storage format such as Parquet, the reader can skip entire row groups or files that cannot contain matching rows; with a relational database as an external source, the filter runs inside the database, so less data is transferred.
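The row-group skipping described above can be sketched as follows. This is a simplified model under stated assumptions, not Parquet's reader API: `RowGroup`, `ScanResult`, and `scanAgeAtLeast` are hypothetical names, and each group holds a single age column with min/max statistics computed when the group is created.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of predicate pushdown using per-row-group min/max statistics. */
final class RowGroupSkippingSketch {

    /** A block of rows plus statistics for its age column (hypothetical layout). */
    static final class RowGroup {
        final int[] ages;
        final int minAge;
        final int maxAge;

        RowGroup(int... ages) {
            this.ages = ages;
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int age : ages) {
                min = Math.min(min, age);
                max = Math.max(max, age);
            }
            this.minAge = min;
            this.maxAge = max;
        }
    }

    /** Matching values plus a count of how many row groups were actually read. */
    static final class ScanResult {
        final List<Integer> matches = new ArrayList<>();
        int groupsRead;
    }

    /** Evaluates the predicate "age >= threshold", skipping groups that cannot match. */
    static ScanResult scanAgeAtLeast(List<RowGroup> groups, int threshold) {
        ScanResult result = new ScanResult();
        for (RowGroup group : groups) {
            if (group.maxAge < threshold) {
                continue; // statistics prove no row matches, so the group is never read
            }
            result.groupsRead++;
            for (int age : group.ages) {
                if (age >= threshold) {
                    result.matches.add(age);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = Arrays.asList(
            new RowGroup(18, 22, 25),
            new RowGroup(28, 31, 45),
            new RowGroup(50, 62));
        ScanResult result = scanAgeAtLeast(groups, 30);
        System.out.println("groups read: " + result.groupsRead);
        System.out.println("matches: " + result.matches);
    }
}
```

Skipping works best when the data is sorted or clustered on the filtered column; if every group spans the full range of values, the statistics rule nothing out.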

The article's figures illustrate how projection and predicate pushdown reduce data size and improve job performance, especially with columnar formats such as Parquet, where a reader can fetch only the columns a query needs. Avro, by contrast, is a row-oriented format.

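Additional notes: in older Hive releases (before 2.2.0, according to the Hive language manual), the ON clause of a join accepts only equality conditions, so only equi-joins are supported directly. A non-equi join can still be expressed as a cross join with the comparison placed in the WHERE clause. A join is a cross join when it uses the CROSS JOIN keyword, has no ON clause, or has an ON clause that is always true (e.g., 1=1).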
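To make the cross-join workaround concrete, the sketch below models it in plain Java rather than HiveQL. The data, the `Band` type, and the method name are hypothetical. The nested loops play the role of the cross join, and the `if` plays the role of the WHERE clause carrying the non-equality condition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: a non-equi join expressed as a cross join followed by a filter. */
final class NonEquiJoinSketch {

    /** A labelled half-open age range [low, high); hypothetical lookup table row. */
    static final class Band {
        final String label;
        final int low;
        final int high;

        Band(String label, int low, int high) {
            this.label = label;
            this.low = low;
            this.high = high;
        }
    }

    /**
     * Roughly what a query of the form
     * "users CROSS JOIN bands WHERE age >= low AND age < high" computes.
     */
    static List<String> joinAgesToBands(int[] ages, List<Band> bands) {
        List<String> out = new ArrayList<>();
        for (int age : ages) {            // cross join: every (age, band) pair
            for (Band band : bands) {
                if (age >= band.low && age < band.high) { // WHERE: range condition
                    out.add(age + ":" + band.label);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Band> bands = Arrays.asList(
            new Band("young", 0, 30),
            new Band("adult", 30, 65));
        System.out.println(joinAgesToBands(new int[] {25, 30, 70}, bands));
    }
}
```

The cost is the catch: a cross join considers every pair of rows, so the work grows with the product of the two input sizes. Filtering and projecting both inputs first matters even more here than for an equi-join.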

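import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper example: nested inside the job's driver class. User is the article's own record type.
public static class JoinMap extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text value, Context context)
        throws IOException, InterruptedException {
        User user = User.fromText(value);
        // Filter: keep only users aged 30 or older.
        if (user.getAge() >= 30) {
            // Projection: emit only the name (key) and state (value).
            context.write(new Text(user.getName()), new Text(user.getState()));
        }
    }
}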

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
