Learning Spark Operations with Java Stream Concepts: Map, FlatMap, GroupBy, Reduce Examples
This article shows how familiar Java Stream operations such as map, flatMap, groupBy, and reduce carry over to Spark's Dataset API, with step-by-step code examples, an explanation of transformation versus action operators, and practical tips for handling exceptions in distributed data processing.
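Throughout, keep in mind that Spark transformations (map, flatMap, groupBy) are lazy and only execute when an action (show, count, reduce) fires. Java Streams behave the same way with intermediate versus terminal operations, which a small sketch can demonstrate (class and method names here are illustrative, not from the examples below):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    // Counts how many times the mapper actually ran.
    static final AtomicInteger CALLS = new AtomicInteger();

    static Stream<Integer> transformed() {
        // Intermediate op (like a Spark transformation): nothing runs yet.
        return Stream.of(1, 2, 3).map(n -> { CALLS.incrementAndGet(); return n * 2; });
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = transformed();
        System.out.println("mapper calls before terminal op: " + CALLS.get()); // 0
        // Terminal op (like a Spark action) triggers the work.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("mapper calls after terminal op: " + CALLS.get());  // 3
        System.out.println(result); // [2, 4, 6]
    }
}
```

The same mental model explains why a Spark map alone produces no output until show() or collect() is called.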
Preparation
Test data representing name, age, department, and position is prepared in a simple CSV format.
Zhang San,20,R&D Dept,Staff
Li Si,31,R&D Dept,Staff
Li Li,36,Finance Dept,Staff
Zhang Wei,38,R&D Dept,Manager
Du Hang,25,HR Dept,Staff
Zhou Ge,28,R&D Dept,Staff
An Employee class with Lombok annotations is defined.
@Getter
@Setter
@AllArgsConstructor
@NoArgsConstructor
@ToString
static class Employee implements Serializable {
    private String name;
    private Integer age;
    private String department;
    private String level;
}
Map Operations
Java Stream map
Read the file and map each line to an Employee object.
List<String> list = FileUtils.readLines(new File("f:/test.txt"), "utf-8");
List<Employee> employeeList = list.stream().map(word -> {
    List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
    return new Employee(words.get(0), Integer.parseInt(words.get(1)), words.get(2), words.get(3));
}).collect(Collectors.toList());
employeeList.forEach(System.out::println);
Spark map
Create a SparkSession, read the text file, and map each row to an Employee using a MapFunction.
SparkSession session = SparkSession.builder().master("local[*]").getOrCreate();
Dataset<Row> reader = session.read().text("F:/test.txt");
Dataset<Employee> employeeDataset = reader.map(new MapFunction<Row, Employee>() {
    @Override
    public Employee call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        return new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3));
    }
}, Encoders.bean(Employee.class));
employeeDataset.show();
MapPartitions Function
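Because mapPartitions hands you the raw iterator loop, it is also a natural place to apply the exception-handling tip mentioned in the introduction: guard each parse so one malformed line does not fail the whole job. A plain-Java sketch of the guard (safeAge is a hypothetical helper; the same pattern drops bad records inside the partition loop):

```java
import java.util.ArrayList;
import java.util.List;

public class SafeParseDemo {
    // Returns the parsed age, or null for a malformed line.
    static Integer safeAge(String line) {
        String[] f = line.split(",");
        if (f.length < 2) return null;
        try {
            return Integer.parseInt(f[1].trim());
        } catch (NumberFormatException e) {
            return null; // bad record: skip it rather than fail the whole partition
        }
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Zhang San,20,R&D Dept,Staff", "broken line", "Li Si,abc,R&D Dept,Staff");
        List<Integer> ages = new ArrayList<>();
        for (String line : lines) {          // same shape as the iterator loop below
            Integer age = safeAge(line);
            if (age != null) ages.add(age);  // keep only well-formed records
        }
        System.out.println(ages); // [20]
    }
}
```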
Processes an entire partition (a batch of rows) in one call.
Dataset<Employee> employeeDataset2 = reader.mapPartitions(new MapPartitionsFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Iterator<Row> iterator) throws Exception {
        List<Employee> employeeList = new ArrayList<>();
        while (iterator.hasNext()) {
            Row row = iterator.next();
            List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
            employeeList.add(new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3)));
        }
        return employeeList.iterator();
    }
}, Encoders.bean(Employee.class));
employeeDataset2.show();
FlatMap Operations
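flatMap turns each input element into zero or more output elements, flattening the results into a single stream. A minimal stdlib sketch before the Employee examples (names here are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlatMapPrimer {
    // Each comma-separated string expands into several tokens; flatMap flattens them.
    static List<String> tokens(List<String> lines) {
        return lines.stream()
                .flatMap(s -> Arrays.stream(s.split(",")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens(List.of("a,b", "c"))); // [a, b, c]
    }
}
```

The examples below use exactly this expansion, emitting two Employee objects per input line.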
Java Stream flatMap
Maps one input line to two Employee objects.
List<Employee> employeeList2 = list.stream().flatMap(word -> {
    List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
    List<Employee> lists = new ArrayList<>();
    Employee e1 = new Employee(words.get(0), Integer.parseInt(words.get(1)), words.get(2), words.get(3));
    Employee e2 = new Employee(words.get(0) + "_2", Integer.parseInt(words.get(1)), words.get(2), words.get(3));
    lists.add(e1);
    lists.add(e2);
    return lists.stream();
}).collect(Collectors.toList());
employeeList2.forEach(System.out::println);
Spark flatMap
Implements FlatMapFunction to emit multiple Employee objects per row.
Dataset<Employee> employeeDatasetFlatmap = reader.flatMap(new FlatMapFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        List<Employee> employeeList = new ArrayList<>();
        employeeList.add(new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3)));
        employeeList.add(new Employee(list.get(0) + "_2", Integer.parseInt(list.get(1)), list.get(2), list.get(3)));
        return employeeList.iterator();
    }
}, Encoders.bean(Employee.class));
employeeDatasetFlatmap.show();
GroupBy Operations
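Collectors.groupingBy accepts any downstream collector, not just counting; averaging, for instance, mirrors Spark's avg("age") used later in this section. A minimal sketch with a stand-in record:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupAvgDemo {
    record Emp(String department, int age) {}

    // Average age per department; the downstream collector plays the role of Spark's avg("age").
    static Map<String, Double> avgAgeByDept(List<Emp> emps) {
        return emps.stream()
                .collect(Collectors.groupingBy(Emp::department, Collectors.averagingInt(Emp::age)));
    }

    public static void main(String[] args) {
        List<Emp> emps = List.of(new Emp("R&D", 20), new Emp("R&D", 30), new Emp("Finance", 36));
        System.out.println(avgAgeByDept(emps)); // e.g. {R&D=25.0, Finance=36.0} (map order unspecified)
    }
}
```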
Java Stream groupBy
Counts employees per department.
Map<String, Long> map = employeeList.stream()
        .collect(Collectors.groupingBy(Employee::getDepartment, Collectors.counting()));
System.out.println(map);
Spark groupBy
Groups by department and computes count and average age.
RelationalGroupedDataset ds = employeeDataset.groupBy("department");
ds.count().show();
ds.avg("age").withColumnRenamed("avg(age)", "avgAge").show();
Reduce Operations
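reduce folds an entire collection into a single value with an associative accumulator. A minimal stdlib sketch on a few of the ages from the test data:

```java
import java.util.List;

public class ReducePrimer {
    // Fold all ages into one total with an associative function.
    static int totalAge(List<Integer> ages) {
        return ages.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(totalAge(List.of(20, 31, 36))); // 87
    }
}
```

Spark's Dataset.reduce works the same way, except the accumulator combines whole Employee objects rather than plain integers, as the example below shows.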
Shows how to aggregate ages using reduce in both Java Stream and Spark.
// Java Stream sum of ages
int totalAge = employeeList.stream().mapToInt(e -> e.getAge()).sum();
// Spark reduce to accumulate ages into a single Employee
Employee datasetReduce = employeeDataset.reduce(new ReduceFunction<Employee>() {
    @Override
    public Employee call(Employee t1, Employee t2) throws Exception {
        t2.setAge(t1.getAge() + t2.getAge()); // accumulate the running total into t2
        return t2;
    }
});
System.out.println(datasetReduce);
Other Common Operations
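The Spark chain below, filter then limit then sort, has a direct Stream analog. A sketch on the ages from the test data (class and method names are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilterSortLimitDemo {
    // Stream analog of filter("age > 30").sort("age").limit(3)
    static List<Integer> topOver30(List<Integer> ages) {
        return ages.stream()
                .filter(a -> a > 30)
                .sorted()
                .limit(3)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topOver30(List.of(20, 31, 36, 38, 25, 28))); // [31, 36, 38]
    }
}
```

Note the ordering difference: in a Stream, filter/sorted/limit run in the order written, whereas Spark's Catalyst optimizer may reorder the equivalent Dataset operators.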
Filtering, limiting, sorting, and using temporary tables with SQL.
Employee e = employeeDataset.filter("age > 30").limit(3).sort("age").first();
employeeDataset.createOrReplaceTempView("table"); // registerTempTable is deprecated since Spark 2.0
session.sql("select * from table where age > 30 order by age desc limit 3").show();
Conclusion
The article provides a simple introduction to Spark operators by leveraging the similarity with Java Stream APIs, encouraging backend developers to try big‑data development locally with only Maven dependencies.
Promotion
The author asks readers to like, follow, share, and collect the article, and promotes a paid knowledge community offering Spring, MyBatis, DDD, RocketMQ and other advanced topics.
Code Ape Tech Column
Former Ant Group P8 engineer and pure technologist, sharing full-stack Java content, interview preparation, and career advice through a column. Site: java-family.cn