Big Data · 25 min read

Learning Spark Operations with Java Stream Concepts: Map, FlatMap, GroupBy, Reduce Examples

This article shows how the familiar Java Stream operations map, flatMap, groupBy, and reduce carry over to Spark, with step-by-step code examples, notes on transformation versus action operators, and practical tips for handling exceptions in distributed data processing.

Code Ape Tech Column

Preparation

Test data representing name, age, department, and position is prepared in a simple CSV format.

Zhang San,20,R&D,Staff
Li Si,31,R&D,Staff
Li Li,36,Finance,Staff
Zhang Wei,38,R&D,Manager
Du Hang,25,HR,Staff
Zhou Ge,28,R&D,Staff

An Employee class with Lombok annotations is defined.

@Getter
@Setter
@AllArgsConstructor
@NoArgsConstructor
@ToString
static class Employee implements Serializable {
private String name;
private Integer age;
private String department;
private String level;
}
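For readers not using Lombok, the annotations above can be hand-expanded. A minimal sketch of the equivalent plain-Java class (only setAge is shown among the setters, since the reduce example later calls it; the others are analogous):

```java
import java.io.Serializable;

// Hand-written equivalent of the Lombok-annotated Employee above.
// Serializable is required because Spark ships instances between executors.
class Employee implements Serializable {
    private String name;
    private Integer age;
    private String department;
    private String level;

    public Employee() {}  // matches @NoArgsConstructor

    public Employee(String name, Integer age, String department, String level) {
        this.name = name;
        this.age = age;
        this.department = department;
        this.level = level;
    }

    public String getName() { return name; }
    public Integer getAge() { return age; }
    public String getDepartment() { return department; }
    public String getLevel() { return level; }

    public void setAge(Integer age) { this.age = age; }  // other setters analogous

    @Override
    public String toString() {
        return "Employee(name=" + name + ", age=" + age
                + ", department=" + department + ", level=" + level + ")";
    }
}
```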

Map Operations

Java Stream map

Read the file and map each line to an Employee object.

List<String> list = FileUtils.readLines(new File("f:/test.txt"), "utf-8");
List<Employee> employeeList = list.stream().map(word -> {
    List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
    return new Employee(words.get(0), Integer.parseInt(words.get(1)), words.get(2), words.get(3));
}).collect(Collectors.toList());
employeeList.forEach(System.out::println);
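One caveat worth noting here: Integer.parseInt throws NumberFormatException on a malformed line, and inside a lambda that unchecked exception aborts the whole pipeline. A minimal defensive-parsing sketch (the class and method names are my own, and skipping bad rows is just one possible policy):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class SafeParseDemo {
    // Parse one age field, returning empty instead of throwing on bad input.
    static Optional<Integer> parseAge(String rawAge) {
        try {
            return Optional.of(Integer.parseInt(rawAge.trim()));
        } catch (NumberFormatException ex) {
            return Optional.empty();  // malformed age: caller can skip this row
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Zhang San,20", "bad row,??");
        List<Integer> ages = lines.stream()
                .map(line -> line.split(",")[1])
                .map(SafeParseDemo::parseAge)
                .filter(Optional::isPresent)   // drop rows that failed to parse
                .map(Optional::get)
                .collect(Collectors.toList());
        System.out.println(ages);  // prints [20]
    }
}
```

The same try/catch pattern works unchanged inside a Spark MapFunction, where a single bad row would otherwise fail the whole task.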

Spark map

Create a SparkSession, read the text file, and map each row to an Employee using a MapFunction.

SparkSession session = SparkSession.builder().master("local[*]").getOrCreate();
Dataset<Row> reader = session.read().text("F:/test.txt");
Dataset<Employee> employeeDataset = reader.map(new MapFunction<Row, Employee>() {
    @Override
    public Employee call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        return new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3));
    }
}, Encoders.bean(Employee.class));
employeeDataset.show();

MapPartitions Function

mapPartitions processes a whole partition (a batch of rows) per call rather than one row at a time, which is useful when per-batch setup, such as opening a database connection, is expensive.

Dataset<Employee> employeeDataset2 = reader.mapPartitions(new MapPartitionsFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Iterator<Row> iterator) throws Exception {
        List<Employee> employeeList = new ArrayList<>();
        while (iterator.hasNext()) {
            Row row = iterator.next();
            List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
            employeeList.add(new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3)));
        }
        return employeeList.iterator();
    }
}, Encoders.bean(Employee.class));
employeeDataset2.show();

FlatMap Operations

Java Stream flatMap

Maps one input line to two Employee objects.

List<Employee> employeeList2 = list.stream().flatMap(word -> {
    List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
    List<Employee> lists = new ArrayList<>();
    Employee e1 = new Employee(words.get(0), Integer.parseInt(words.get(1)), words.get(2), words.get(3));
    Employee e2 = new Employee(words.get(0) + "_2", Integer.parseInt(words.get(1)), words.get(2), words.get(3));
    lists.add(e1);
    lists.add(e2);
    return lists.stream();
}).collect(Collectors.toList());
employeeList2.forEach(System.out::println);

Spark flatMap

Implements FlatMapFunction to emit multiple Employee objects per row.

Dataset<Employee> employeeDatasetFlatmap = reader.flatMap(new FlatMapFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        List<Employee> employeeList = new ArrayList<>();
        employeeList.add(new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3)));
        employeeList.add(new Employee(list.get(0) + "_2", Integer.parseInt(list.get(1)), list.get(2), list.get(3)));
        return employeeList.iterator();
    }
}, Encoders.bean(Employee.class));
employeeDatasetFlatmap.show();

GroupBy Operations

Java Stream groupBy

Counts employees per department.

Map<String, Long> map = employeeList.stream()
    .collect(Collectors.groupingBy(Employee::getDepartment, Collectors.counting()));
System.out.println(map);
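Spark's avg("age") in the next snippet also has a direct Stream counterpart via Collectors.averagingInt. A small sketch with inlined (department, age) pairs standing in for the Employee list (the class name GroupAvgDemo is mine):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GroupAvgDemo {
    // Average of the age column grouped by department:
    // the Stream counterpart of Spark's groupBy("department").avg("age")
    static Map<String, Double> averageByDept(List<String[]> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> r[0],
                Collectors.averagingInt(r -> Integer.parseInt(r[1]))));
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"R&D", "20"},
                new String[]{"R&D", "38"},
                new String[]{"Finance", "36"});
        System.out.println(averageByDept(rows));  // R&D -> 29.0, Finance -> 36.0
    }
}
```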

Spark groupBy

Groups by department and computes count and average age.

RelationalGroupedDataset ds = employeeDataset.groupBy("department");
ds.count().show();
ds.avg("age").withColumnRenamed("avg(age)", "avgAge").show();

Reduce Operations

Shows how to aggregate ages using reduce in both Java Stream and Spark.

// Java Stream sum of ages
int totalAge = employeeList.stream().mapToInt(e -> e.getAge()).sum();
// Spark reduce to accumulate ages into a single Employee
Employee datasetReduce = employeeDataset.reduce(new ReduceFunction<Employee>() {
    @Override
    public Employee call(Employee t1, Employee t2) throws Exception {
        t2.setAge(t1.getAge() + t2.getAge());
        return t2;
    }
});
System.out.println(datasetReduce);
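The Java Stream line above uses mapToInt(...).sum(); for closer symmetry with Spark's reduce, the same total can be computed with Stream.reduce, which takes an identity value and an associative accumulator. A minimal sketch (the class name is mine, the ages are those from the test data):

```java
import java.util.Arrays;
import java.util.List;

class StreamReduceDemo {
    // Stream.reduce with identity 0 and Integer::sum as the accumulator
    static int sumAges(List<Integer> ages) {
        return ages.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> ages = Arrays.asList(20, 31, 36, 38, 25, 28);
        System.out.println(sumAges(ages));  // prints 178
    }
}
```

Like Spark's reduce, the accumulator must be associative, because both runtimes are free to combine partial results in any order.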

Other Common Operations

Filtering, limiting, sorting, and using temporary tables with SQL.

Employee e = employeeDataset.filter("age > 30").limit(3).sort("age").first();
// registerTempTable is deprecated; createOrReplaceTempView is its replacement
employeeDataset.createOrReplaceTempView("employee");
session.sql("select * from employee where age > 30 order by age desc limit 3").show();
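The filter/sort/limit chain has a close Java Stream counterpart. A rough sketch (the class name is mine; note that with Streams the usual order is sort first, then limit, whereas the Spark snippet above limits before sorting):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class FilterSortLimitDemo {
    // Ages over 30, ascending, at most three results
    static List<Integer> topOver30(List<Integer> ages) {
        return ages.stream()
                .filter(a -> a > 30)
                .sorted(Comparator.naturalOrder())
                .limit(3)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> ages = Arrays.asList(20, 31, 36, 38, 25, 28);
        System.out.println(topOver30(ages));  // prints [31, 36, 38]
    }
}
```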

Conclusion

The article provides a simple introduction to Spark operators by leveraging the similarity with Java Stream APIs, encouraging backend developers to try big‑data development locally with only Maven dependencies.


Tags: Big Data, Spark, Java Stream, Map, FlatMap, GroupBy, Reduce
Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
