Learning Spark Operations with Java Stream Concepts: Map, FlatMap, GroupBy, Reduce Examples
This article shows how familiar Java Stream operations such as map, flatMap, groupBy, and reduce carry over to Spark's Dataset API, with step-by-step code examples, an explanation of transformation versus action operators, and practical tips for handling exceptions in distributed data processing.
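Throughout, keep in mind that Spark transformations (map, flatMap, groupBy) are lazy and only execute when an action (show, count, reduce) fires. Java Streams behave the same way with intermediate versus terminal operations, which a small sketch can demonstrate (class and method names here are illustrative, not from the examples below):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    // Counts how many times the mapper actually ran.
    static final AtomicInteger CALLS = new AtomicInteger();

    static Stream<Integer> transformed() {
        // Intermediate op (like a Spark transformation): nothing runs yet.
        return Stream.of(1, 2, 3).map(n -> { CALLS.incrementAndGet(); return n * 2; });
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = transformed();
        System.out.println("mapper calls before terminal op: " + CALLS.get()); // 0
        // Terminal op (like a Spark action) triggers the work.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("mapper calls after terminal op: " + CALLS.get());  // 3
        System.out.println(result); // [2, 4, 6]
    }
}
```

The same mental model explains why a Spark map alone produces no output until show() or collect() is called.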
Preparation
Test data representing name, age, department, and position is prepared in a simple CSV format.
Zhang San,20,R&D Dept,Staff
Li Si,31,R&D Dept,Staff
Li Li,36,Finance Dept,Staff
Zhang Wei,38,R&D Dept,Manager
Du Hang,25,HR Dept,Staff
Zhou Ge,28,R&D Dept,Staff
An Employee class with Lombok annotations is defined.
@Getter
@Setter
@AllArgsConstructor
@NoArgsConstructor
@ToString
static class Employee implements Serializable {
    private String name;
    private Integer age;
    private String department;
    private String level;
}
Map Operations
Java Stream map
Read the file and map each line to an Employee object.
List<String> list = FileUtils.readLines(new File("f:/test.txt"), "utf-8");
List<Employee> employeeList = list.stream().map(word -> {
    List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
    return new Employee(words.get(0), Integer.parseInt(words.get(1)), words.get(2), words.get(3));
}).collect(Collectors.toList());
employeeList.forEach(System.out::println);
Spark map
Create a SparkSession, read the text file, and map each row to an Employee using a MapFunction.
SparkSession session = SparkSession.builder().master("local[*]").getOrCreate();
Dataset<Row> reader = session.read().text("F:/test.txt");
Dataset<Employee> employeeDataset = reader.map(new MapFunction<Row, Employee>() {
    @Override
    public Employee call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        return new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3));
    }
}, Encoders.bean(Employee.class));
employeeDataset.show();
MapPartitions Function
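Because mapPartitions hands you the raw iterator loop, it is also a natural place to apply the exception-handling tip mentioned in the introduction: guard each parse so one malformed line does not fail the whole job. A plain-Java sketch of the guard (safeAge is a hypothetical helper; the same pattern drops bad records inside the partition loop):

```java
import java.util.ArrayList;
import java.util.List;

public class SafeParseDemo {
    // Returns the parsed age, or null for a malformed line.
    static Integer safeAge(String line) {
        String[] f = line.split(",");
        if (f.length < 2) return null;
        try {
            return Integer.parseInt(f[1].trim());
        } catch (NumberFormatException e) {
            return null; // bad record: skip it rather than fail the whole partition
        }
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Zhang San,20,R&D Dept,Staff", "broken line", "Li Si,abc,R&D Dept,Staff");
        List<Integer> ages = new ArrayList<>();
        for (String line : lines) {          // same shape as the iterator loop below
            Integer age = safeAge(line);
            if (age != null) ages.add(age);  // keep only well-formed records
        }
        System.out.println(ages); // [20]
    }
}
```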
Processes an entire partition (a batch of rows) in one call.
Dataset<Employee> employeeDataset2 = reader.mapPartitions(new MapPartitionsFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Iterator<Row> iterator) throws Exception {
        List<Employee> employeeList = new ArrayList<>();
        while (iterator.hasNext()) {
            Row row = iterator.next();
            List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
            employeeList.add(new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3)));
        }
        return employeeList.iterator();
    }
}, Encoders.bean(Employee.class));
employeeDataset2.show();
FlatMap Operations
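flatMap turns each input element into zero or more output elements, flattening the results into a single stream. A minimal stdlib sketch before the Employee examples (names here are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlatMapPrimer {
    // Each comma-separated string expands into several tokens; flatMap flattens them.
    static List<String> tokens(List<String> lines) {
        return lines.stream()
                .flatMap(s -> Arrays.stream(s.split(",")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens(List.of("a,b", "c"))); // [a, b, c]
    }
}
```

The examples below use exactly this expansion, emitting two Employee objects per input line.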
Java Stream flatMap
Maps one input line to two Employee objects.
List<Employee> employeeList2 = list.stream().flatMap(word -> {
    List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
    List<Employee> lists = new ArrayList<>();
    Employee e1 = new Employee(words.get(0), Integer.parseInt(words.get(1)), words.get(2), words.get(3));
    Employee e2 = new Employee(words.get(0) + "_2", Integer.parseInt(words.get(1)), words.get(2), words.get(3));
    lists.add(e1);
    lists.add(e2);
    return lists.stream();
}).collect(Collectors.toList());
employeeList2.forEach(System.out::println);
Spark flatMap
Implements FlatMapFunction to emit multiple Employee objects per row.
Dataset<Employee> employeeDatasetFlatmap = reader.flatMap(new FlatMapFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        List<Employee> employeeList = new ArrayList<>();
        employeeList.add(new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3)));
        employeeList.add(new Employee(list.get(0) + "_2", Integer.parseInt(list.get(1)), list.get(2), list.get(3)));
        return employeeList.iterator();
    }
}, Encoders.bean(Employee.class));
employeeDatasetFlatmap.show();
GroupBy Operations
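Collectors.groupingBy accepts any downstream collector, not just counting; averaging, for instance, mirrors Spark's avg("age") used later in this section. A minimal sketch with a stand-in record:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupAvgDemo {
    record Emp(String department, int age) {}

    // Average age per department; the downstream collector plays the role of Spark's avg("age").
    static Map<String, Double> avgAgeByDept(List<Emp> emps) {
        return emps.stream()
                .collect(Collectors.groupingBy(Emp::department, Collectors.averagingInt(Emp::age)));
    }

    public static void main(String[] args) {
        List<Emp> emps = List.of(new Emp("R&D", 20), new Emp("R&D", 30), new Emp("Finance", 36));
        System.out.println(avgAgeByDept(emps)); // e.g. {R&D=25.0, Finance=36.0} (map order unspecified)
    }
}
```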
Java Stream groupBy
Counts employees per department.
Map<String, Long> map = employeeList.stream()
        .collect(Collectors.groupingBy(Employee::getDepartment, Collectors.counting()));
System.out.println(map);
Spark groupBy
Groups by department and computes count and average age.
RelationalGroupedDataset ds = employeeDataset.groupBy("department");
ds.count().show();
ds.avg("age").withColumnRenamed("avg(age)", "avgAge").show();
Reduce Operations
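reduce folds an entire collection into a single value with an associative accumulator. A minimal stdlib sketch on a few of the ages from the test data:

```java
import java.util.List;

public class ReducePrimer {
    // Fold all ages into one total with an associative function.
    static int totalAge(List<Integer> ages) {
        return ages.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(totalAge(List.of(20, 31, 36))); // 87
    }
}
```

Spark's Dataset.reduce works the same way, except the accumulator combines whole Employee objects rather than plain integers, as the example below shows.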
Shows how to aggregate ages using reduce in both Java Stream and Spark.
// Java Stream sum of ages
int totalAge = employeeList.stream().mapToInt(e -> e.getAge()).sum();
// Spark reduce to accumulate ages into a single Employee
Employee datasetReduce = employeeDataset.reduce(new ReduceFunction<Employee>() {
    @Override
    public Employee call(Employee t1, Employee t2) throws Exception {
        t2.setAge(t1.getAge() + t2.getAge()); // accumulate the running total into t2
        return t2;
    }
});
System.out.println(datasetReduce);
Other Common Operations
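The Spark chain below, filter then limit then sort, has a direct Stream analog. A sketch on the ages from the test data (class and method names are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilterSortLimitDemo {
    // Stream analog of filter("age > 30").sort("age").limit(3)
    static List<Integer> topOver30(List<Integer> ages) {
        return ages.stream()
                .filter(a -> a > 30)
                .sorted()
                .limit(3)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topOver30(List.of(20, 31, 36, 38, 25, 28))); // [31, 36, 38]
    }
}
```

Note the ordering difference: in a Stream, filter/sorted/limit run in the order written, whereas Spark's Catalyst optimizer may reorder the equivalent Dataset operators.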
Filtering, limiting, sorting, and using temporary tables with SQL.
Employee e = employeeDataset.filter("age > 30").limit(3).sort("age").first();
employeeDataset.createOrReplaceTempView("table"); // registerTempTable is deprecated since Spark 2.0
session.sql("select * from table where age > 30 order by age desc limit 3").show();
Conclusion
The article provides a simple introduction to Spark operators by leveraging the similarity with Java Stream APIs, encouraging backend developers to try big‑data development locally with only Maven dependencies.
Promotion
The author asks readers to like, follow, share, and collect the article, and promotes a paid knowledge community offering Spring, MyBatis, DDD, RocketMQ and other advanced topics.
Code Ape Tech Column
Former Ant Group P8 engineer and pure technologist, sharing full-stack Java content, interview preparation, and career advice through a column. Site: java-family.cn