From Java Streams to Spark: Basic Big Data Operations Explained
This article shows how developers who already know the Java Stream API can quickly pick up fundamental Spark operations (map, flatMap, groupBy, and reduce) by translating familiar stream examples into Spark code, with a complete snippet for each operation.
Preparation
The test data file (test.txt) contains one comma-separated line per employee: name, age, department, and position.
张三,20,R&D,Staff
李四,31,R&D,Staff
李丽,36,Finance,Staff
张伟,38,R&D,Manager
杜航,25,HR,Staff
周歌,28,R&D,Staff

Create an Employee class:
@Getter
@Setter
@AllArgsConstructor
@NoArgsConstructor
@ToString
// Must be public (and static if nested) so Spark's bean encoder can
// instantiate it; Lombok supplies the getters, setters, and constructors.
public static class Employee implements Serializable {
    private String name;
    private Integer age;
    private String department;
    private String level;
}

Map Operations
Java Stream map
List<String> list = FileUtils.readLines(new File("f:/test.txt"), "utf-8");
List<Employee> employeeList = list.stream().map(word -> {
    List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
    Employee employee = new Employee(words.get(0), Integer.parseInt(words.get(1)),
            words.get(2), words.get(3));
    return employee;
}).collect(Collectors.toList());
employeeList.forEach(System.out::println);

Spark map
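Two practical notes before the Spark version. First, map is a lazy transformation: nothing executes until an action such as show() is called, so any parse error surfaces only then, on an executor. Second, both versions feed Integer.parseInt directly, so a single malformed line fails the whole job. A defensive parsing helper is a common fix; the sketch below is plain Java (the name parseAgeOr is made up here), and the same try/catch can sit inside the MapFunction:

```java
public class ParseUtil {
    // Hypothetical helper: parse an int field, falling back to a default
    // instead of throwing NumberFormatException on a malformed line.
    static int parseAgeOr(String raw, int fallback) {
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback; // alternatively: log and drop the record
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAgeOr("20", -1));  // well-formed field
        System.out.println(parseAgeOr("20a", -1)); // malformed field
    }
}
```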
SparkSession session = SparkSession.builder().master("local[*]").getOrCreate();
Dataset<Row> reader = session.read().text("F:/test.txt");
Dataset<Employee> employeeDataset = reader.map(new MapFunction<Row, Employee>() {
    @Override
    public Employee call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        return new Employee(list.get(0), Integer.parseInt(list.get(1)),
                list.get(2), list.get(3));
    }
}, Encoders.bean(Employee.class));
employeeDataset.show();

FlatMap Operations
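flatMap maps each input element to zero or more outputs and flattens the results into a single sequence. Before the employee example, a minimal plain-Java illustration of that flattening:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapDemo {
    public static void main(String[] args) {
        // Each CSV line expands to several tokens; flatMap flattens
        // all the per-line streams into one stream.
        List<String> tokens = Stream.of("a,b", "c,d,e")
                .flatMap(line -> Arrays.stream(line.split(",")))
                .collect(Collectors.toList());
        System.out.println(tokens); // [a, b, c, d, e]
    }
}
```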
Java Stream flatMap
List<Employee> employeeList2 = list.stream().flatMap(word -> {
    List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
    List<Employee> lists = new ArrayList<>();
    Employee e1 = new Employee(words.get(0), Integer.parseInt(words.get(1)),
            words.get(2), words.get(3));
    lists.add(e1);
    Employee e2 = new Employee(words.get(0) + "_2", Integer.parseInt(words.get(1)),
            words.get(2), words.get(3));
    lists.add(e2);
    return lists.stream();
}).collect(Collectors.toList());
employeeList2.forEach(System.out::println);

Spark flatMap
Dataset<Employee> employeeDatasetFlatmap = reader.flatMap(new FlatMapFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        List<Employee> employeeList = new ArrayList<>();
        Employee e1 = new Employee(list.get(0), Integer.parseInt(list.get(1)),
                list.get(2), list.get(3));
        employeeList.add(e1);
        Employee e2 = new Employee(list.get(0) + "_2", Integer.parseInt(list.get(1)),
                list.get(2), list.get(3));
        employeeList.add(e2);
        return employeeList.iterator();
    }
}, Encoders.bean(Employee.class));
employeeDatasetFlatmap.show();

GroupBy Operations
Java Stream groupBy
Map<String, Long> countByDept = employeeList.stream()
        .collect(Collectors.groupingBy(Employee::getDepartment, Collectors.counting()));
System.out.println(countByDept);

Spark groupBy
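The stream version above only counts per department, while the Spark calls below also compute an average age. For comparison, the stream counterpart of that average, sketched over plain (department, age) pairs rather than the Employee class:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupAvgDemo {
    public static void main(String[] args) {
        // (department, age) pairs standing in for the employee list.
        List<String[]> rows = Arrays.asList(
                new String[]{"R&D", "20"},
                new String[]{"R&D", "30"},
                new String[]{"Finance", "36"});
        // groupingBy + averagingInt mirrors groupBy("department").avg("age").
        Map<String, Double> avgByDept = rows.stream().collect(
                Collectors.groupingBy(r -> r[0],
                        Collectors.averagingInt(r -> Integer.parseInt(r[1]))));
        System.out.println(avgByDept);
    }
}
```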
RelationalGroupedDataset group = employeeDataset.groupBy("department");
group.count().show();
group.avg("age").withColumnRenamed("avg(age)", "avgAge").show();

Reduce Operations
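reduce folds all elements into a single value by repeatedly applying a binary function. One caveat worth knowing before the examples: in Spark the function must be associative and commutative, because partial results from different partitions are combined in no fixed order. Integer addition qualifies, as in this minimal plain-Java sketch:

```java
import java.util.Arrays;
import java.util.List;

public class ReduceDemo {
    public static void main(String[] args) {
        List<Integer> ages = Arrays.asList(20, 31, 36);
        // 0 is the identity element; Integer::sum is associative and
        // commutative, so the result is the same in any grouping order.
        int total = ages.stream().reduce(0, Integer::sum);
        System.out.println(total); // 87
    }
}
```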
Java Stream reduce
int totalAge = employeeList.stream().mapToInt(e -> e.getAge()).sum();
System.out.println(totalAge);
Employee reduced = employeeList.stream().reduce(new Employee(),
        (pre, cur) -> {
            // The identity Employee has a null age; skip it on the first step.
            if (pre.getAge() != null) {
                cur.setAge(pre.getAge() + cur.getAge());
            }
            return cur;
        });
System.out.println(reduced);

Spark reduce
Employee datasetReduce = employeeDataset.reduce(new ReduceFunction<Employee>() {
    @Override
    public Employee call(Employee t1, Employee t2) throws Exception {
        t2.setAge(t1.getAge() + t2.getAge());
        return t2;
    }
});
System.out.println(datasetReduce);

Other Common Operations
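Each chained Spark call below has a direct stream counterpart. For comparison, a plain-Java sketch of filter, sorted, and limit over a list of ages (note the Spark snippet applies limit before sort, so the two are not strictly identical):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterSortLimitDemo {
    public static void main(String[] args) {
        List<Integer> ages = Arrays.asList(20, 31, 36, 38, 25, 28);
        // Keep ages over 30, sort ascending, then take the first three.
        List<Integer> top = ages.stream()
                .filter(a -> a > 30)
                .sorted()
                .limit(3)
                .collect(Collectors.toList());
        System.out.println(top); // [31, 36, 38]
    }
}
```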
Employee e = employeeDataset.filter("age > 30").limit(3).sort("age").first();
System.out.println(e);
employeeDataset.createOrReplaceTempView("table"); // registerTempTable is deprecated
session.sql("select * from table where age > 30 order by age desc limit 3").show();

By leveraging the similarity between the Java Stream and Spark APIs, beginners can start big-data development quickly without setting up a full cluster; the Spark Maven dependency and a local master are all that is needed.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn